Commit Graph

62855 Commits

Author SHA1 Message Date
Linus Torvalds
cff11abeca Kbuild updates for v5.8
- fix warnings in 'make clean' for ARCH=um, hexagon, h8300, unicore32
 
  - ensure to rebuild all objects when the compiler is upgraded
 
  - exclude system headers from dependency tracking and fixdep processing
 
  - fix potential bit-size mismatch between the kernel and BPF user-mode
    helper
 
  - add the new syntax 'userprogs' to build user-space programs for the
    target architecture (the same arch as the kernel)
 
  - compile user-space sample code under samples/ for the target arch
    instead of the host arch
 
  - make headers_install fail if a CONFIG option is leaked to user-space
 
  - sanitize the output format of scripts/checkstack.pl
 
  - handle ARM 'push' instruction in scripts/checkstack.pl
 
  - error out before modpost if a module name conflict is found
 
  - error out when multiple directories are passed to M= because this
    feature is broken for a long time
 
  - add CONFIG_DEBUG_INFO_COMPRESSED to support compressed debug info
 
  - a lot of cleanups of modpost
 
  - dump vmlinux symbols out into vmlinux.symvers, and reuse it in the
    second pass of modpost
 
  - do not run the second pass of modpost if nothing in modules is updated
 
  - install modules.builtin(.modinfo) by 'make install' as well as by
    'make modules_install' because it is useful even when CONFIG_MODULES=n
 
  - add new command line variables, GZIP, BZIP2, LZOP, LZMA, LZ4, and XZ
    to allow users to use alternatives such as pigz, pbzip2, etc.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl7brm0VHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGjeEP/Rrf8H9cp/Tq+ALQCBycI3W5ZEHg
 n2EqprZkVP2MlOV0d+8b9t4PdZf6E5Wmfv26sMaBAhl6X1KQI/0NgPMnTINvy5jJ
 Q2SMhj9y8Gwr3XKFu9Hd/0U+Sax5rz+LmY84tdF95dXzPIUWjAEVnbmN+ofY6T++
 sNf2YGNFSR6iiqr3uCYA0hHZmpKlfhVgDPAdncWa5aadSsuQb79nZQWefGeVEsuD
 HrISpwnkhBc0qY1xyWry6agE92xWmkNkdjKq6A7peguZL02XySWLRWjyHoiiaPOB
 6U4urKs/NSXqPgxGxwZthhwERHryC3+g4s8wRBDKE6ISRWKBBA2ruHpgdF5h/utu
 re1ZP2qRcAt8NBFynr4MEb2AU0mYkv7iEgfLJ7NUCRlMOtqrn5RFwnS4r8ReyQp5
 1UM11RbPhYgYjM5g9hBHJ7nK944/kfvy1/4jF4I1+M5O7QL6f00pu3r2bBIa/65g
 DWrNOpIliKG27GgnRlxi7HgLfxs9etFcXTpHO0ymgnMmlz+7FQsdceR9qqybGU9o
 yBWw6zculMQjb3E+k0DTnE5kLWsycbua921wxM9ABSxRmJi7WciNF73RdLUIBoAY
 VUbwrP2aIpdL+2uyX6RqdTaWzEBpW8omszr46aQ96pX+RiqMrPvJRLaA/tr3ZH8g
 tdHenJPWdHSaOcO4
 =GKe5
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - fix warnings in 'make clean' for ARCH=um, hexagon, h8300, unicore32

 - ensure to rebuild all objects when the compiler is upgraded

 - exclude system headers from dependency tracking and fixdep processing

 - fix potential bit-size mismatch between the kernel and BPF user-mode
   helper

 - add the new syntax 'userprogs' to build user-space programs for the
   target architecture (the same arch as the kernel)

 - compile user-space sample code under samples/ for the target arch
   instead of the host arch

 - make headers_install fail if a CONFIG option is leaked to user-space

 - sanitize the output format of scripts/checkstack.pl

 - handle ARM 'push' instruction in scripts/checkstack.pl

 - error out before modpost if a module name conflict is found

 - error out when multiple directories are passed to M= because this
   feature is broken for a long time

 - add CONFIG_DEBUG_INFO_COMPRESSED to support compressed debug info

 - a lot of cleanups of modpost

 - dump vmlinux symbols out into vmlinux.symvers, and reuse it in the
   second pass of modpost

 - do not run the second pass of modpost if nothing in modules is
   updated

 - install modules.builtin(.modinfo) by 'make install' as well as by
   'make modules_install' because it is useful even when
   CONFIG_MODULES=n

 - add new command line variables, GZIP, BZIP2, LZOP, LZMA, LZ4, and XZ
   to allow users to use alternatives such as pigz, pbzip2, etc.

* tag 'kbuild-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (96 commits)
  kbuild: add variables for compression tools
  Makefile: install modules.builtin even if CONFIG_MODULES=n
  mksysmap: Fix the mismatch of '.L' symbols in System.map
  kbuild: doc: rename LDFLAGS to KBUILD_LDFLAGS
  modpost: change elf_info->size to size_t
  modpost: remove is_vmlinux() helper
  modpost: strip .o from modname before calling new_module()
  modpost: set have_vmlinux in new_module()
  modpost: remove mod->skip struct member
  modpost: add mod->is_vmlinux struct member
  modpost: remove is_vmlinux() call in check_for_{gpl_usage,unused}()
  modpost: remove mod->is_dot_o struct member
  modpost: move -d option in scripts/Makefile.modpost
  modpost: remove -s option
  modpost: remove get_next_text() and make {grab,release_}file static
  modpost: use read_text_file() and get_line() for reading text files
  modpost: avoid false-positive file open error
  modpost: fix potential mmap'ed file overrun in get_src_version()
  modpost: add read_text_file() and get_line() helpers
  modpost: do not call get_modinfo() for vmlinux(.o)
  ...
2020-06-06 12:00:25 -07:00
Linus Torvalds
9daa0a27a0 AFS Changes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7ZC5kACgkQ+7dXa6fL
 C2uv9A/+NKlTSXyv2ZuvtmXADelndcXJ+nC+3bwI7Jh43aa8uCCsAVYD0VE+dxor
 Ingj/LUJ2sjjp6RXCeeqqETXCoCVt0zK2g216+An7k84KJ+ms+MDa8dNN7l6280S
 1jw4hnT0+g9Ln6elgqBroV980MJC2NGL0Eaete8zFO8UqYZy5w1ge0HfGck2l45U
 2lr6egCWYSUPmtFKXJnLV8luwRvq7DzvTk9WrJu3kwOjaY1AQP1+1VpdhChJLrRc
 /4Ddy1On5IXiFrPi5OtHA422bfirUpIv2HbmI047W9uiZ05MiXwSvNS1qJLTa1AA
 T/SK88d3FCeSYw3olAne2kEl9uewvGByr98fDKFOcDHZj18abd9/VtUp33RXxYBy
 lN2wqlWP++LlZ4sMCbbvLXX8OB1tekQzWQC0vJ5rhRSgveOlhL9TLG2Y05xokFs+
 AwK8zTlDIZ6Pa/JIHfp2E0ZhXEazWTSmP+d7NkgzF0iiORukvsmxjOVUZC4+UCqK
 rYN6goJ5g8qpejRv5NhfP6/olb1NK33f/F2QSSFfxv9zda4HNlayvcoSnFrdUEnt
 IfBhSKPkeDVWs1yse7glDuw19tHp94B9UYwJ46qfHngQPArgy+gp23d0cSy41Pr5
 FRQ23eNvBWIP4srt1gSCBexSGA1h/ACji41CPTJbF2jg5uWFAUE=
 =YVwD
 -----END PGP SIGNATURE-----

Merge tag 'afs-next-20200604' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull AFS updates from David Howells:
 "There's some core VFS changes which affect a couple of filesystems:

   - Make the inode hash table RCU safe and providing some RCU-safe
     accessor functions. The search can then be done without taking the
     inode_hash_lock. Care must be taken because the object may be being
     deleted and no wait is made.

   - Allow iunique() to avoid taking the inode_hash_lock.

   - Allow AFS's callback processing to avoid taking the inode_hash_lock
     when using the inode table to find an inode to notify.

   - Improve Ext4's time updating. Konstantin Khlebnikov said "For now,
     I've plugged this issue with try-lock in ext4 lazy time update.
     This solution is much better."

  Then there's a set of changes to make a number of improvements to the
  AFS driver:

   - Improve callback (ie. third party change notification) processing
     by:

      (a) Relying more on the fact we're doing this under RCU and by
          using fewer locks. This makes use of the RCU-based inode
          searching outlined above.

      (b) Moving to keeping volumes in a tree indexed by volume ID
          rather than a flat list.

      (c) Making the server and volume records logically part of the
          cell. This means that a server record now points directly at
          the cell and the tree of volumes is there. This removes an N:M
          mapping table, simplifying things.

   - Improve keeping NAT or firewall channels open for the server
     callbacks to reach the client by actively polling the fileserver on
     a timed basis, instead of only doing it when we have an operation
     to process.

   - Improving detection of delayed or lost callbacks by including the
     parent directory in the list of file IDs to be queried when doing a
     bulk status fetch from lookup. We can then check to see if our copy
     of the directory has changed under us without us getting notified.

   - Determine aliasing of cells (such as a cell that is pointed to be a
     DNS alias). This allows us to avoid having ambiguity due to
     apparently different cells using the same volume and file servers.

   - Improve the fileserver rotation to do more probing when it detects
     that all of the addresses to a server are listed as non-responsive.
     It's possible that an address that previously stopped responding
     has become responsive again.

  Beyond that, lay some foundations for making some calls asynchronous:

   - Turn the fileserver cursor struct into a general operation struct
     and hang the parameters off of that rather than keeping them in
     local variables and hang results off of that rather than the call
     struct.

   - Implement some general operation handling code and simplify the
     callers of operations that affect a volume or a volume component
     (such as a file). Most of the operation is now done by core code.

   - Operations are supplied with a table of operations to issue
     different variants of RPCs and to manage the completion, where all
     the required data is held in the operation object, thereby allowing
     these to be called from a workqueue.

   - Put the standard "if (begin), while(select), call op, end" sequence
     into a canned function that just emulates the current behaviour for
     now.

  There are also some fixes interspersed:

   - Don't let the EACCES from ICMP6 mapping reach the user as such,
     since it's confusing as to whether it's a filesystem error. Convert
     it to EHOSTUNREACH.

   - Don't use the epoch value acquired through probing a server. If we
     have two servers with the same UUID but in different cells, it's
     hard to draw conclusions from them having different epoch values.

   - Don't interpret the argument to the CB.ProbeUuid RPC as a
     fileserver UUID and look up a fileserver from it.

   - Deal with servers in different cells having the same UUIDs. In the
     event that a CB.InitCallBackState3 RPC is received, we have to
     break the callback promises for every server record matching that
     UUID.

   - Don't let afs_statfs return values that go below 0.

   - Don't use running fileserver probe state to make server selection
     and address selection decisions on. Only make decisions on final
     state as the running state is cleared at the start of probing"

Acked-by: Al Viro <viro@zeniv.linux.org.uk> (fs/inode.c part)

* tag 'afs-next-20200604' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (27 commits)
  afs: Adjust the fileserver rotation algorithm to reprobe/retry more quickly
  afs: Show more a bit more server state in /proc/net/afs/servers
  afs: Don't use probe running state to make decisions outside probe code
  afs: Fix afs_statfs() to not let the values go below zero
  afs: Fix the by-UUID server tree to allow servers with the same UUID
  afs: Reorganise volume and server trees to be rooted on the cell
  afs: Add a tracepoint to track the lifetime of the afs_volume struct
  afs: Detect cell aliases 3 - YFS Cells with a canonical cell name op
  afs: Detect cell aliases 2 - Cells with no root volumes
  afs: Detect cell aliases 1 - Cells with root volumes
  afs: Implement client support for the YFSVL.GetCellName RPC op
  afs: Retain more of the VLDB record for alias detection
  afs: Fix handling of CB.ProbeUuid cache manager op
  afs: Don't get epoch from a server because it may be ambiguous
  afs: Build an abstraction around an "operation" concept
  afs: Rename struct afs_fs_cursor to afs_operation
  afs: Remove the error argument from afs_protocol_error()
  afs: Set error flag rather than return error from file status decode
  afs: Make callback processing more efficient.
  afs: Show more information in /proc/net/afs/servers
  ...
2020-06-05 16:26:36 -07:00
Linus Torvalds
242b233198 RDMA 5.8 merge window pull request
A few large, long discussed works this time. The RNBD block driver has
 been posted for nearly two years now, and the removal of FMR has been a
 recurring discussion theme for a long time. The usual smattering of
 features and bug fixes.
 
 - Various small driver bugs fixes in rxe, mlx5, hfi1, and efa
 
 - Continuing driver cleanups in bnxt_re, hns
 
 - Big cleanup of mlx5 QP creation flows
 
 - More consistent use of src port and flow label when LAG is used and a
   mlx5 implementation
 
 - Additional set of cleanups for IB CM
 
 - 'RNBD' network block driver and target. This is a network block RDMA
   device specific to ionos's cloud environment. It brings strong multipath
   and resiliency capabilities.
 
 - Accelerated IPoIB for HFI1
 
 - QP/WQ/SRQ ioctl migration for uverbs, and support for multiple async fds
 
 - Support for exchanging the new IBTA defiend ECE data during RDMA CM
   exchanges
 
 - Removal of the very old and insecure FMR interface from all ULPs and
   drivers. FRWR should be preferred for at least a decade now.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAl7X/IwACgkQOG33FX4g
 mxp2uw/+MI2S/aXqEBvZfTT8yrkAwqYezS0VeTDnwH/T6UlTMDhHVN/2Ji3tbbX3
 FEKT1i2mnAL5RqUAL1lr9g4sG/bVozrpN46Ws5Lu9dTbIPLKTNPWDuLFQDUShKY7
 OyMI/bRx6anGnsOy20iiBqnrQbrrZj5TECgnmrkAl62QFdcl7aBWe/yYjy4CT11N
 ub+aBXBREN1F1pc0HIjd2tI+8gnZc+mNm1LVVDRH9Capun/pI26qDNh7e6QwGyIo
 n8ItraC8znLwv/nsUoTE7/JRcsTEe6vJI26PQmczZfNJs/4O65G7fZg0eSBseZYi
 qKf7Uwtb3qW0R7jRUMEgFY4DKXVAA0G2ph40HXBuzOSsqlT6HqYMO2wgG8pJkrTc
 qAjoSJGzfAHIsjxzxKI8wKuufCddjCm30VWWU7EKeriI6h1J0uPVqKkQMfYBTkik
 696eZSBycAVgwayOng3XaehiTxOL7qGMTjUpDjUR6UscbiPG919vP+QsbIUuBXdb
 YoddBQJdyGJiaCXv32ciJjo9bjPRRi/bII7Q5qzCNI2mi4ZVbudF4ffzyQvdHtNJ
 nGnpRXoPi7kMvUrKTMPWkFjj0R5/UsPszsA51zbxPydfgBe0Dlc2PrrIG8dlzYAp
 wbV0Lec+iJucKlt7EZtrjz1xOiOOaQt/5/cW1bWqL+wk2t6gAuY=
 =9zTe
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull rdma updates from Jason Gunthorpe:
 "A more active cycle than most of the recent past, with a few large,
  long discussed works this time.

  The RNBD block driver has been posted for nearly two years now, and
  flowing through RDMA due to it also introducing a new ULP.

  The removal of FMR has been a recurring discussion theme for a long
  time.

  And the usual smattering of features and bug fixes.

  Summary:

   - Various small driver bugs fixes in rxe, mlx5, hfi1, and efa

   - Continuing driver cleanups in bnxt_re, hns

   - Big cleanup of mlx5 QP creation flows

   - More consistent use of src port and flow label when LAG is used and
     a mlx5 implementation

   - Additional set of cleanups for IB CM

   - 'RNBD' network block driver and target. This is a network block
     RDMA device specific to ionos's cloud environment. It brings strong
     multipath and resiliency capabilities.

   - Accelerated IPoIB for HFI1

   - QP/WQ/SRQ ioctl migration for uverbs, and support for multiple
     async fds

   - Support for exchanging the new IBTA defiend ECE data during RDMA CM
     exchanges

   - Removal of the very old and insecure FMR interface from all ULPs
     and drivers. FRWR should be preferred for at least a decade now"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (247 commits)
  RDMA/cm: Spurious WARNING triggered in cm_destroy_id()
  RDMA/mlx5: Return ECE DC support
  RDMA/mlx5: Don't rely on FW to set zeros in ECE response
  RDMA/mlx5: Return an error if copy_to_user fails
  IB/hfi1: Use free_netdev() in hfi1_netdev_free()
  RDMA/hns: Uninitialized variable in modify_qp_init_to_rtr()
  RDMA/core: Move and rename trace_cm_id_create()
  IB/hfi1: Fix hfi1_netdev_rx_init() error handling
  RDMA: Remove 'max_map_per_fmr'
  RDMA: Remove 'max_fmr'
  RDMA/core: Remove FMR device ops
  RDMA/rdmavt: Remove FMR memory registration
  RDMA/mthca: Remove FMR support for memory registration
  RDMA/mlx4: Remove FMR support for memory registration
  RDMA/i40iw: Remove FMR leftovers
  RDMA/bnxt_re: Remove FMR leftovers
  RDMA/mlx5: Remove FMR leftovers
  RDMA/core: Remove FMR pool API
  RDMA/rds: Remove FMR support for memory registration
  RDMA/srp: Remove support for FMR memory registration
  ...
2020-06-05 14:05:57 -07:00
Stefano Garzarella
fdb4276aae vsock/vmci: make vmci_vsock_transport_cb() static
Fix the following gcc-9.3 warning when building with 'make W=1':
    net/vmw_vsock/vmci_transport.c:2058:6: warning: no previous prototype
        for ‘vmci_vsock_transport_cb’ [-Wmissing-prototypes]
     2058 | void vmci_vsock_transport_cb(bool is_host)
          |      ^~~~~~~~~~~~~~~~~~~~~~~

Fixes: b1bba80a43 ("vsock/vmci: register vmci_transport only when VMCI guest/host are active")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-05 13:18:26 -07:00
Dan Carpenter
178f67b128 ethtool: linkinfo: remove an unnecessary NULL check
This code generates a Smatch warning:

    net/ethtool/linkinfo.c:143 ethnl_set_linkinfo()
    warn: variable dereferenced before check 'info' (see line 119)

Fortunately, the "info" pointer is never NULL so the check can be
removed.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-05 13:10:02 -07:00
David Howells
5ac0d62226 rxrpc: Fix missing notification
Under some circumstances, rxrpc will fail a transmit a packet through the
underlying UDP socket (ie. UDP sendmsg returns an error).  This may result
in a call getting stuck.

In the instance being seen, where AFS tries to send a probe to the Volume
Location server, tracepoints show the UDP Tx failure (in this case returing
error 99 EADDRNOTAVAIL) and then nothing more:

 afs_make_vl_call: c=0000015d VL.GetCapabilities
 rxrpc_call: c=0000015d NWc u=1 sp=rxrpc_kernel_begin_call+0x106/0x170 [rxrpc] a=00000000dd89ee8a
 rxrpc_call: c=0000015d Gus u=2 sp=rxrpc_new_client_call+0x14f/0x580 [rxrpc] a=00000000e20e4b08
 rxrpc_call: c=0000015d SEE u=2 sp=rxrpc_activate_one_channel+0x7b/0x1c0 [rxrpc] a=00000000e20e4b08
 rxrpc_call: c=0000015d CON u=2 sp=rxrpc_kernel_begin_call+0x106/0x170 [rxrpc] a=00000000e20e4b08
 rxrpc_tx_fail: c=0000015d r=1 ret=-99 CallDataNofrag

The problem is that if the initial packet fails and the retransmission
timer hasn't been started, the call is set to completed and an error is
returned from rxrpc_send_data_packet() to rxrpc_queue_packet().  Though
rxrpc_instant_resend() is called, this does nothing because the call is
marked completed.

So rxrpc_notify_socket() isn't called and the error is passed back up to
rxrpc_send_data(), rxrpc_kernel_send_data() and thence to afs_make_call()
and afs_vl_get_capabilities() where it is simply ignored because it is
assumed that the result of a probe will be collected asynchronously.

Fileserver probing is similarly affected via afs_fs_get_capabilities().

Fix this by always issuing a notification in __rxrpc_set_call_completion()
if it shifts a call to the completed state, even if an error is also
returned to the caller through the function return value.

Also put in a little bit of optimisation to avoid taking the call
state_lock and disabling softirqs if the call is already in the completed
state and remove some now redundant rxrpc_notify_socket() calls.

Fixes: f5c17aaeb2 ("rxrpc: Calls should only have one terminal state")
Reported-by: Gerry Seidman <gerry@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
2020-06-05 13:36:35 +01:00
David Howells
3067bf8c59 rxrpc: Move the call completion handling out of line
Move the handling of call completion out of line so that the next patch can
add more code in that area.

Signed-off-by: David Howells <dhowells@redhat.com>
Reviewed-by: Marc Dionne <marc.dionne@auristor.com>
2020-06-05 13:36:35 +01:00
Johannes Berg
523f3ec030 mac80211: initialize return flags in HE 6 GHz operation parsing
Dan points out that if ieee80211_chandef_he_6ghz_oper() succeeds,
we don't initialize 'ret'. Initialize it to 0 in this case, since
everything went fine and nothing has to be disabled.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Fixes: 57fa5e85d5 ("mac80211: determine chandef from HE 6 GHz operation")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200603111500.bd2a5ff37b83.I2c3f338ce343b581db493eb9a0d988d1b626c8fb@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-06-05 14:33:51 +02:00
Johannes Berg
79ea1e12c0 cfg80211: fix management registrations deadlock
Lockdep reports that we may deadlock because we take the RTNL on
the work struct, but flush it under RTNL. Clearly, it's correct.
In practice, this can happen when doing rfkill on an active device.

Fix this by moving the work struct to the wiphy (registered dev)
layer, and iterate over all the wdevs inside there. This then
means we need to track which one of them has work to do, so we
don't update to the driver for all wdevs all the time.

Also fix a locking bug I noticed while working on this - the
registrations list is iterated as if it was an RCU list, but it
isn't handle that way - and we need to lock now for the update
flag anyway, so remove the RCU.

Fixes: 6cd536fe62 ("cfg80211: change internal management frame registration API")
Reported-by: Markus Theil <markus.theil@tu-ilmenau.de>
Reported-and-tested-by: Kenneth R. Crudup <kenny@panix.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200604120420.b1dc540a7e26.I55dcca56bb5bdc5d7ad66a36a0b42afd7034d8be@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-06-05 09:22:00 +02:00
Stephen Rothwell
a4902d914e xfrm: merge fixup for "remove output_finish indirection from xfrm_state_afinfo"
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-06-05 08:10:08 +02:00
Pavel Machek
7d877c35ca net/xdp: use shift instead of 64 bit division
64bit division is kind of expensive, and shift should do the job here.

Signed-off-by: Pavel Machek (CIP) <pavel@denx.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-04 16:02:58 -07:00
Paolo Abeni
6761893eea inet_connection_sock: clear inet_num out of destroy helper
Clearing the 'inet_num' field is necessary and safe if and
only if the socket is not bound. The MPTCP protocol calls
the destroy helper on bound sockets, as tcp_v{4,6}_syn_recv_sock
completed successfully.

Move the clearing of such field out of the common code, otherwise
the MPTCP MP_JOIN error path will find the wrong 'inet_num' value
on socket disposal, __inet_put_port() will acquire the wrong lock
and bind_node removal could race with other modifiers possibly
corrupting the bind hash table.

Reported-and-tested-by: Christoph Paasch <cpaasch@apple.com>
Fixes: 729cd6436f ("mptcp: cope better with MP_JOIN failure")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-04 15:59:56 -07:00
Ahmed S. Darwish
11d6011c2c net: core: device_rename: Use rwsem instead of a seqcount
Sequence counters write paths are critical sections that must never be
preempted, and blocking, even for CONFIG_PREEMPTION=n, is not allowed.

Commit 5dbe7c178d ("net: fix kernel deadlock with interface rename and
netdev name retrieval.") handled a deadlock, observed with
CONFIG_PREEMPTION=n, where the devnet_rename seqcount read side was
infinitely spinning: it got scheduled after the seqcount write side
blocked inside its own critical section.

To fix that deadlock, among other issues, the commit added a
cond_resched() inside the read side section. While this will get the
non-preemptible kernel eventually unstuck, the seqcount reader is fully
exhausting its slice just spinning -- until TIF_NEED_RESCHED is set.

The fix is also still broken: if the seqcount reader belongs to a
real-time scheduling policy, it can spin forever and the kernel will
livelock.

Disabling preemption over the seqcount write side critical section will
not work: inside it are a number of GFP_KERNEL allocations and mutex
locking through the drivers/base/ :: device_rename() call chain.

>From all the above, replace the seqcount with a rwsem.

Fixes: 5dbe7c178d (net: fix kernel deadlock with interface rename and netdev name retrieval.)
Fixes: 30e6c9fa93 (net: devnet_rename_seq should be a seqcount)
Fixes: c91f6df2db (sockopt: Change getsockopt() of SO_BINDTODEVICE to return an interface name)
Cc: <stable@vger.kernel.org>
Reported-by: kbuild test robot <lkp@intel.com> [ v1 missing up_read() on error exit ]
Reported-by: Dan Carpenter <dan.carpenter@oracle.com> [ v1 missing up_read() on error exit ]
Signed-off-by: Ahmed S. Darwish <a.darwish@linutronix.de>
Reviewed-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-04 15:50:42 -07:00
Ahmed Abdelsalam
bb986a5042 seg6: fix seg6_validate_srh() to avoid slab-out-of-bounds
The seg6_validate_srh() is used to validate SRH for three cases:

case1: SRH of data-plane SRv6 packets to be processed by the Linux kernel.
Case2: SRH of the netlink message received  from user-space (iproute2)
Case3: SRH injected into packets through setsockopt

In case1, the SRH can be encoded in the Reduced way (i.e., first SID is
carried in DA only and not represented as SID in the SRH) and the
seg6_validate_srh() now handles this case correctly.

In case2 and case3, the SRH shouldn’t be encoded in the Reduced way
otherwise we lose the first segment (i.e., the first hop).

The current implementation of the seg6_validate_srh() allow SRH of case2
and case3 to be encoded in the Reduced way. This leads a slab-out-of-bounds
problem.

This patch verifies SRH of case1, case2 and case3. Allowing case1 to be
reduced while preventing SRH of case2 and case3 from being reduced .

Reported-by: syzbot+e8c028b62439eac42073@syzkaller.appspotmail.com
Reported-by: YueHaibing <yuehaibing@huawei.com>
Fixes: 0cb7498f23 ("seg6: fix SRH processing to comply with RFC8754")
Signed-off-by: Ahmed Abdelsalam <ahabdels@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-04 15:39:32 -07:00
Tuong Lien
5e9eeccc58 tipc: fix NULL pointer dereference in streaming
syzbot found the following crash:

general protection fault, probably for non-canonical address 0xdffffc0000000019: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x00000000000000c8-0x00000000000000cf]
CPU: 1 PID: 7060 Comm: syz-executor394 Not tainted 5.7.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__tipc_sendstream+0xbde/0x11f0 net/tipc/socket.c:1591
Code: 00 00 00 00 48 39 5c 24 28 48 0f 44 d8 e8 fa 3e db f9 48 b8 00 00 00 00 00 fc ff df 48 8d bb c8 00 00 00 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 e2 04 00 00 48 8b 9b c8 00 00 00 48 b8 00 00 00
RSP: 0018:ffffc90003ef7818 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff8797fd9d
RDX: 0000000000000019 RSI: ffffffff8797fde6 RDI: 00000000000000c8
RBP: ffff888099848040 R08: ffff88809a5f6440 R09: fffffbfff1860b4c
R10: ffffffff8c305a5f R11: fffffbfff1860b4b R12: ffff88809984857e
R13: 0000000000000000 R14: ffff888086aa4000 R15: 0000000000000000
FS:  00000000009b4880(0000) GS:ffff8880ae700000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000140 CR3: 00000000a7fdf000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 tipc_sendstream+0x4c/0x70 net/tipc/socket.c:1533
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 ____sys_sendmsg+0x32f/0x810 net/socket.c:2352
 ___sys_sendmsg+0x100/0x170 net/socket.c:2406
 __sys_sendmmsg+0x195/0x480 net/socket.c:2496
 __do_sys_sendmmsg net/socket.c:2525 [inline]
 __se_sys_sendmmsg net/socket.c:2522 [inline]
 __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2522
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x440199
...

This bug was bisected to commit 0a3e060f34 ("tipc: add test for Nagle
algorithm effectiveness"). However, it is not the case, the trouble was
from the base in the case of zero data length message sending, we would
unexpectedly make an empty 'txq' queue after the 'tipc_msg_append()' in
Nagle mode.

A similar crash can be generated even without the bisected patch but at
the link layer when it accesses the empty queue.

We solve the issues by building at least one buffer to go with socket's
header and an optional data section that may be empty like what we had
with the 'tipc_msg_build()'.

Note: the previous commit 4c21daae3d ("tipc: Fix NULL pointer
dereference in __tipc_sendstream()") is obsoleted by this one since the
'txq' will be never empty and the check of 'skb != NULL' is unnecessary
but it is safe anyway.

Reported-by: syzbot+8eac6d030e7807c21d32@syzkaller.appspotmail.com
Fixes: c0bceb97db ("tipc: add smart nagle feature")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-04 15:37:59 -07:00
Cong Wang
c36f055591 genetlink: fix memory leaks in genl_family_rcv_msg_dumpit()
There are two kinds of memory leaks in genl_family_rcv_msg_dumpit():

1. Before we call ops->start(), whenever an error happens, we forget
   to free the memory allocated in genl_family_rcv_msg_dumpit().

2. When ops->start() fails, the 'info' has been already installed on
   the per socket control block, so we should not free it here. More
   importantly, nlk->cb_running is still false at this point, so
   netlink_sock_destruct() cannot free it either.

The first kind of memory leaks is easier to resolve, but the second
one requires some deeper thoughts.

After reviewing how netfilter handles this, the most elegant solution
I find is just to use a similar way to allocate the memory, that is,
moving memory allocations from caller into ops->start(). With this,
we can solve both kinds of memory leaks: for 1), no memory allocation
happens before ops->start(); for 2), ops->start() handles its own
failures and 'info' is installed to the socket control block only
when success. The only ugliness here is we have to pass all local
variables on stack via a struct, but this is not hard to understand.

Alternatively, we can introduce a ops->free() to solve this too,
but it is overkill as only genetlink has this problem so far.

Fixes: 1927f41a22 ("net: genetlink: introduce dump info struct to be available during dumpit op")
Reported-by: syzbot+21f04f481f449c8db840@syzkaller.appspotmail.com
Cc: "Jason A. Donenfeld" <Jason@zx2c4.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: YueHaibing <yuehaibing@huawei.com>
Cc: Shaochun Chen <cscnull@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-04 15:33:45 -07:00
Linus Torvalds
9ff7258575 Merge branch 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull proc updates from Eric Biederman:
 "This has four sets of changes:

   - modernize proc to support multiple private instances

   - ensure we see the exit of each process tid exactly

   - remove has_group_leader_pid

   - use pids not tasks in posix-cpu-timers lookup

  Alexey updated proc so each mount of proc uses a new superblock. This
  allows people to actually use mount options with proc with no fear of
  messing up another mount of proc. Given the kernel's internal mounts
  of proc for things like uml this was a real problem, and resulted in
  Android's hidepid mount options being ignored and introducing security
  issues.

  The rest of the changes are small cleanups and fixes that came out of
  my work to allow this change to proc. In essence it is swapping the
  pids in de_thread during exec which removes a special case the code
  had to handle. Then updating the code to stop handling that special
  case"

* 'proc-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
  proc: proc_pid_ns takes super_block as an argument
  remove the no longer needed pid_alive() check in __task_pid_nr_ns()
  posix-cpu-timers: Replace __get_task_for_clock with pid_for_clock
  posix-cpu-timers: Replace cpu_timer_pid_type with clock_pid_type
  posix-cpu-timers: Extend rcu_read_lock removing task_struct references
  signal: Remove has_group_leader_pid
  exec: Remove BUG_ON(has_group_leader_pid)
  posix-cpu-timer:  Unify the now redundant code in lookup_task
  posix-cpu-timer: Tidy up group_leader logic in lookup_task
  proc: Ensure we see the exit of each process tid exactly once
  rculist: Add hlists_swap_heads_rcu
  proc: Use PIDTYPE_TGID in next_tgid
  Use proc_pid_ns() to get pid_namespace from the proc superblock
  proc: use named enums for better readability
  proc: use human-readable values for hidepid
  docs: proc: add documentation for "hidepid=4" and "subset=pid" options and new mount behavior
  proc: add option to mount only a pids subset
  proc: instantiate only pids that we can ptrace on 'hidepid=4' mount option
  proc: allow to mount many instances of proc in one pid namespace
  proc: rename struct proc_fs_info to proc_fs_opts
2020-06-04 13:54:34 -07:00
Matthieu Baerts
49b2357594 bpf: Fix unused-var without NETDEVICES
A recent commit added new variables only used if CONFIG_NETDEVICES is
set. A simple fix would be to only declare these variables if the same
condition is valid but Alexei suggested an even simpler solution:

    since CONFIG_NETDEVICES doesn't change anything in .h I think the
    best is to remove #ifdef CONFIG_NETDEVICES from net/core/filter.c
    and rely on sock_bindtoindex() returning ENOPROTOOPT in the extreme
    case of oddly configured kernels.

Fixes: 70c58997c1 ("bpf: Allow SO_BINDTODEVICE opt in bpf_setsockopt")
Suggested-by: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200603190347.2310320-1-matthieu.baerts@tessares.net
2020-06-04 22:52:10 +02:00
Huy Nguyen
94579ac3f6 xfrm: Fix double ESP trailer insertion in IPsec crypto offload.
During IPsec performance testing, we see bad ICMP checksum. The error packet
has duplicated ESP trailer due to double validate_xmit_xfrm calls. The first call
is from ip_output, but the packet cannot be sent because
netif_xmit_frozen_or_stopped is true and the packet gets dev_requeue_skb. The second
call is from NET_TX softirq. However after the first call, the packet already
has the ESP trailer.

Fix by marking the skb with XFRM_XMIT bit after the packet is handled by
validate_xmit_xfrm to avoid duplicate ESP trailer insertion.

Fixes: f6e27114a6 ("net: Add a xfrm validate function to validate_xmit_skb")
Signed-off-by: Huy Nguyen <huyn@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Reviewed-by: Raed Salem <raeds@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-06-04 10:45:14 +02:00
Linus Torvalds
cb8e59cc87 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz
    Augusto von Dentz.

 2) Add GSO partial support to igc, from Sasha Neftin.

 3) Several cleanups and improvements to r8169 from Heiner Kallweit.

 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a
    device self-test. From Andrew Lunn.

 5) Start moving away from custom driver versions, use the globally
    defined kernel version instead, from Leon Romanovsky.

 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin.

 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet.

 8) Add sriov and vf support to hinic, from Luo bin.

 9) Support Media Redundancy Protocol (MRP) in the bridging code, from
    Horatiu Vultur.

10) Support netmap in the nft_nat code, from Pablo Neira Ayuso.

11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina
    Dubroca. Also add ipv6 support for espintcp.

12) Lots of ReST conversions of the networking documentation, from Mauro
    Carvalho Chehab.

13) Support configuration of ethtool rxnfc flows in bcmgenet driver,
    from Doug Berger.

14) Allow to dump cgroup id and filter by it in inet_diag code, from
    Dmitry Yakunin.

15) Add infrastructure to export netlink attribute policies to
    userspace, from Johannes Berg.

16) Several optimizations to sch_fq scheduler, from Eric Dumazet.

17) Fallback to the default qdisc if qdisc init fails because otherwise
    a packet scheduler init failure will make a device inoperative. From
    Jesper Dangaard Brouer.

18) Several RISCV bpf jit optimizations, from Luke Nelson.

19) Correct the return type of the ->ndo_start_xmit() method in several
    drivers, it's netdev_tx_t but many drivers were using
    'int'. From Yunjian Wang.

20) Add an ethtool interface for PHY master/slave config, from Oleksij
    Rempel.

21) Add BPF iterators, from Yonghang Song.

22) Add cable test infrastructure, including ethool interfaces, from
    Andrew Lunn. Marvell PHY driver is the first to support this
    facility.

23) Remove zero-length arrays all over, from Gustavo A. R. Silva.

24) Calculate and maintain an explicit frame size in XDP, from Jesper
    Dangaard Brouer.

25) Add CAP_BPF, from Alexei Starovoitov.

26) Support terse dumps in the packet scheduler, from Vlad Buslov.

27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei.

28) Add devm_register_netdev(), from Bartosz Golaszewski.

29) Minimize qdisc resets, from Cong Wang.

30) Get rid of kernel_getsockopt and kernel_setsockopt in order to
    eliminate set_fs/get_fs calls. From Christoph Hellwig.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits)
  selftests: net: ip_defrag: ignore EPERM
  net_failover: fixed rollback in net_failover_open()
  Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
  Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
  vmxnet3: allow rx flow hash ops only when rss is enabled
  hinic: add set_channels ethtool_ops support
  selftests/bpf: Add a default $(CXX) value
  tools/bpf: Don't use $(COMPILE.c)
  bpf, selftests: Use bpf_probe_read_kernel
  s390/bpf: Use bcr 0,%0 as tail call nop filler
  s390/bpf: Maintain 8-byte stack alignment
  selftests/bpf: Fix verifier test
  selftests/bpf: Fix sample_cnt shared between two threads
  bpf, selftests: Adapt cls_redirect to call csum_level helper
  bpf: Add csum_level helper for fixing up csum levels
  bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
  sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf()
  crypto/chtls: IPv6 support for inline TLS
  Crypto/chcr: Fixes a coccinile check error
  Crypto/chcr: Fixes compilations warnings
  ...
2020-06-03 16:27:18 -07:00
Linus Torvalds
ae03c53d00 Merge branch 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull splice updates from Al Viro:
 "Christoph's assorted splice cleanups"

* 'work.splice' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: rename pipe_buf ->steal to ->try_steal
  fs: make the pipe_buf_operations ->confirm operation optional
  fs: make the pipe_buf_operations ->steal operation optional
  trace: remove tracing_pipe_buf_ops
  pipe: merge anon_pipe_buf*_ops
  fs: simplify do_splice_from
  fs: simplify do_splice_to
2020-06-03 15:52:19 -07:00
Linus Torvalds
e7c93cbfe9 threads-v5.8
-----BEGIN PGP SIGNATURE-----
 
 iHUEABYKAB0WIQRAhzRXHqcMeLMyaSiRxhvAZXjcogUCXtYhfgAKCRCRxhvAZXjc
 oghSAP9uVX3vxYtEtNvu9WtEn1uYZcSKZoF1YrcgY7UfSmna0gEAruzyZcai4CJL
 WKv+4aRq2oYk+hsqZDycAxIsEgWvNg8=
 =ZWj3
 -----END PGP SIGNATURE-----

Merge tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux

Pull thread updates from Christian Brauner:
 "We have been discussing using pidfds to attach to namespaces for quite
  a while and the patches have in one form or another already existed
  for about a year. But I wanted to wait to see how the general api
  would be received and adopted.

  This contains the changes to make it possible to use pidfds to attach
  to the namespaces of a process, i.e. they can be passed as the first
  argument to the setns() syscall.

  When only a single namespace type is specified the semantics are
  equivalent to passing an nsfd. That means setns(nsfd, CLONE_NEWNET)
  equals setns(pidfd, CLONE_NEWNET).

  However, when a pidfd is passed, multiple namespace flags can be
  specified in the second setns() argument and setns() will attach the
  caller to all the specified namespaces all at once or to none of them.

  Specifying 0 is not valid together with a pidfd. Here are just two
  obvious examples:

    setns(pidfd, CLONE_NEWPID | CLONE_NEWNS | CLONE_NEWNET);
    setns(pidfd, CLONE_NEWUSER);

  Allowing to also attach subsets of namespaces supports various
  use-cases where callers setns to a subset of namespaces to retain
  privilege, perform an action and then re-attach another subset of
  namespaces.

  Apart from significantly reducing the number of syscalls needed to
  attach to all currently supported namespaces (eight "open+setns"
  sequences vs just a single "setns()"), this also allows atomic setns
  to a set of namespaces, i.e. either attaching to all namespaces
  succeeds or we fail without having changed anything.

  This is centered around a new internal struct nsset which holds all
  information necessary for a task to switch to a new set of namespaces
  atomically. Fwiw, with this change a pidfd becomes the only token
  needed to interact with a container. I'm expecting this to be
  picked-up by util-linux for nsenter rather soon.

  Associated with this change is a shiny new test-suite dedicated to
  setns() (for pidfds and nsfds alike)"

* tag 'threads-v5.8' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
  selftests/pidfd: add pidfd setns tests
  nsproxy: attach to namespaces via pidfds
  nsproxy: add struct nsset
2020-06-03 13:12:57 -07:00
Alain Michaud
3096524226 Bluetooth: Removing noisy dbg message
This patch removes a particularly noisy dbg message.  The debug message
isn't particularly interesting for debuggability so it was simply
removed to reduce noise in dbg logs.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-06-03 19:53:25 +02:00
Manish Mandlik
6fb00d4e94 Bluetooth: Check scan state before disabling during suspend
Check current scan state by checking HCI_LE_SCAN flag and send scan
disable command only if scan is already enabled.

Signed-off-by: Manish Mandlik <mmandlik@google.com>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-06-03 19:51:49 +02:00
Linus Torvalds
9d99b1647f audit/stable-5.8 PR 20200601
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl7VnKEUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXMbHA/+PQmrPdzPvkLAjjf1y3LXvyEIAXIQ
 h2r8SxHa7iGyF6vVPz+ya7ux0KAm8wCVdfkokWG5jxjwK7pysS6gx9JzBVK7dbhD
 FsKBSoq9+to9fYlaCyX7vn85C7kK5oGrwS/ECos0BHBpij8ukLgvPQu+PDs7d4xW
 1X2Nrgqnc7M4L8ayzXTQX0fDWcOkapzaN86+R+Lavb4hO/FownaYbuCFn+1mdzux
 ZNBpt3/y1pM6vi5YBkI1rkauBCmkl/YSX/mf/EwDNlQ0XmcadGQ6z7iwjyiE826g
 etCHWD3cgQH7Zzz6CxBNX8Xbq0nIQueHHiFYpVyy9lf4xleFvnfFDebrs8Q9TB6G
 jTWU8okioUKPZyRDaRuIAmCf/LBQRsMkIYTU3w6J0ZqsBycTw3NXPiQArmlxZESM
 HquxWpKoZytRiw581hiSGKNqY+R3FvA+Jroc/7bWfNOE3IdFxegvCsC3giKJf1rY
 AlQitehql9a5jp7A57+477WRYOygYRnd+ntMD5KqR90QSIcQXeg0/lFKhco+zc2p
 bXbWLE+aaOTGCeC+3Eow3T7FMWmrIn6ccKgM84+WT7YQYtRqUYu3RIZbnlYXN7uH
 8xGXT6ccPcEwIjgyF87J0KyGhrbT1N91Jd2jMJkEry9OLAn/yr+pUBQtAa456MMi
 JYevS4atZaUqgvw=
 =iLfC
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
 "Summary of the significant patches:

   - Record information about binds/unbinds to the audit multicast
     socket. This helps identify which processes have/had access to the
     information in the audit stream.

   - Cleanup and add some additional information to the netfilter
     configuration events collected by audit.

   - Fix some of the audit error handling code so we don't leak network
     namespace references"

* tag 'audit-pr-20200601' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: add subj creds to NETFILTER_CFG record to
  audit: Replace zero-length array with flexible-array
  audit: make symbol 'audit_nfcfgs' static
  netfilter: add audit table unregister actions
  audit: tidy and extend netfilter_cfg x_tables
  audit: log audit netlink multicast bind and unbind
  audit: fix a net reference leak in audit_list_rules_send()
  audit: fix a net reference leak in audit_send_reply()
2020-06-02 17:13:37 -07:00
Jason Gunthorpe
649392bf75 RDMA: Remove 'max_fmr'
Now that FMR support is gone, this attribute can be deleted from all
places.

Link: https://lore.kernel.org/r/12-v3-f58e6669d5d3+2cf-fmr_removal_jgg@mellanox.com
Reviewed-by: Max Gurtovoy <maxg@mellanox.com>
Reviewed-by: Bernard Metzler <bmt@zurich.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-06-02 20:32:54 -03:00
Max Gurtovoy
07549ee21c RDMA/rds: Remove FMR support for memory registration
Use FRWR method for memory registration by default and remove the ancient
and unsafe FMR method.

Link: https://lore.kernel.org/r/3-v3-f58e6669d5d3+2cf-fmr_removal_jgg@mellanox.com
Signed-off-by: Max Gurtovoy <maxg@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-06-02 20:32:53 -03:00
Tuong Lien
a275727b18 Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv"
This reverts commit 441870ee42.

Like the previous patch in this series, we revert the above commit that
causes similar issues with the 'aead' object.

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-02 15:13:47 -07:00
Tuong Lien
049fa17f7a Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv"
This reverts commit de05842076.

There is no actual tipc_node refcnt leak as stated in the above commit.
The refcnt is hold carefully for the case of an asynchronous decryption
(i.e. -EINPROGRESS/-EBUSY and skb = NULL is returned), so that the node
object cannot be freed in the meantime. The counter will be re-balanced
when the operation's callback arrives with the decrypted buffer if any.
In other cases, e.g. a synchronous crypto the counter will be decreased
immediately when it is done.

Now with that commit, a kernel panic will occur when there is no node
found (i.e. n = NULL) in the 'tipc_rcv()' or a premature release of the
node object.

This commit solves the issues by reverting the said commit, but keeping
one valid case that the 'skb_linearize()' is failed.

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Tested-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-02 15:13:46 -07:00
Daniel Borkmann
7cdec54f97 bpf: Add csum_level helper for fixing up csum levels
Add a bpf_csum_level() helper which BPF programs can use in combination
with bpf_skb_adjust_room() when they pass in BPF_F_ADJ_ROOM_NO_CSUM_RESET
flag to the latter to avoid falling back to CHECKSUM_NONE.

The bpf_csum_level() allows to adjust CHECKSUM_UNNECESSARY skb->csum_levels
via BPF_CSUM_LEVEL_{INC,DEC} which calls __skb_{incr,decr}_checksum_unnecessary()
on the skb. The helper also allows a BPF_CSUM_LEVEL_RESET which sets the skb's
csum to CHECKSUM_NONE as well as a BPF_CSUM_LEVEL_QUERY to just return the
current level. Without this helper, there is no way to otherwise adjust the
skb->csum_level. I did not add an extra dummy flags as there is plenty of free
bitspace in level argument itself iff ever needed in future.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Acked-by: Lorenz Bauer <lmb@cloudflare.com>
Link: https://lore.kernel.org/bpf/279ae3717cb3d03c0ffeb511493c93c450a01e1a.1591108731.git.daniel@iogearbox.net
2020-06-02 11:50:23 -07:00
Daniel Borkmann
836e66c218 bpf: Fix up bpf_skb_adjust_room helper's skb csum setting
Lorenz recently reported:

  In our TC classifier cls_redirect [0], we use the following sequence of
  helper calls to decapsulate a GUE (basically IP + UDP + custom header)
  encapsulated packet:

    bpf_skb_adjust_room(skb, -encap_len, BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_FIXED_GSO)
    bpf_redirect(skb->ifindex, BPF_F_INGRESS)

  It seems like some checksums of the inner headers are not validated in
  this case. For example, a TCP SYN packet with invalid TCP checksum is
  still accepted by the network stack and elicits a SYN ACK. [...]

  That is, we receive the following packet from the driver:

    | ETH | IP | UDP | GUE | IP | TCP |
    skb->ip_summed == CHECKSUM_UNNECESSARY

  ip_summed is CHECKSUM_UNNECESSARY because our NICs do rx checksum offloading.
  On this packet we run skb_adjust_room_mac(-encap_len), and get the following:

    | ETH | IP | TCP |
    skb->ip_summed == CHECKSUM_UNNECESSARY

  Note that ip_summed is still CHECKSUM_UNNECESSARY. After bpf_redirect()'ing
  into the ingress, we end up in tcp_v4_rcv(). There, skb_checksum_init() is
  turned into a no-op due to CHECKSUM_UNNECESSARY.

The bpf_skb_adjust_room() helper is not aware of protocol specifics. Internally,
it handles the CHECKSUM_COMPLETE case via skb_postpull_rcsum(), but that does
not cover CHECKSUM_UNNECESSARY. In this case skb->csum_level of the original
skb prior to bpf_skb_adjust_room() call was 0, that is, covering UDP. Right now
there is no way to adjust the skb->csum_level. NICs that have checksum offload
disabled (CHECKSUM_NONE) or that support CHECKSUM_COMPLETE are not affected.

Use a safe default for CHECKSUM_UNNECESSARY by resetting to CHECKSUM_NONE and
add a flag to the helper called BPF_F_ADJ_ROOM_NO_CSUM_RESET that allows users
from opting out. Opting out is useful for the case where we don't remove/add
full protocol headers, or for the case where a user wants to adjust the csum
level manually e.g. through bpf_csum_level() helper that is added in subsequent
patch.

The bpf_skb_proto_{4_to_6,6_to_4}() for NAT64/46 translation from the BPF
bpf_skb_change_proto() helper uses bpf_skb_net_hdr_{push,pop}() pair internally
as well but doesn't change layers, only transitions between v4 to v6 and vice
versa, therefore no adoption is required there.

  [0] https://lore.kernel.org/bpf/20200424185556.7358-1-lmb@cloudflare.com/

Fixes: 2be7e212d5 ("bpf: add bpf_skb_adjust_room helper")
Reported-by: Lorenz Bauer <lmb@cloudflare.com>
Reported-by: Alan Maguire <alan.maguire@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Alan Maguire <alan.maguire@oracle.com>
Link: https://lore.kernel.org/bpf/CACAyw9-uU_52esMd1JjuA80fRPHJv5vsSg8GnfW3t_qDU4aVKQ@mail.gmail.com/
Link: https://lore.kernel.org/bpf/11a90472e7cce83e76ddbfce81fdfce7bfc68808.1591108731.git.daniel@iogearbox.net
2020-06-02 11:50:23 -07:00
Christoph Hellwig
88dca4ca5a mm: remove the pgprot argument to __vmalloc
The pgprot argument to __vmalloc is always PAGE_KERNEL now, so remove it.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Reviewed-by: Michael Kelley <mikelley@microsoft.com> [hyperv]
Acked-by: Gao Xiang <xiang@kernel.org> [erofs]
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Wei Liu <wei.liu@kernel.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-22-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:11 -07:00
Christoph Hellwig
ed1f324c5f mm: remove map_vm_range
Switch all callers to map_kernel_range, which symmetric to the unmap side
(as well as the _noflush versions).

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Christophe Leroy <christophe.leroy@c-s.fr>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: David Airlie <airlied@linux.ie>
Cc: Gao Xiang <xiang@kernel.org>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Haiyang Zhang <haiyangz@microsoft.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: "K. Y. Srinivasan" <kys@microsoft.com>
Cc: Laura Abbott <labbott@redhat.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Michael Kelley <mikelley@microsoft.com>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nitin Gupta <ngupta@vflare.org>
Cc: Robin Murphy <robin.murphy@arm.com>
Cc: Sakari Ailus <sakari.ailus@linux.intel.com>
Cc: Stephen Hemminger <sthemmin@microsoft.com>
Cc: Sumit Semwal <sumit.semwal@linaro.org>
Cc: Wei Liu <wei.liu@kernel.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Paul Mackerras <paulus@ozlabs.org>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Link: http://lkml.kernel.org/r/20200414131348.444715-17-hch@lst.de
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-06-02 10:59:11 -07:00
Stefano Stabellini
36f9967531 9p/xen: increase XEN_9PFS_RING_ORDER
Increase XEN_9PFS_RING_ORDER to 9 for performance reason. Order 9 is the
max allowed by the protocol.

We can't assume that all backends will support order 9. The xenstore
property max-ring-page-order specifies the max order supported by the
backend. We'll use max-ring-page-order for the size of the ring.

This means that the size of the ring is not static
(XEN_FLEX_RING_SIZE(9)) anymore. Change XEN_9PFS_RING_SIZE to take an
argument and base the calculation on the order chosen at setup time.

Finally, modify p9_xen_trans.maxsize to be divided by 4 compared to the
original value. We need to divide it by 2 because we have two rings
coming off the same order allocation: the in and out rings. This was a
mistake in the original code. Also divide it further by 2 because we
don't want a single request/reply to fill up the entire ring. There can
be multiple requests/replies outstanding at any given time and if we use
the full ring with one, we risk forcing the backend to wait for the
client to read back more replies before continuing, which is not
performant.

Link: http://lkml.kernel.org/r/20200521193242.15953-1-sstabellini@kernel.org
Signed-off-by: Stefano Stabellini <stefano.stabellini@xilinx.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2020-06-02 08:00:39 +02:00
David S. Miller
9a25c1df24 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-06-01

The following pull-request contains BPF updates for your *net-next* tree.

We've added 55 non-merge commits during the last 1 day(s) which contain
a total of 91 files changed, 4986 insertions(+), 463 deletions(-).

The main changes are:

1) Add rx_queue_mapping to bpf_sock from Amritha.

2) Add BPF ring buffer, from Andrii.

3) Attach and run programs through devmap, from David.

4) Allow SO_BINDTODEVICE opt in bpf_setsockopt, from Ferenc.

5) link based flow_dissector, from Jakub.

6) Use tracing helpers for lsm programs, from Jiri.

7) Several sk_msg fixes and extensions, from John.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 15:53:08 -07:00
Vinay Kumar Yadav
6abde0b241 crypto/chtls: IPv6 support for inline TLS
Extends support to IPv6 for Inline TLS server.

Signed-off-by: Vinay Kumar Yadav <vinay.yadav@chelsio.com>

v1->v2:
- cc'd tcp folks.

v2->v3:
- changed EXPORT_SYMBOL() to EXPORT_SYMBOL_GPL()

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 15:51:25 -07:00
Hangbin Liu
79a1f0ccdb ipv6: fix IPV6_ADDRFORM operation logic
Socket option IPV6_ADDRFORM supports UDP/UDPLITE and TCP at present.
Previously the checking logic looks like:
if (sk->sk_protocol == IPPROTO_UDP || sk->sk_protocol == IPPROTO_UDPLITE)
	do_some_check;
else if (sk->sk_protocol != IPPROTO_TCP)
	break;

After commit b6f6118901 ("ipv6: restrict IPV6_ADDRFORM operation"), TCP
was blocked as the logic changed to:
if (sk->sk_protocol == IPPROTO_UDP || sk->sk_protocol == IPPROTO_UDPLITE)
	do_some_check;
else if (sk->sk_protocol == IPPROTO_TCP)
	do_some_check;
	break;
else
	break;

Then after commit 82c9ae4408 ("ipv6: fix restrict IPV6_ADDRFORM operation")
UDP/UDPLITE were blocked as the logic changed to:
if (sk->sk_protocol == IPPROTO_UDP || sk->sk_protocol == IPPROTO_UDPLITE)
	do_some_check;
if (sk->sk_protocol == IPPROTO_TCP)
	do_some_check;

if (sk->sk_protocol != IPPROTO_TCP)
	break;

Fix it by using Eric's code and simply remove the break in TCP check, which
looks like:
if (sk->sk_protocol == IPPROTO_UDP || sk->sk_protocol == IPPROTO_UDPLITE)
	do_some_check;
else if (sk->sk_protocol == IPPROTO_TCP)
	do_some_check;
else
	break;

Fixes: 82c9ae4408 ("ipv6: fix restrict IPV6_ADDRFORM operation")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 15:47:04 -07:00
YueHaibing
4c21daae3d tipc: Fix NULL pointer dereference in __tipc_sendstream()
tipc_sendstream() may send zero length packet, then tipc_msg_append()
do not alloc skb, skb_peek_tail() will get NULL, msg_set_ack_required
will trigger NULL pointer dereference.

Reported-by: syzbot+8eac6d030e7807c21d32@syzkaller.appspotmail.com
Fixes: 0a3e060f34 ("tipc: add test for Nagle algorithm effectiveness")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 15:33:24 -07:00
Jakub Sitnicki
b27f7bb590 flow_dissector: Move out netns_bpf prog callbacks
Move functions to manage BPF programs attached to netns that are not
specific to flow dissector to a dedicated module named
bpf/net_namespace.c.

The set of functions will grow with the addition of bpf_link support for
netns attached programs. This patch prepares ground by creating a place
for it.

This is a code move with no functional changes intended.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-4-jakub@cloudflare.com
2020-06-01 15:21:02 -07:00
Jakub Sitnicki
a3fd7ceee0 net: Introduce netns_bpf for BPF programs attached to netns
In order to:

 (1) attach more than one BPF program type to netns, or
 (2) support attaching BPF programs to netns with bpf_link, or
 (3) support multi-prog attach points for netns

we will need to keep more state per netns than a single pointer like we
have now for BPF flow dissector program.

Prepare for the above by extracting netns_bpf that is part of struct net,
for storing all state related to BPF programs attached to netns.

Turn flow dissector callbacks for querying/attaching/detaching a program
into generic ones that operate on netns_bpf. Next patch will move the
generic callbacks into their own module.

This is similar to how it is organized for cgroup with cgroup_bpf.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-3-jakub@cloudflare.com
2020-06-01 15:21:02 -07:00
Jakub Sitnicki
171526f6fe flow_dissector: Pull locking up from prog attach callback
Split out the part of attach callback that happens with attach/detach lock
acquired. This structures the prog attach callback in a way that opens up
doors for moving the locking out of flow_dissector and into generic
callbacks for attaching/detaching progs to netns in subsequent patches.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20200531082846.2117903-2-jakub@cloudflare.com
2020-06-01 15:21:02 -07:00
Ferenc Fejes
70c58997c1 bpf: Allow SO_BINDTODEVICE opt in bpf_setsockopt
Extending the supported sockopts in bpf_setsockopt with
SO_BINDTODEVICE. We call sock_bindtoindex with parameter
lock_sk = false in this context because we already owning
the socket.

Signed-off-by: Ferenc Fejes <fejes@inf.elte.hu>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/4149e304867b8d5a606a305bc59e29b063e51f49.1590871065.git.fejes@inf.elte.hu
2020-06-01 14:57:14 -07:00
Ferenc Fejes
8ea204c2b6 net: Make locking in sock_bindtoindex optional
The sock_bindtoindex intended for kernel wide usage however
it will lock the socket regardless of the context. This modification
relax this behavior optionally: locking the socket will be optional
by calling the sock_bindtoindex with lock_sk = true.

The modification applied to all users of the sock_bindtoindex.

Signed-off-by: Ferenc Fejes <fejes@inf.elte.hu>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/bee6355da40d9e991b2f2d12b67d55ebb5f5b207.1590871065.git.fejes@inf.elte.hu
2020-06-01 14:57:14 -07:00
John Fastabend
e91de6afa8 bpf: Fix running sk_skb program types with ktls
KTLS uses a stream parser to collect TLS messages and send them to
the upper layer tls receive handler. This ensures the tls receiver
has a full TLS header to parse when it is run. However, when a
socket has BPF_SK_SKB_STREAM_VERDICT program attached before KTLS
is enabled we end up with two stream parsers running on the same
socket.

The result is both try to run on the same socket. First the KTLS
stream parser runs and calls read_sock() which will tcp_read_sock
which in turn calls tcp_rcv_skb(). This dequeues the skb from the
sk_receive_queue. When this is done KTLS code then data_ready()
callback which because we stacked KTLS on top of the bpf stream
verdict program has been replaced with sk_psock_start_strp(). This
will in turn kick the stream parser again and eventually do the
same thing KTLS did above calling into tcp_rcv_skb() and dequeuing
a skb from the sk_receive_queue.

At this point the data stream is broke. Part of the stream was
handled by the KTLS side some other bytes may have been handled
by the BPF side. Generally this results in either missing data
or more likely a "Bad Message" complaint from the kTLS receive
handler as the BPF program steals some bytes meant to be in a
TLS header and/or the TLS header length is no longer correct.

We've already broke the idealized model where we can stack ULPs
in any order with generic callbacks on the TX side to handle this.
So in this patch we do the same thing but for RX side. We add
a sk_psock_strp_enabled() helper so TLS can learn a BPF verdict
program is running and add a tls_sw_has_ctx_rx() helper so BPF
side can learn there is a TLS ULP on the socket.

Then on BPF side we omit calling our stream parser to avoid
breaking the data stream for the KTLS receiver. Then on the
KTLS side we call BPF_SK_SKB_STREAM_VERDICT once the KTLS
receiver is done with the packet but before it posts the
msg to userspace. This gives us symmetry between the TX and
RX halfs and IMO makes it usable again. On the TX side we
process packets in this order BPF -> TLS -> TCP and on
the receive side in the reverse order TCP -> TLS -> BPF.

Discovered while testing OpenSSL 3.0 Alpha2.0 release.

Fixes: d829e9c411 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/159079361946.5745.605854335665044485.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:48:32 -07:00
John Fastabend
ca2f5f21db bpf: Refactor sockmap redirect code so its easy to reuse
We will need this block of code called from tls context shortly
lets refactor the redirect logic so its easy to use. This also
cleans up the switch stmt so we have fewer fallthrough cases.

No logic changes are intended.

Fixes: d829e9c411 ("tls: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/159079360110.5745.7024009076049029819.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:48:32 -07:00
David Ahern
64b59025c1 xdp: Add xdp_txq_info to xdp_buff
Add xdp_txq_info as the Tx counterpart to xdp_rxq_info. At the
moment only the device is added. Other fields (queue_index)
can be added as use cases arise.

>From a UAPI perspective, add egress_ifindex to xdp context for
bpf programs to see the Tx device.

Update the verifier to only allow accesses to egress_ifindex by
XDP programs with BPF_XDP_DEVMAP expected attach type.

Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200529220716.75383-4-dsahern@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:48:32 -07:00
David Ahern
fbee97feed bpf: Add support to attach bpf program to a devmap entry
Add BPF_XDP_DEVMAP attach type for use with programs associated with a
DEVMAP entry.

Allow DEVMAPs to associate a program with a device entry by adding
a bpf_prog.fd to 'struct bpf_devmap_val'. Values read show the program
id, so the fd and id are a union. bpf programs can get access to the
struct via vmlinux.h.

The program associated with the fd must have type XDP with expected
attach type BPF_XDP_DEVMAP. When a program is associated with a device
index, the program is run on an XDP_REDIRECT and before the buffer is
added to the per-cpu queue. At this point rxq data is still valid; the
next patch adds tx device information allowing the prorgam to see both
ingress and egress device indices.

XDP generic is skb based and XDP programs do not work with skb's. Block
the use case by walking maps used by a program that is to be attached
via xdpgeneric and fail if any of them are DEVMAP / DEVMAP_HASH with

Block attach of BPF_XDP_DEVMAP programs to devices.

Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200529220716.75383-3-dsahern@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:48:32 -07:00
Amritha Nambiar
c3c16f2ea6 bpf: Add rx_queue_mapping to bpf_sock
Add "rx_queue_mapping" to bpf_sock. This gives read access for the
existing field (sk_rx_queue_mapping) of struct sock from bpf_sock.
Semantics for the bpf_sock rx_queue_mapping access are similar to
sk_rx_queue_get(), i.e the value NO_QUEUE_MAPPING is not allowed
and -1 is returned in that case. This is useful for transmit queue
selection based on the received queue index which is cached in the
socket in the receive path.

v3: Addressed review comments to add usecase in patch description,
    and fixed default value for rx_queue_mapping.
v2: fixed build error for CONFIG_XPS wrapping, reported by
    kbuild test robot <lkp@intel.com>

Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:38:23 -07:00
John Fastabend
13d70f5a5e bpf, sk_msg: Add get socket storage helpers
Add helpers to use local socket storage.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/159033907577.12355.14740125020572756560.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:38:20 -07:00
John Fastabend
abe3cac870 bpf, sk_msg: Add some generic helpers that may be useful from sk_msg
Add these generic helpers that may be useful to use from sk_msg programs.
The helpers do not depend on ctx so we can simply add them here,

 BPF_FUNC_perf_event_output
 BPF_FUNC_get_current_uid_gid
 BPF_FUNC_get_current_pid_tgid
 BPF_FUNC_get_current_cgroup_id
 BPF_FUNC_get_current_ancestor_cgroup_id
 BPF_FUNC_get_cgroup_classid

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/159033903373.12355.15489763099696629346.stgit@john-Precision-5820-Tower
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-06-01 14:38:20 -07:00
Ilya Dryomov
d3798acc09 libceph: support for alloc hint flags
Allow indicating future I/O pattern via flags.  This is supported since
Kraken (and bluestore persists flags together with expected_object_size
and expected_write_size).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jason Dillaman <dillaman@redhat.com>
2020-06-01 23:32:35 +02:00
Al Viro
547ce4cfb3 switch cmsghdr_from_user_compat_to_kern() to copy_from_user()
no point getting compat_cmsghdr field-by-field

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 12:05:45 -07:00
Guillaume Nault
4e4f4ce6ab cls_flower: remove mpls_opts_policy
Compiling with W=1 gives the following warning:
net/sched/cls_flower.c:731:1: warning: ‘mpls_opts_policy’ defined but not used [-Wunused-const-variable=]

The TCA_FLOWER_KEY_MPLS_OPTS contains a list of
TCA_FLOWER_KEY_MPLS_OPTS_LSE. Therefore, the attributes all have the
same type and we can't parse the list with nla_parse*() and have the
attributes validated automatically using an nla_policy.

fl_set_key_mpls_opts() properly verifies that all attributes in the
list are TCA_FLOWER_KEY_MPLS_OPTS_LSE. Then fl_set_key_mpls_lse()
uses nla_parse_nested() on all these attributes, thus verifying that
they have the NLA_F_NESTED flag. So we can safely drop the
mpls_opts_policy.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 12:01:05 -07:00
Linus Torvalds
81e8c10dac Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Introduce crypto_shash_tfm_digest() and use it wherever possible.
   - Fix use-after-free and race in crypto_spawn_alg.
   - Add support for parallel and batch requests to crypto_engine.

  Algorithms:
   - Update jitter RNG for SP800-90B compliance.
   - Always use jitter RNG as seed in drbg.

  Drivers:
   - Add Arm CryptoCell driver cctrng.
   - Add support for SEV-ES to the PSP driver in ccp"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (114 commits)
  crypto: hisilicon - fix driver compatibility issue with different versions of devices
  crypto: engine - do not requeue in case of fatal error
  crypto: cavium/nitrox - Fix a typo in a comment
  crypto: hisilicon/qm - change debugfs file name from qm_regs to regs
  crypto: hisilicon/qm - add DebugFS for xQC and xQE dump
  crypto: hisilicon/zip - add debugfs for Hisilicon ZIP
  crypto: hisilicon/hpre - add debugfs for Hisilicon HPRE
  crypto: hisilicon/sec2 - add debugfs for Hisilicon SEC
  crypto: hisilicon/qm - add debugfs to the QM state machine
  crypto: hisilicon/qm - add debugfs for QM
  crypto: stm32/crc32 - protect from concurrent accesses
  crypto: stm32/crc32 - don't sleep in runtime pm
  crypto: stm32/crc32 - fix multi-instance
  crypto: stm32/crc32 - fix run-time self test issue.
  crypto: stm32/crc32 - fix ext4 chksum BUG_ON()
  crypto: hisilicon/zip - Use temporary sqe when doing work
  crypto: hisilicon - add device error report through abnormal irq
  crypto: hisilicon - remove codes of directly report device errors through MSI
  crypto: hisilicon - QM memory management optimization
  crypto: hisilicon - unify initial value assignment into QM
  ...
2020-06-01 12:00:10 -07:00
Horatiu Vultur
c6676e7d62 bridge: mrp: Add support for role MRA
A node that has the MRA role, it can behave as MRM or MRC.

Initially it starts as MRM and sends MRP_Test frames on both ring ports.
If it detects that there are MRP_Test send by another MRM, then it
checks if these frames have a lower priority than itself. In this case
it would send MRP_Nack frames to notify the other node that it needs to
stop sending MRP_Test frames.
If it receives a MRP_Nack frame then it stops sending MRP_Test frames
and starts to behave as a MRC but it would continue to monitor the
MRP_Test frames send by MRM. If at a point the MRM stops to send
MRP_Test frames it would get the MRM role and start to send MRP_Test
frames.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:56:11 -07:00
Horatiu Vultur
4b3a61b030 bridge: mrp: Set the priority of MRP instance
Each MRP instance has a priority, a lower value means a higher priority.
The priority of MRP instance is stored in MRP_Test frame in this way
all the MRP nodes in the ring can see other nodes priority.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:56:11 -07:00
Arnd Bergmann
0af413bd3e flow_dissector: work around stack frame size warning
The fl_flow_key structure is around 500 bytes, so having two of them
on the stack in one function now exceeds the warning limit after an
otherwise correct change:

net/sched/cls_flower.c:298:12: error: stack frame size of 1056 bytes in function 'fl_classify' [-Werror,-Wframe-larger-than=]

I suspect the fl_classify function could be reworked to only have one
of them on the stack and modify it in place, but I could not work out
how to do that.

As a somewhat hacky workaround, move one of them into an out-of-line
function to reduce its scope. This does not necessarily reduce the stack
usage of the outer function, but at least the second copy is removed
from the stack during most of it and does not add up to whatever is
called from there.

I now see 552 bytes of stack usage for fl_classify(), plus 528 bytes
for fl_mask_lookup().

Fixes: 58cff782cc ("flow_dissector: Parse multiple MPLS Label Stack Entries")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:52:05 -07:00
Ido Schimmel
5eb18a2b6c devlink: Add ACL control packet traps
Add packet traps for packets that are sampled / trapped by ACLs, so that
capable drivers could register them with devlink. Add documentation for
every added packet trap and packet trap group.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:49:23 -07:00
Ido Schimmel
d77cfd162a devlink: Add layer 3 control packet traps
Add layer 3 control packet traps such as ARP and DHCP, so that capable
device drivers could register them with devlink. Add documentation for
every added packet trap and packet trap group.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:49:23 -07:00
Ido Schimmel
515eac677f devlink: Add layer 2 control packet traps
Add layer 2 control packet traps such as STP and IGMP query, so that
capable device drivers could register them with devlink. Add
documentation for every added packet trap and packet trap group.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:49:23 -07:00
Ido Schimmel
30a4e9a29a devlink: Add 'control' trap type
This type is used for traps that trap control packets such as ARP
request and IGMP query to the CPU.

Do not report such packets to the kernel's drop monitor as they were not
dropped by the device no encountered an exception during forwarding.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:49:23 -07:00
Ido Schimmel
9eefeabed6 devlink: Add 'mirror' trap action
The action is used by control traps such as IGMP query. The packet is
flooded by the device, but also trapped to the CPU in order for the
software bridge to mark the receiving port as a multicast router port.
Such packets are marked with 'skb->offload_fwd_mark = 1' in order to
prevent the software bridge from flooding them again.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:49:23 -07:00
Ido Schimmel
678eb199cc devlink: Create dedicated trap group for layer 3 exceptions
Packets that hit exceptions during layer 3 forwarding must be trapped to
the CPU for the control plane to function properly. Create a dedicated
group for them, so that user space could choose to assign a different
policer for them.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:49:23 -07:00
David S. Miller
af0a2482fa Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next
to extend ctnetlink and the flowtable infrastructure:

1) Extend ctnetlink kernel side netlink dump filtering capabilities,
   from Romain Bellan.

2) Generalise the flowtable hook parser to take a hook list.

3) Pass a hook list to the flowtable hook registration/unregistration.

4) Add a helper function to release the flowtable hook list.

5) Update the flowtable event notifier to pass a flowtable hook list.

6) Allow users to add new devices to an existing flowtables.

7) Allow users to remove devices to an existing flowtables.

8) Allow for registering a flowtable with no initial devices.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:46:30 -07:00
Pablo Neira Ayuso
709ffbe19b net: remove indirect block netdev event registration
Drivers do not register to netdev events to set up indirect blocks
anymore. Remove __flow_indr_block_cb_register() and
__flow_indr_block_cb_unregister().

The frontends set up the callbacks through flow_indr_dev_setup_block()

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:41:50 -07:00
Pablo Neira Ayuso
0fdcf78d59 net: use flow_indr_dev_setup_offload()
Update existing frontends to use flow_indr_dev_setup_offload().

This new function must be called if ->ndo_setup_tc is unset to deal
with tunnel devices.

If there is no driver that is subscribed to new tunnel device
flow_block bindings, then this function bails out with EOPNOTSUPP.

If the driver module is removed, the ->cleanup() callback removes the
entries that belong to this tunnel device. This cleanup procedures is
triggered when the device unregisters the tunnel device offload handler.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:41:12 -07:00
Pablo Neira Ayuso
324a823b99 net: cls_api: add tcf_block_offload_init()
Add a helper function to initialize the flow_block_offload structure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:41:12 -07:00
Pablo Neira Ayuso
1fac52da59 net: flow_offload: consolidate indirect flow_block infrastructure
Tunnel devices provide no dev->netdev_ops->ndo_setup_tc(...) interface.
The tunnel device and route control plane does not provide an obvious
way to relate tunnel and physical devices.

This patch allows drivers to register a tunnel device offload handler
for the tc and netfilter frontends through flow_indr_dev_register() and
flow_indr_dev_unregister().

The frontend calls flow_indr_dev_setup_offload() that iterates over the
list of drivers that are offering tunnel device hardware offload
support and it sets up the flow block for this tunnel device.

If the driver module is removed, the indirect flow_block ends up with a
stale callback reference. The module removal path triggers the
dev_shutdown() path to remove the qdisc and the flow_blocks for the
physical devices. However, this is not useful for tunnel devices, where
relation between the physical and the tunnel device is not explicit.

This patch introduces a cleanup callback that is invoked when the driver
module is removed to clean up the tunnel device flow_block. This patch
defines struct flow_block_indr and it uses it from flow_block_cb to
store the information that front-end requires to perform the
flow_block_cb cleanup on module removal.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:41:12 -07:00
Pablo Neira Ayuso
a8284c6899 netfilter: nf_flowtable: expose nf_flow_table_gc_cleanup()
This function schedules the flow teardown state and it forces a gc run.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:41:12 -07:00
Davide Caratti
a01c245438 net/sched: fix a couple of splats in the error path of tfc_gate_init()
trying to configure TC 'act_gate' rules with invalid control actions, the
following splat can be observed:

 general protection fault, probably for non-canonical address 0xdffffc0000000002: 0000 [#1] SMP KASAN NOPTI
 KASAN: null-ptr-deref in range [0x0000000000000010-0x0000000000000017]
 CPU: 1 PID: 2143 Comm: tc Not tainted 5.7.0-rc6+ #168
 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
 RIP: 0010:hrtimer_active+0x56/0x290
 [...]
  Call Trace:
  hrtimer_try_to_cancel+0x6d/0x330
  hrtimer_cancel+0x11/0x20
  tcf_gate_cleanup+0x15/0x30 [act_gate]
  tcf_action_cleanup+0x58/0x170
  __tcf_action_put+0xb0/0xe0
  __tcf_idr_release+0x68/0x90
  tcf_gate_init+0x7c7/0x19a0 [act_gate]
  tcf_action_init_1+0x60f/0x960
  tcf_action_init+0x157/0x2a0
  tcf_action_add+0xd9/0x2f0
  tc_ctl_action+0x2a3/0x39d
  rtnetlink_rcv_msg+0x5f3/0x920
  netlink_rcv_skb+0x121/0x350
  netlink_unicast+0x439/0x630
  netlink_sendmsg+0x714/0xbf0
  sock_sendmsg+0xe2/0x110
  ____sys_sendmsg+0x5b4/0x890
  ___sys_sendmsg+0xe9/0x160
  __sys_sendmsg+0xd3/0x170
  do_syscall_64+0x9a/0x370
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

this is caused by hrtimer_cancel(), running before hrtimer_init(). Fix it
ensuring to call hrtimer_cancel() only if clockid is valid, and the timer
has been initialized. After fixing this splat, the same error path causes
another problem:

 general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] SMP KASAN NOPTI
 KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
 CPU: 1 PID: 980 Comm: tc Not tainted 5.7.0-rc6+ #168
 Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
 RIP: 0010:release_entry_list+0x4a/0x240 [act_gate]
 [...]
 Call Trace:
  tcf_action_cleanup+0x58/0x170
  __tcf_action_put+0xb0/0xe0
  __tcf_idr_release+0x68/0x90
  tcf_gate_init+0x7ab/0x19a0 [act_gate]
  tcf_action_init_1+0x60f/0x960
  tcf_action_init+0x157/0x2a0
  tcf_action_add+0xd9/0x2f0
  tc_ctl_action+0x2a3/0x39d
  rtnetlink_rcv_msg+0x5f3/0x920
  netlink_rcv_skb+0x121/0x350
  netlink_unicast+0x439/0x630
  netlink_sendmsg+0x714/0xbf0
  sock_sendmsg+0xe2/0x110
  ____sys_sendmsg+0x5b4/0x890
  ___sys_sendmsg+0xe9/0x160
  __sys_sendmsg+0xd3/0x170
  do_syscall_64+0x9a/0x370
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

the problem is similar: tcf_action_cleanup() was trying to release a list
without initializing it first. Ensure that INIT_LIST_HEAD() is called for
every newly created 'act_gate' action, same as what was done to 'act_ife'
with commit 44c23d7159 ("net/sched: act_ife: initalize ife->metalist
earlier").

Fixes: a51c328df3 ("net: qos: introduce a gate control flow action")
CC: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:36:36 -07:00
Ido Schimmel
53fc685243 bridge: Avoid infinite loop when suppressing NS messages with invalid options
When neighbor suppression is enabled the bridge device might reply to
Neighbor Solicitation (NS) messages on behalf of remote hosts.

In case the NS message includes the "Source link-layer address" option
[1], the bridge device will use the specified address as the link-layer
destination address in its reply.

To avoid an infinite loop, break out of the options parsing loop when
encountering an option with length zero and disregard the NS message.

This is consistent with the IPv6 ndisc code and RFC 4886 which states
that "Nodes MUST silently discard an ND packet that contains an option
with length zero" [2].

[1] https://tools.ietf.org/html/rfc4861#section-4.3
[2] https://tools.ietf.org/html/rfc4861#section-4.6

Fixes: ed842faeb2 ("bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Alla Segal <allas@mellanox.com>
Tested-by: Alla Segal <allas@mellanox.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:08:41 -07:00
Patrick Eigensatz
dafe2078a7 ipv4: nexthop: Fix deadcode issue by performing a proper NULL check
After allocating the spare nexthop group it should be tested for kzalloc()
returning NULL, instead the already used nexthop group (which cannot be
NULL at this point) had been tested so far.

Additionally, if kzalloc() fails, return ERR_PTR(-ENOMEM) instead of NULL.

Coverity-id: 1463885
Reported-by: Coverity <scan-admin@coverity.com>
Signed-off-by: Patrick Eigensatz <patrickeigensatz@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:05:35 -07:00
David S. Miller
07f6ecec65 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2020-06-01

Here's one last bluetooth-next pull request for 5.8, which I hope can
still be accepted.

 - Enabled Wide-Band Speech (WBS) support for Qualcomm wcn3991
 - Multiple fixes/imprvovements to Qualcomm-based devices
 - Fix GAP/SEC/SEM/BI-10-C qualfication test case
 - Added support for Broadcom BCM4350C5 device
 - Several other smaller fixes & improvements

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-06-01 11:01:09 -07:00
Ilya Dryomov
8ad44d5e0d libceph: read_from_replica option
Expose replica reads through read_from_replica=balance and
read_from_replica=localize.  The default is to read from primary
(read_from_replica=no).

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2020-06-01 13:22:53 +02:00
Ilya Dryomov
117d96a04f libceph: support for balanced and localized reads
OSD-side issues with reads from replica have been resolved in
Octopus.  Reading from replica should be safe wrt. unstable or
uncommitted state now, so add support for balanced and localized
reads.

There are two cases when a read from replica can't be served:

- OSD may silently drop the request, expecting the client to
  notice that the acting set has changed and resend via the usual
  means (handled with t->used_replica)

- OSD may return EAGAIN, expecting the client to resend to the
  primary, ignoring replica read flags (see handle_reply())

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2020-06-01 13:22:53 +02:00
Ilya Dryomov
45e6aa9f55 libceph: crush_location infrastructure
Allow expressing client's location in terms of CRUSH hierarchy as
a set of (bucket type name, bucket name) pairs.  The userspace syntax
"crush_location = key1=value1 key2=value2" is incompatible with mount
options and needed adaptation.  Key-value pairs are separated by '|'
and we use ':' instead of '=' to separate keys from values.  So for:

  crush_location = host=foo rack=bar

one would write:

  crush_location=host:foo|rack:bar

As in userspace, "multipath" locations are supported, so indicating
locality for parallel hierarchies is possible:

  crush_location=rack:foo1|rack:foo2|datacenter:bar

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2020-06-01 13:22:53 +02:00
Ilya Dryomov
86403a92c3 libceph: decode CRUSH device/bucket types and names
These would be matched with the provided client location to calculate
the locality value.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2020-06-01 13:22:53 +02:00
Ilya Dryomov
8a4b863c87 libceph: add non-asserting rbtree insertion helper
Needed for the next commit and useful for ceph_pg_pool_info tree as
well.  I'm leaving the asserting helper in for now, but we should look
at getting rid of it in the future.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2020-06-01 13:22:53 +02:00
Xiubo Li
97e27aaa9a ceph: add read/write latency metric support
Calculate the latency for OSD read requests. Add a new r_end_stamp
field to struct ceph_osd_request that will hold the time of that
the reply was received. Use that to calculate the RTT for each call,
and divide the sum of those by number of calls to get averate RTT.

Keep a tally of RTT for OSD writes and number of calls to track average
latency of OSD writes.

URL: https://tracker.ceph.com/issues/43215
Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-06-01 13:22:51 +02:00
David S. Miller
1806c13dc2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
xdp_umem.c had overlapping changes between the 64-bit math fix
for the calculation of npgs and the removal of the zerocopy
memory type which got rid of the chunk_size_nohdr member.

The mlx5 Kconfig conflict is a case where we just take the
net-next copy of the Kconfig entry dependency as it takes on
the ESWITCH dependency by one level of indirection which is
what the 'net' conflicting change is trying to ensure.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-31 17:48:46 -07:00
David S. Miller
1079a34c56 Another set of changes, including
* many 6 GHz changes, though it's not _quite_ complete
    (I left out scanning for now, we're still discussing)
  * allow userspace SA-query processing for operating channel
    validation
  * TX status for control port TX, for AP-side operation
  * more per-STA/TID control options
  * move to kHz for channels, for future S1G operation
  * various other small changes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl7TfecACgkQB8qZga/f
 l8Tx6hAAgRizfdHb9xxp001AAzKnsdU46srOKOhwV2d6w+S+qHbLtwa0Xz43pBvX
 LxpQs7dBQBLYh11xJhDlKY6duYV989xGcsHm7suO43jbjDo8KXfz4MaP65em6EKt
 pdD0mD1sKkfR4FhYNbUEe8Ug/185jdk+gX+aI1Nrz6XlkUoiY+czSnGFyAvpvau2
 I+NGqyKG5D6ureq7p7dQcgN+t2D4Ou9stVhpQ+jP0Ep720gvfTEzeFuMJbb3JZ1y
 KSgOOWS1HQj1FdlJDs3KAmgUXpkU/lxZhNxl06MMYo3tB7Y0vmLoy/ZNcb5eW4Sw
 a0SHgG5yhDysCyINz6q7llG3esDcppGiNuMjd/qR2qPOZPHNtlYaHtcoKBcKdS0k
 03DyURZpA0B33cr9FTV8tXaM7IMY/2qaq/DqkeNtuDzGdh4jEwkVJ4fNtUAdgcOv
 4JEz3A7fY3isy8tzi7Dom4U/2hR1di5gZloAC5PPYRvnbmY9HoIqG06k1Wtn1Yj4
 pbquqvdJ5ONcaAaXz7zVQUZm1JzrK81Pl3pdih7USasc8z2MEzWQPSR+hxtwG5TY
 KbDI1Nel8ZLbL2MWDakh3+lPoJAMuyadRlVVWEMj4l/afYHgcy5hEbaMbaZnxmAg
 G4I6R5JZTJZuVdKi/U/Q9n7jR83qfIRNbxMLY8HFZ4caJ5qhZGs=
 =wdaG
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2020-05-31' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Another set of changes, including
 * many 6 GHz changes, though it's not _quite_ complete
   (I left out scanning for now, we're still discussing)
 * allow userspace SA-query processing for operating channel
   validation
 * TX status for control port TX, for AP-side operation
 * more per-STA/TID control options
 * move to kHz for channels, for future S1G operation
 * various other small changes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-31 14:32:50 -07:00
Linus Torvalds
19835b1ba6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
 "Another week, another set of bug fixes:

   1) Fix pskb_pull length in __xfrm_transport_prep(), from Xin Long.

   2) Fix double xfrm_state put in esp{4,6}_gro_receive(), also from Xin
      Long.

   3) Re-arm discovery timer properly in mac80211 mesh code, from Linus
      Lüssing.

   4) Prevent buffer overflows in nf_conntrack_pptp debug code, from
      Pablo Neira Ayuso.

   5) Fix race in ktls code between tls_sw_recvmsg() and
      tls_decrypt_done(), from Vinay Kumar Yadav.

   6) Fix crashes on TCP fallback in MPTCP code, from Paolo Abeni.

   7) More validation is necessary of untrusted GSO packets coming from
      virtualization devices, from Willem de Bruijn.

   8) Fix endianness of bnxt_en firmware message length accesses, from
      Edwin Peer.

   9) Fix infinite loop in sch_fq_pie, from Davide Caratti.

  10) Fix lockdep splat in DSA by setting lockless TX in netdev features
      for slave ports, from Vladimir Oltean.

  11) Fix suspend/resume crashes in mlx5, from Mark Bloch.

  12) Fix use after free in bpf fmod_ret, from Alexei Starovoitov.

  13) ARP retransmit timer guard uses wrong offset, from Hongbin Liu.

  14) Fix leak in inetdev_init(), from Yang Yingliang.

  15) Don't try to use inet hash and unhash in l2tp code, results in
      crashes. From Eric Dumazet"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (77 commits)
  l2tp: add sk_family checks to l2tp_validate_socket
  l2tp: do not use inet_hash()/inet_unhash()
  net: qrtr: Allocate workqueue before kernel_bind
  mptcp: remove msk from the token container at destruction time.
  mptcp: fix race between MP_JOIN and close
  mptcp: fix unblocking connect()
  net/sched: act_ct: add nat mangle action only for NAT-conntrack
  devinet: fix memleak in inetdev_init()
  virtio_vsock: Fix race condition in virtio_transport_recv_pkt
  drivers/net/ibmvnic: Update VNIC protocol version reporting
  NFC: st21nfca: add missed kfree_skb() in an error path
  neigh: fix ARP retransmit timer guard
  bpf, selftests: Add a verifier test for assigning 32bit reg states to 64bit ones
  bpf, selftests: Verifier bounds tests need to be updated
  bpf: Fix a verifier issue when assigning 32bit reg states to 64bit ones
  bpf: Fix use-after-free in fmod_ret check
  net/mlx5e: replace EINVAL in mlx5e_flower_parse_meta()
  net/mlx5e: Fix MLX5_TC_CT dependencies
  net/mlx5e: Properly set default values when disabling adaptive moderation
  net/mlx5e: Fix arch depending casting issue in FEC
  ...
2020-05-31 10:16:53 -07:00
David Howells
32f71aa497 rxrpc: Adjust /proc/net/rxrpc/calls to display call->debug_id not user_ID
The user ID value isn't actually much use - and leaks a kernel pointer or a
userspace value - so replace it with the call debug ID, which appears in trace
points.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31 15:19:51 +01:00
David Howells
23e2db311a rxrpc: Map the EACCES error produced by some ICMP6 to EHOSTUNREACH
Map the EACCES error that is produced by some ICMP6 packets to EHOSTUNREACH
when we get them as EACCES has other meanings within a filesystem context.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-31 15:19:51 +01:00
Nathan Errera
093a48d2aa cfg80211: support bigger kek/kck key length
With some newer AKMs, the KCK and KEK are bigger, so allow that
if the driver advertises support for it. In addition, add a new
attribute for the AKM so we can use it for offloaded rekeying.

Signed-off-by: Nathan Errera <nathan.errera@intel.com>
[reword commit message]
Link: https://lore.kernel.org/r/20200528212237.5eb58b00a5d1.I61b09d77c4f382e8d58a05dcca78096e99a6bc15@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:27:24 +02:00
Tova Mussai
07c12d618f mac80211: set short_slot for 6 GHz band
Set short slot also for 6 GHz band, just like 5 GHz.

Signed-off-by: Tova Mussai <tova.mussai@intel.com>
Link: https://lore.kernel.org/r/20200528213443.75f38e6f5efd.I272fbae402b03123f04e9ae69204eeab960c70cd@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:27:22 +02:00
Ilan Peer
6fcb56ce0f mac80211: Consider 6 GHz band when handling power constraint
Treat it like the 5 GHz band.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Link: https://lore.kernel.org/r/20200528213443.889e5c9dd006.Id8ed3bb8000ba8738be5df05639415eb2e23c61a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:27:20 +02:00
Johannes Berg
93382a0d11 mac80211: accept aggregation sessions on 6 GHz
On 6 GHz, stations don't have ht_supported set, but they can
still do aggregation since they must have HE, allow that.

Link: https://lore.kernel.org/r/20200528213443.776d3c891b64.Ifa099d450617b50c691832b3c4aa08959fab520a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:27:16 +02:00
Johannes Berg
f438136528 cfg80211: require HE capabilities for 6 GHz band
On 6 GHz band, HE capabilities must be available for all of
the interface types, otherwise we shouldn't use 6 GHz. Check
this.

Link: https://lore.kernel.org/r/20200528213443.5881cb3c8c4a.I583b54172f91f98d44af64a16c5826fe458cbb27@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:27:14 +02:00
Johannes Berg
461ce35d55 cfg80211: reject HT/VHT capabilities on 6 GHz band
On the 6 GHz band, HE should be used, but without any direct HT/VHT
capabilities, instead the HE 6 GHz band capabilities will capture
the relevant information. Reject HT/VHT capabilities here.

Link: https://lore.kernel.org/r/20200528213443.bfe89c35459a.Ibba5e066fa0087fd49d13cfee89d196ea0c68ae2@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:27:11 +02:00
Johannes Berg
ba8f6a037f cfg80211: treat 6 GHz channels as valid regardless of capability
If a 6 GHz channel exists, then we can probably safely assume that
the device actually supports it, and then it should support most
bandwidths.

This will probably need to be extended to check the interface type
and then dig into the HE capabilities for that though, to have the
correct bandwidth check.

Link: https://lore.kernel.org/r/20200528213443.d4864ef52e92.I82f09b2b14a56413ce20376d09967fe954a033eb@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:27:08 +02:00
Ilan Peer
2ad2274c58 mac80211: Add HE 6GHz capabilities element to probe request
On 6 GHz, the 6 GHz capabilities element should be added, do that.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
[add commit message]
Link: https://lore.kernel.org/r/20200528213443.8ee764f0cde0.I2b0c66b60e11818c97c9803e04a6a197c6376243@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:27:05 +02:00
Johannes Berg
1bb9a8a4c8 mac80211: use HE 6 GHz band capability and pass it to the driver
In order to handle 6 GHz AP side, take the HE 6 GHz band capability
data and pass it to the driver (which needs it for A-MPDU spacing
and A-MPDU length).

Link: https://lore.kernel.org/r/1589399105-25472-6-git-send-email-rmanohar@codeaurora.org
Co-developed-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Link: https://lore.kernel.org/r/20200528213443.784e4890d82f.I5f1230d5ab27e84e7bbe88e3645b24ea15a0c146@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:27:03 +02:00
Shaul Triebitz
3b3ec3d52e mac80211: check the correct bit for EMA AP
An AP supporting EMA (Enhanced Multi-BSSID advertisement) should set
bit 83 in the extended capabilities IE (9.4.2.26 in the 802.11ax D5 spec).
So the *3rd* bit of the 10th byte should be checked.
Also, in one place, the wrong byte was checked.
(cfg80211_find_ie returns a pointer to the beginning of the IE,
 so the data really starts at ie[2], so the 10th byte
 should be ie[12]. To avoid this confusion, use cfg80211_find_elem
 instead).

Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Link: https://lore.kernel.org/r/20200528213443.4316121fa2a3.I9745582f8d41ad8e689dac0fefcd70b276d7c1ea@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:26:59 +02:00
Johannes Berg
57fa5e85d5 mac80211: determine chandef from HE 6 GHz operation
Support connecting to HE 6 GHz APs and mesh networks on 6 GHz,
where the HT/VHT information is missing but instead the HE 6 GHz
band capability is present, and the 6 GHz Operation information
field is used to encode the channel configuration instead of the
HT/VHT operation elements.

Also add some other bits needed to connect to 6 GHz networks.

Link: https://lore.kernel.org/r/1589399105-25472-10-git-send-email-rmanohar@codeaurora.org
Co-developed-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Link: https://lore.kernel.org/r/20200528213443.25687d2695bc.I3f9747c1147480f65445f13eda5c4a5ed4e86757@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:26:52 +02:00
Johannes Berg
2a333a0db2 mac80211: avoid using ext NSS high BW if not supported
If the AP advertises inconsistent data, namely it has CCFS1 or CCFS2,
but doesn't advertise support for 160/80+80 bandwidth or "Extended NSS
BW Support", then we cannot use any MCSes in the the higher bandwidth.
Thus, avoid connecting with higher bandwidth since it's less efficient
that way.

Link: https://lore.kernel.org/r/20200528213443.0e55d40c3ccc.I6fd0b4708ebd087e5e46466c3e91f6efbcbef668@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:26:50 +02:00
Rajkumar Manoharan
607ca9ea34 mac80211: do not allow HT/VHT IEs in 6 GHz mesh mode
As HT/VHT elements are not allowed in 6 GHz band, do not include
them in mesh beacon template formation.

Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Link: https://lore.kernel.org/r/1589399105-25472-9-git-send-email-rmanohar@codeaurora.org
Link: https://lore.kernel.org/r/20200528193455.76796-2-johannes@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:26:46 +02:00
Rajkumar Manoharan
d1b7524b3e mac80211: build HE operation with 6 GHz oper information
Add 6 GHz operation information (IEEE 802.11ax/D6.0, Figure 9-787k)
while building HE operation element for non-HE AP. This field is used to
determine channel information in the absence of HT/VHT IEs.

Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Link: https://lore.kernel.org/r/1589399105-25472-8-git-send-email-rmanohar@codeaurora.org
[fix skb allocation size]
Link: https://lore.kernel.org/r/20200528193455.76796-1-johannes@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:26:42 +02:00
Rajkumar Manoharan
24a2042cb2 mac80211: add HE 6 GHz Band Capability element
Construct HE 6 GHz band capability element (IEEE 802.11ax/D6.0,
9.4.2.261) for association request and mesh beacon. The 6 GHz
capability information is passed by driver through iftypes caps.

Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Link: https://lore.kernel.org/r/1589399105-25472-7-git-send-email-rmanohar@codeaurora.org
[handle SMPS, adjust for previous patches, reserve SKB space properly,
 change to handle SKB directly]
Link: https://lore.kernel.org/r/20200528213443.643aa8101111.I3f9747c1147480f65445f13eda5c4a5ed4e86757@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:26:39 +02:00
Johannes Berg
2239521772 cfg80211: add and expose HE 6 GHz band capabilities
These capabilities cover what would otherwise be transported
in HT/VHT capabilities, but only a subset thereof that is
actually needed on 6 GHz with HE already present. Expose the
capabilities to userspace, drivers are expected to set them
as using the 6 GHz band (currently) requires HE capability.

Link: https://lore.kernel.org/r/20200528213443.244cd5cb9db8.Icd8c773277a88c837e7e3af1d4d1013cc3b66543@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:26:36 +02:00
Rajkumar Manoharan
a6cf28e05f mac80211: add HE 6 GHz Band Capabilities into parse extension
Handle 6 GHz band capability element parsing for association.

Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Link: https://lore.kernel.org/r/1589399105-25472-4-git-send-email-rmanohar@codeaurora.org
[some renaming to be in line with previous patches]
Link: https://lore.kernel.org/r/20200528213443.a13d7a0b85b0.Ia07584da4fc77aa77c4cc563248d2ce4234ffe5d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:26:31 +02:00
Rajkumar Manoharan
43e64bf301 cfg80211: handle 6 GHz capability of new station
Handle 6 GHz HE capability while adding new station. It will be used
later in mac80211 station processing.

Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Link: https://lore.kernel.org/r/1589399105-25472-2-git-send-email-rmanohar@codeaurora.org
[handle nl80211_set_station, require WME,
 remove NL80211_HE_6GHZ_CAPABILITY_LEN]
Link: https://lore.kernel.org/r/20200528213443.b6b711fd4312.Ic9b97d57b6c4f2b28d4b2d23d2849d8bc20bd8cc@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:26:20 +02:00
Johannes Berg
0e47901d78 nl80211: really allow client-only BIGTK support
My previous commit here was wrong, it didn't check the new
flag in two necessary places, so things didn't work. Fix that.

Fixes: 155d7c7338 ("nl80211: allow client-only BIGTK support")
Link: https://lore.kernel.org/r/20200528213443.993f108e96ca.I0086ae42d672379380d04ac5effb2f3d5135731b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:26:05 +02:00
Arend Van Spriel
d1a1646c0d cfg80211: adapt to new channelization of the 6GHz band
The 6GHz band does not have regulatory approval yet, but things are
moving forward. However, that has led to a change in the channelization
of the 6GHz band which has been accepted in the 11ax specification. It
also fixes a missing MHZ_TO_KHZ() macro for 6GHz channels while at it.

This change is primarily thrown in to discuss how to deal with it.
I noticed ath11k adding 6G support with old channelization and ditto
for iw. It probably involves changes in hostapd as well.

Cc: Pradeep Kumar Chitrapu <pradeepc@codeaurora.org>
Cc: Jouni Malinen <jouni@w1.fi>
Signed-off-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Link: https://lore.kernel.org/r/edf07cdd-ad15-4012-3afd-d8b961a80b69@broadcom.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:26:03 +02:00
Johannes Berg
5e9cf0f0a3 cfg80211: fix 6 GHz frequencies to kHz
The updates to change to kHz frequencies and the 6 GHz
additions evidently overlapped (or rather, I didn't see
it when applying the latter), so the 6 GHz is broken.
Fix this.

Fixes: 934f4c7dd3 ("cfg80211: express channels with a KHz component")
Link: https://lore.kernel.org/r/20200529140425.1bf824f6911b.I4a1174916b8f5965af4366999eb9ffc7a0347470@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-31 11:25:59 +02:00
Eric Dumazet
d9a81a2252 l2tp: add sk_family checks to l2tp_validate_socket
syzbot was able to trigger a crash after using an ISDN socket
and fool l2tp.

Fix this by making sure the UDP socket is of the proper family.

BUG: KASAN: slab-out-of-bounds in setup_udp_tunnel_sock+0x465/0x540 net/ipv4/udp_tunnel.c:78
Write of size 1 at addr ffff88808ed0c590 by task syz-executor.5/3018

CPU: 0 PID: 3018 Comm: syz-executor.5 Not tainted 5.7.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x188/0x20d lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0xd3/0x413 mm/kasan/report.c:382
 __kasan_report.cold+0x20/0x38 mm/kasan/report.c:511
 kasan_report+0x33/0x50 mm/kasan/common.c:625
 setup_udp_tunnel_sock+0x465/0x540 net/ipv4/udp_tunnel.c:78
 l2tp_tunnel_register+0xb15/0xdd0 net/l2tp/l2tp_core.c:1523
 l2tp_nl_cmd_tunnel_create+0x4b2/0xa60 net/l2tp/l2tp_netlink.c:249
 genl_family_rcv_msg_doit net/netlink/genetlink.c:673 [inline]
 genl_family_rcv_msg net/netlink/genetlink.c:718 [inline]
 genl_rcv_msg+0x627/0xdf0 net/netlink/genetlink.c:735
 netlink_rcv_skb+0x15a/0x410 net/netlink/af_netlink.c:2469
 genl_rcv+0x24/0x40 net/netlink/genetlink.c:746
 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
 netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329
 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 ____sys_sendmsg+0x6e6/0x810 net/socket.c:2352
 ___sys_sendmsg+0x100/0x170 net/socket.c:2406
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2439
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x45ca29
Code: 0d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007effe76edc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004fe1c0 RCX: 000000000045ca29
RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
RBP: 000000000078bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 000000000000094e R14: 00000000004d5d00 R15: 00007effe76ee6d4

Allocated by task 3018:
 save_stack+0x1b/0x40 mm/kasan/common.c:49
 set_track mm/kasan/common.c:57 [inline]
 __kasan_kmalloc mm/kasan/common.c:495 [inline]
 __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:468
 __do_kmalloc mm/slab.c:3656 [inline]
 __kmalloc+0x161/0x7a0 mm/slab.c:3665
 kmalloc include/linux/slab.h:560 [inline]
 sk_prot_alloc+0x223/0x2f0 net/core/sock.c:1612
 sk_alloc+0x36/0x1100 net/core/sock.c:1666
 data_sock_create drivers/isdn/mISDN/socket.c:600 [inline]
 mISDN_sock_create+0x272/0x400 drivers/isdn/mISDN/socket.c:796
 __sock_create+0x3cb/0x730 net/socket.c:1428
 sock_create net/socket.c:1479 [inline]
 __sys_socket+0xef/0x200 net/socket.c:1521
 __do_sys_socket net/socket.c:1530 [inline]
 __se_sys_socket net/socket.c:1528 [inline]
 __x64_sys_socket+0x6f/0xb0 net/socket.c:1528
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
 entry_SYSCALL_64_after_hwframe+0x49/0xb3

Freed by task 2484:
 save_stack+0x1b/0x40 mm/kasan/common.c:49
 set_track mm/kasan/common.c:57 [inline]
 kasan_set_free_info mm/kasan/common.c:317 [inline]
 __kasan_slab_free+0xf7/0x140 mm/kasan/common.c:456
 __cache_free mm/slab.c:3426 [inline]
 kfree+0x109/0x2b0 mm/slab.c:3757
 kvfree+0x42/0x50 mm/util.c:603
 __free_fdtable+0x2d/0x70 fs/file.c:31
 put_files_struct fs/file.c:420 [inline]
 put_files_struct+0x248/0x2e0 fs/file.c:413
 exit_files+0x7e/0xa0 fs/file.c:445
 do_exit+0xb04/0x2dd0 kernel/exit.c:791
 do_group_exit+0x125/0x340 kernel/exit.c:894
 get_signal+0x47b/0x24e0 kernel/signal.c:2739
 do_signal+0x81/0x2240 arch/x86/kernel/signal.c:784
 exit_to_usermode_loop+0x26c/0x360 arch/x86/entry/common.c:161
 prepare_exit_to_usermode arch/x86/entry/common.c:196 [inline]
 syscall_return_slowpath arch/x86/entry/common.c:279 [inline]
 do_syscall_64+0x6b1/0x7d0 arch/x86/entry/common.c:305
 entry_SYSCALL_64_after_hwframe+0x49/0xb3

The buggy address belongs to the object at ffff88808ed0c000
 which belongs to the cache kmalloc-2k of size 2048
The buggy address is located 1424 bytes inside of
 2048-byte region [ffff88808ed0c000, ffff88808ed0c800)
The buggy address belongs to the page:
page:ffffea00023b4300 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0
flags: 0xfffe0000000200(slab)
raw: 00fffe0000000200 ffffea0002838208 ffffea00015ba288 ffff8880aa000e00
raw: 0000000000000000 ffff88808ed0c000 0000000100000001 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88808ed0c480: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff88808ed0c500: 00 00 00 fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88808ed0c580: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
                         ^
 ffff88808ed0c600: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff88808ed0c680: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

Fixes: 6b9f34239b ("l2tp: fix races in tunnel creation")
Fixes: fd558d186d ("l2tp: Split pppol2tp patch into separate l2tp and ppp parts")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: James Chapman <jchapman@katalix.com>
Cc: Guillaume Nault <gnault@redhat.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30 21:56:55 -07:00
Eric Dumazet
02c71b144c l2tp: do not use inet_hash()/inet_unhash()
syzbot recently found a way to crash the kernel [1]

Issue here is that inet_hash() & inet_unhash() are currently
only meant to be used by TCP & DCCP, since only these protocols
provide the needed hashinfo pointer.

L2TP uses a single list (instead of a hash table)

This old bug became an issue after commit 6102365876
("bpf: Add new cgroup attach type to enable sock modifications")
since after this commit, sk_common_release() can be called
while the L2TP socket is still considered 'hashed'.

general protection fault, probably for non-canonical address 0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 PID: 7063 Comm: syz-executor654 Not tainted 5.7.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:inet_unhash+0x11f/0x770 net/ipv4/inet_hashtables.c:600
Code: 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e dd 04 00 00 48 8d 7d 08 44 8b 73 08 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 55 05 00 00 48 8d 7d 14 4c 8b 6d 08 48 b8 00 00
RSP: 0018:ffffc90001777d30 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff88809a6df940 RCX: ffffffff8697c242
RDX: 0000000000000001 RSI: ffffffff8697c251 RDI: 0000000000000008
RBP: 0000000000000000 R08: ffff88809f3ae1c0 R09: fffffbfff1514cc1
R10: ffffffff8a8a6607 R11: fffffbfff1514cc0 R12: ffff88809a6df9b0
R13: 0000000000000007 R14: 0000000000000000 R15: ffffffff873a4d00
FS:  0000000001d2b880(0000) GS:ffff8880ae600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000006cd090 CR3: 000000009403a000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 sk_common_release+0xba/0x370 net/core/sock.c:3210
 inet_create net/ipv4/af_inet.c:390 [inline]
 inet_create+0x966/0xe00 net/ipv4/af_inet.c:248
 __sock_create+0x3cb/0x730 net/socket.c:1428
 sock_create net/socket.c:1479 [inline]
 __sys_socket+0xef/0x200 net/socket.c:1521
 __do_sys_socket net/socket.c:1530 [inline]
 __se_sys_socket net/socket.c:1528 [inline]
 __x64_sys_socket+0x6f/0xb0 net/socket.c:1528
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x441e29
Code: e8 fc b3 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffdce184148 EFLAGS: 00000246 ORIG_RAX: 0000000000000029
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000441e29
RDX: 0000000000000073 RSI: 0000000000000002 RDI: 0000000000000002
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000402c30 R14: 0000000000000000 R15: 0000000000000000
Modules linked in:
---[ end trace 23b6578228ce553e ]---
RIP: 0010:inet_unhash+0x11f/0x770 net/ipv4/inet_hashtables.c:600
Code: 03 0f b6 04 02 84 c0 74 08 3c 03 0f 8e dd 04 00 00 48 8d 7d 08 44 8b 73 08 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 55 05 00 00 48 8d 7d 14 4c 8b 6d 08 48 b8 00 00
RSP: 0018:ffffc90001777d30 EFLAGS: 00010202
RAX: dffffc0000000000 RBX: ffff88809a6df940 RCX: ffffffff8697c242
RDX: 0000000000000001 RSI: ffffffff8697c251 RDI: 0000000000000008
RBP: 0000000000000000 R08: ffff88809f3ae1c0 R09: fffffbfff1514cc1
R10: ffffffff8a8a6607 R11: fffffbfff1514cc0 R12: ffff88809a6df9b0
R13: 0000000000000007 R14: 0000000000000000 R15: ffffffff873a4d00
FS:  0000000001d2b880(0000) GS:ffff8880ae600000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000006cd090 CR3: 000000009403a000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fixes: 0d76751fad ("l2tp: Add L2TPv3 IP encapsulation (no UDP) support")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: James Chapman <jchapman@katalix.com>
Cc: Andrii Nakryiko <andriin@fb.com>
Reported-by: syzbot+3610d489778b57cc8031@syzkaller.appspotmail.com
2020-05-30 21:55:16 -07:00
Paolo Abeni
39884604b1 mptcp: fix NULL ptr dereference in MP_JOIN error path
When token lookup on MP_JOIN 3rd ack fails, the server
socket closes with a reset the incoming child. Such socket
has the 'is_mptcp' flag set, but no msk socket associated
- due to the failed lookup.

While crafting the reset packet mptcp_established_options_mp()
will try to dereference the child's master socket, causing
a NULL ptr dereference.

This change addresses the issue with explicit fallback to
TCP in such error path.

Fixes: 729cd6436f ("mptcp: cope better with MP_JOIN failure")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30 21:54:03 -07:00
Toke Høiland-Jørgensen
b0c19ed608 sch_cake: Take advantage of skb->hash where appropriate
While the other fq-based qdiscs take advantage of skb->hash and doesn't
recompute it if it is already set, sch_cake does not.

This was a deliberate choice because sch_cake hashes various parts of the
packet header to support its advanced flow isolation modes. However,
foregoing the use of skb->hash entirely loses a few important benefits:

- When skb->hash is set by hardware, a few CPU cycles can be saved by not
  hashing again in software.

- Tunnel encapsulations will generally preserve the value of skb->hash from
  before the encapsulation, which allows flow-based qdiscs to distinguish
  between flows even though the outer packet header no longer has flow
  information.

It turns out that we can preserve these desirable properties in many cases,
while still supporting the advanced flow isolation properties of sch_cake.
This patch does so by reusing the skb->hash value as the flow_hash part of
the hashing procedure in cake_hash() only in the following conditions:

- If the skb->hash is marked as covering the flow headers (skb->l4_hash is
  set)

AND

- NAT header rewriting is either disabled, or did not change any values
  used for hashing. The latter is important to match local-origin packets
  such as those of a tunnel endpoint.

The immediate motivation for fixing this was the recent patch to WireGuard
to preserve the skb->hash on encapsulation. As such, this is also what I
tested against; with this patch, added latency under load for competing
flows drops from ~8 ms to sub-1ms on an RRUL test over a WireGuard tunnel
going through a virtual link shaped to 1Gbps using sch_cake. This matches
the results we saw with a similar setup using sch_fq_codel when testing the
WireGuard patch.

Fixes: 046f6fd5da ("sched: Add Common Applications Kept Enhanced (cake) qdisc")
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30 21:52:20 -07:00
Chris Lew
c6e08d6251 net: qrtr: Allocate workqueue before kernel_bind
A null pointer dereference in qrtr_ns_data_ready() is seen if a client
opens a qrtr socket before qrtr_ns_init() can bind to the control port.
When the control port is bound, the ENETRESET error will be broadcasted
and clients will close their sockets. This results in DEL_CLIENT
packets being sent to the ns and qrtr_ns_data_ready() being called
without the workqueue being allocated.

Allocate the workqueue before setting sk_data_ready and binding to the
control port. This ensures that the work and workqueue structs are
allocated and initialized before qrtr_ns_data_ready can be called.

Fixes: 0c2204a4ad ("net: qrtr: Migrate nameservice to kernel from userspace")
Signed-off-by: Chris Lew <clew@codeaurora.org>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30 21:43:13 -07:00
Paolo Abeni
c5c79763fa mptcp: remove msk from the token container at destruction time.
Currently we remote the msk from the token container only
via mptcp_close(). The MPTCP master socket can be destroyed
also via other paths (e.g. if not yet accepted, when shutting
down the listener socket). When we hit the latter scenario,
dangling msk references are left into the token container,
leading to memory corruption and/or UaF.

This change addresses the issue by moving the token removal
into the msk destructor.

Fixes: 79c0949e9a ("mptcp: Add key generation and token tree")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30 21:39:13 -07:00
Paolo Abeni
10f6d46c94 mptcp: fix race between MP_JOIN and close
If a MP_JOIN subflow completes the 3whs while another
CPU is closing the master msk, we can hit the
following race:

CPU1                                    CPU2

close()
 mptcp_close
                                        subflow_syn_recv_sock
                                         mptcp_token_get_sock
                                         mptcp_finish_join
                                          inet_sk_state_load
  mptcp_token_destroy
  inet_sk_state_store(TCP_CLOSE)
  __mptcp_flush_join_list()
                                          mptcp_sock_graft
                                          list_add_tail
  sk_common_release
   sock_orphan()
 <socket free>

The MP_JOIN socket will be leaked. Additionally we can hit
UaF for the msk 'struct socket' referenced via the 'conn'
field.

This change try to address the issue introducing some
synchronization between the MP_JOIN 3whs and mptcp_close
via the join_list spinlock. If we detect the msk is closing
the MP_JOIN socket is closed, too.

Fixes: f296234c98 ("mptcp: Add handling of incoming MP_JOIN requests")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30 21:39:13 -07:00
Paolo Abeni
41be81a8d3 mptcp: fix unblocking connect()
Currently unblocking connect() on MPTCP sockets fails frequently.
If mptcp_stream_connect() is invoked to complete a previously
attempted unblocking connection, it will still try to create
the first subflow via __mptcp_socket_create(). If the 3whs is
completed and the 'can_ack' flag is already set, the latter
will fail with -EINVAL.

This change addresses the issue checking for pending connect and
delegating the completion to the first subflow. Additionally
do msk addresses and sk_state changes only when needed.

Fixes: 2303f994b3 ("mptcp: Associate MPTCP context with TCP socket")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30 21:39:13 -07:00
Karsten Graul
b8ded9de8d net/smc: pre-fetch send buffer outside of send_lock
Pre-fetch send buffer for the CDC validation message before entering the
send_lock. Without that the send call might fail with -EBUSY because
there are no free buffers and waiting for buffers is not possible under
send_lock.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30 18:12:25 -07:00
wenxu
05aa69e5cb net/sched: act_ct: add nat mangle action only for NAT-conntrack
Currently add nat mangle action with comparing invert and orig tuple.
It is better to check IPS_NAT_MASK flags first to avoid non necessary
memcmp for non-NAT conntrack.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30 17:57:58 -07:00
David S. Miller
4300c7e7fe mlx5-cleanup-2020-05-29
Accumulated cleanup patches and sparse warning fixes for mlx5 driver.
 
 1) sync with mlx5-next branch
 
 2) Eli Cohen declares mpls_entry_encode() helper in mpls.h as suggested
 by Jakub Kicinski and David Ahern, and use it in mlx5
 
 3) Jesper Fixes xdp data_meta setup in mlx5
 
 4) Many sparse and build warnings cleanup
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAl7R3wcACgkQSD+KveBX
 +j6/ZQf/QD39naPeImfLjemkRK9L+TKbS4nU6wpUwf1jC33Wdm4HhkhsWEnR6C4l
 OwU/Pae3I9EtKP4gRE0W1o8h7zC9h4hY7+IKZOdyQ32iUY55PX/H25oqAiCj1NCM
 xzWpXOTwK/vkqmkCedAd+YpNdYlbOhfycr+KVPSsvFdaPqjzfNO1PJcLsUbAbzrX
 A+8pYdhUYTtx1N3YHJL5abLN6WzMAKxgwlm9GG8YCXACTJT6CBWWHGebVsC5TDUk
 Lj5hJj38mI8/3dcu6vWP0kLGVfRZo0HS/gpPGxbKQFpP+1uBYaRENAQONxkY++6S
 GDPix7ccvN+yNMlON893PC/Cogw3Yg==
 =WaCJ
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-cleanup-2020-05-29' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-cleanup-2020-05-29

Accumulated cleanup patches and sparse warning fixes for mlx5 driver.

1) sync with mlx5-next branch

2) Eli Cohen declares mpls_entry_encode() helper in mpls.h as suggested
by Jakub Kicinski and David Ahern, and use it in mlx5

3) Jesper Fixes xdp data_meta setup in mlx5

4) Many sparse and build warnings cleanup
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30 17:53:57 -07:00
Yang Yingliang
1b49cd71b5 devinet: fix memleak in inetdev_init()
When devinet_sysctl_register() failed, the memory allocated
in neigh_parms_alloc() should be freed.

Fixes: 20e61da7ff ("ipv4: fail early when creating netdev named all or default")
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30 17:48:56 -07:00
Jia He
8692cefc43 virtio_vsock: Fix race condition in virtio_transport_recv_pkt
When client on the host tries to connect(SOCK_STREAM, O_NONBLOCK) to the
server on the guest, there will be a panic on a ThunderX2 (armv8a server):

[  463.718844] Unable to handle kernel NULL pointer dereference at virtual address 0000000000000000
[  463.718848] Mem abort info:
[  463.718849]   ESR = 0x96000044
[  463.718852]   EC = 0x25: DABT (current EL), IL = 32 bits
[  463.718853]   SET = 0, FnV = 0
[  463.718854]   EA = 0, S1PTW = 0
[  463.718855] Data abort info:
[  463.718856]   ISV = 0, ISS = 0x00000044
[  463.718857]   CM = 0, WnR = 1
[  463.718859] user pgtable: 4k pages, 48-bit VAs, pgdp=0000008f6f6e9000
[  463.718861] [0000000000000000] pgd=0000000000000000
[  463.718866] Internal error: Oops: 96000044 [#1] SMP
[...]
[  463.718977] CPU: 213 PID: 5040 Comm: vhost-5032 Tainted: G           O      5.7.0-rc7+ #139
[  463.718980] Hardware name: GIGABYTE R281-T91-00/MT91-FS1-00, BIOS F06 09/25/2018
[  463.718982] pstate: 60400009 (nZCv daif +PAN -UAO)
[  463.718995] pc : virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common]
[  463.718999] lr : virtio_transport_recv_pkt+0x1fc/0xd40 [vmw_vsock_virtio_transport_common]
[  463.719000] sp : ffff80002dbe3c40
[...]
[  463.719025] Call trace:
[  463.719030]  virtio_transport_recv_pkt+0x4c8/0xd40 [vmw_vsock_virtio_transport_common]
[  463.719034]  vhost_vsock_handle_tx_kick+0x360/0x408 [vhost_vsock]
[  463.719041]  vhost_worker+0x100/0x1a0 [vhost]
[  463.719048]  kthread+0x128/0x130
[  463.719052]  ret_from_fork+0x10/0x18

The race condition is as follows:
Task1                                Task2
=====                                =====
__sock_release                       virtio_transport_recv_pkt
  __vsock_release                      vsock_find_bound_socket (found sk)
    lock_sock_nested
    vsock_remove_sock
    sock_orphan
      sk_set_socket(sk, NULL)
    sk->sk_shutdown = SHUTDOWN_MASK
    ...
    release_sock
                                    lock_sock
                                       virtio_transport_recv_connecting
                                         sk->sk_socket->state (panic!)

The root cause is that vsock_find_bound_socket can't hold the lock_sock,
so there is a small race window between vsock_find_bound_socket() and
lock_sock(). If __vsock_release() is running in another task,
sk->sk_socket will be set to NULL inadvertently.

This fixes it by checking sk->sk_shutdown(suggested by Stefano) after
lock_sock since sk->sk_shutdown is set to SHUTDOWN_MASK under the
protection of lock_sock_nested.

Signed-off-by: Jia He <justin.he@arm.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-30 17:44:01 -07:00
Eli Cohen
86ae579cef net: Make mpls_entry_encode() available for generic users
Move mpls_entry_encode() from net/mpls/internal.h to include/net/mpls.h
and make it available for other users. Specifically, hardware driver that
offload MPLS can benefit from that.

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Eli Cohen <eli@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-29 21:20:20 -07:00
Florian Westphal
bc183dec08 tcp: tcp_init_buffer_space can be static
As of commit 98fa6271cf
("tcp: refactor setting the initial congestion window") this is called
only from tcp_input.c, so it can be static.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29 17:31:10 -07:00
Andrew Lunn
fd55199d3b net: ethtool: cabletest: Make ethnl_act_cable_test_tdr_cfg static
kbuild test robot is reporting:
net/ethtool/cabletest.c:230:5: warning: no previous prototype for

Mark the function as static.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29 17:28:30 -07:00
YueHaibing
8298a419a0 tipc: remove set but not used variable 'prev'
Fixes gcc '-Wunused-but-set-variable' warning:

net/tipc/msg.c: In function 'tipc_msg_append':
net/tipc/msg.c:215:24: warning:
 variable 'prev' set but not used [-Wunused-but-set-variable]

commit 0a3e060f34 ("tipc: add test for Nagle algorithm effectiveness")
left behind this, remove it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29 16:59:04 -07:00
Hangbin Liu
96d10d5b19 neigh: fix ARP retransmit timer guard
In commit 19e16d220f ("neigh: support smaller retrans_time settting")
we add more accurate control for ARP and NS. But for ARP I forgot to
update the latest guard in neigh_timer_handler(), then the next
retransmit would be reset to jiffies + HZ/2 if we set the retrans_time
less than 500ms. Fix it by setting the time_before() check to HZ/100.

IPv6 does not have this issue.

Reported-by: Jianwen Ji <jiji@redhat.com>
Fixes: 19e16d220f ("neigh: support smaller retrans_time settting")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29 16:56:53 -07:00
Vladimir Oltean
04198499b2 net: dsa: tag_8021q: stop restoring VLANs from bridge
Right now, our only tag_8021q user, sja1105, has the ability to restore
bridge VLANs on its own, so this logic is unnecessary.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29 16:45:46 -07:00
David S. Miller
f9e0ce3ddc Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-05-29

The following pull-request contains BPF updates for your *net* tree.

We've added 6 non-merge commits during the last 7 day(s) which contain
a total of 4 files changed, 55 insertions(+), 34 deletions(-).

The main changes are:

1) minor verifier fix for fmod_ret progs, from Alexei.

2) af_xdp overflow check, from Bjorn.

3) minor verifier fix for 32bit assignment, from John.

4) powerpc has non-overlapping addr space, from Petr.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29 15:59:08 -07:00
Christoph Hellwig
5a892ff2fa net: remove kernel_setsockopt
No users left.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29 13:10:39 -07:00
Christoph Hellwig
c0425a4249 net: add a new bind_add method
The SCTP protocol allows to bind multiple address to a socket.  That
feature is currently only exposed as a socket option.  Add a bind_add
method struct proto that allows to bind additional addresses, and
switch the dlm code to use the method instead of going through the
socket option from kernel space.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29 13:10:39 -07:00
Christoph Hellwig
05bfd36614 sctp: refactor sctp_setsockopt_bindx
Split out a sctp_setsockopt_bindx_kernel that takes a kernel pointer
to the sockaddr and make sctp_setsockopt_bindx a small wrapper around
it.  This prepares for adding a new bind_add proto op.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29 13:10:39 -07:00
David S. Miller
942110fdf2 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2020-05-29

1) Several fixes for ESP gro/gso in transport and beet mode when
   IPv6 extension headers are present. From Xin Long.

2) Fix a wrong comment on XFRMA_OFFLOAD_DEV.
   From Antony Antony.

3) Fix sk_destruct callback handling on ESP in TCP encapsulation.
   From Sabrina Dubroca.

4) Fix a use after free in xfrm_output_gso when used with vxlan.
   From Xin Long.

5) Fix secpath handling of VTI when used wiuth IPCOMP.
   From Xin Long.

6) Fix an oops when deleting a x-netns xfrm interface.
   From Nicolas Dichtel.

7) Fix a possible warning on policy updates. We had a case where it was
   possible to add two policies with the same lookup keys.
   From Xin Long.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29 13:05:56 -07:00
David S. Miller
f26e9b2c0b Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2020-05-29

1) Add IPv6 encapsulation support for ESP over UDP and TCP.
   From Sabrina Dubroca.

2) Remove unneeded reference when initializing xfrm interfaces.
   From Nicolas Dichtel.

3) Remove some indirect calls from the state_afinfo.
   From Florian Westphal.

Please note that this pull request has two merge conflicts

between commit:

  0c922a4850 ("xfrm: Always set XFRM_TRANSFORMED in xfrm{4,6}_output_finish")

  from Linus' tree and commit:

    2ab6096db2 ("xfrm: remove output_finish indirection from xfrm_state_afinfo")

    from the ipsec-next tree.

and between commit:

  3986912f6a ("ipv6: move SIOCADDRT and SIOCDELRT handling into ->compat_ioctl")

  from the net-next tree and commit:

    0146dca70b ("xfrm: add support for UDPv6 encapsulation of ESP")

    from the ipsec-next tree.

Both conflicts can be resolved as done in linux-next.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-29 13:02:33 -07:00
Sebastian Andrzej Siewior
e6da0edc24 Bluetooth: Acquire sk_lock.slock without disabling interrupts
There was a lockdep which led to commit
   fad003b6c8 ("Bluetooth: Fix inconsistent lock state with RFCOMM")

Lockdep noticed that `sk->sk_lock.slock' was acquired without disabling
the softirq while the lock was also used in softirq context.
Unfortunately the solution back then was to disable interrupts before
acquiring the lock which however made lockdep happy.
It would have been enough to simply disable the softirq. Disabling
interrupts before acquiring a spinlock_t is not allowed on PREEMPT_RT
because these locks are converted to 'sleeping' spinlocks.

Use spin_lock_bh() in order to acquire the `sk_lock.slock'.

Reported-by: Luis Claudio R. Goncalves <lclaudio@uudg.org>
Reported-by: kbuild test robot <lkp@intel.com> [missing unlock]
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-29 13:48:46 +02:00
Xin Long
f6a23d85d0 xfrm: fix a NULL-ptr deref in xfrm_local_error
This patch is to fix a crash:

  [ ] kasan: GPF could be caused by NULL-ptr deref or user memory access
  [ ] general protection fault: 0000 [#1] SMP KASAN PTI
  [ ] RIP: 0010:ipv6_local_error+0xac/0x7a0
  [ ] Call Trace:
  [ ]  xfrm6_local_error+0x1eb/0x300
  [ ]  xfrm_local_error+0x95/0x130
  [ ]  __xfrm6_output+0x65f/0xb50
  [ ]  xfrm6_output+0x106/0x46f
  [ ]  udp_tunnel6_xmit_skb+0x618/0xbf0 [ip6_udp_tunnel]
  [ ]  vxlan_xmit_one+0xbc6/0x2c60 [vxlan]
  [ ]  vxlan_xmit+0x6a0/0x4276 [vxlan]
  [ ]  dev_hard_start_xmit+0x165/0x820
  [ ]  __dev_queue_xmit+0x1ff0/0x2b90
  [ ]  ip_finish_output2+0xd3e/0x1480
  [ ]  ip_do_fragment+0x182d/0x2210
  [ ]  ip_output+0x1d0/0x510
  [ ]  ip_send_skb+0x37/0xa0
  [ ]  raw_sendmsg+0x1b4c/0x2b80
  [ ]  sock_sendmsg+0xc0/0x110

This occurred when sending a v4 skb over vxlan6 over ipsec, in which case
skb->protocol == htons(ETH_P_IPV6) while skb->sk->sk_family == AF_INET in
xfrm_local_error(). Then it will go to xfrm6_local_error() where it tries
to get ipv6 info from a ipv4 sk.

This issue was actually fixed by Commit 628e341f31 ("xfrm: make local
error reporting more robust"), but brought back by Commit 844d48746e
("xfrm: choose protocol family by skb protocol").

So to fix it, we should call xfrm6_local_error() only when skb->protocol
is htons(ETH_P_IPV6) and skb->sk->sk_family is AF_INET6.

Fixes: 844d48746e ("xfrm: choose protocol family by skb protocol")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-29 12:10:22 +02:00
Xiyu Yang
037e910b52 SUNRPC: Remove unreachable error condition in rpcb_getport_async()
rpcb_getport_async() invokes rpcb_call_async(), which return the value
of rpc_run_task() to "child". Since rpc_run_task() is impossible to
return an ERR pointer, there is no need to add the IS_ERR() condition on
"child" here. So we need to remove it.

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-05-28 18:15:00 -04:00
NeilBrown
24c5efe41c sunrpc: clean up properly in gss_mech_unregister()
gss_mech_register() calls svcauth_gss_register_pseudoflavor() for each
flavour, but gss_mech_unregister() does not call auth_domain_put().
This is unbalanced and makes it impossible to reload the module.

Change svcauth_gss_register_pseudoflavor() to return the registered
auth_domain, and save it for later release.

Cc: stable@vger.kernel.org (v2.6.12+)
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206651
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-05-28 18:15:00 -04:00
NeilBrown
d47a5dc288 sunrpc: svcauth_gss_register_pseudoflavor must reject duplicate registrations.
There is no valid case for supporting duplicate pseudoflavor
registrations.
Currently the silent acceptance of such registrations is hiding a bug.
The rpcsec_gss_krb5 module registers 2 flavours but does not unregister
them, so if you load, unload, reload the module, it will happily
continue to use the old registration which now has pointers to the
memory were the module was originally loaded.  This could lead to
unexpected results.

So disallow duplicate registrations.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206651
Cc: stable@vger.kernel.org (v2.6.12+)
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-05-28 18:15:00 -04:00
NeilBrown
f45db2b909 sunrpc: check that domain table is empty at module unload.
The domain table should be empty at module unload.  If it isn't there is
a bug somewhere.  So check and report.

Link: https://bugzilla.kernel.org/show_bug.cgi?id=206651
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-05-28 18:15:00 -04:00
Jonas Falkevik
45ebf73ebc sctp: check assoc before SCTP_ADDR_{MADE_PRIM, ADDED} event
Make sure SCTP_ADDR_{MADE_PRIM,ADDED} are sent only for associations
that have been established.

These events are described in rfc6458#section-6.1
SCTP_PEER_ADDR_CHANGE:
This tag indicates that an address that is
part of an existing association has experienced a change of
state (e.g., a failure or return to service of the reachability
of an endpoint via a specific transport address).

Signed-off-by: Jonas Falkevik <jonas.falkevik@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 12:47:02 -07:00
Christoph Hellwig
095ae61253 tipc: call tsk_set_importance from tipc_topsrv_create_listener
Avoid using kernel_setsockopt for the TIPC_IMPORTANCE option when we can
just use the internal helper.  The only change needed is to pass a struct
sock instead of tipc_sock, which is private to socket.c

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:46 -07:00
Christoph Hellwig
298cd88a66 rxrpc: add rxrpc_sock_set_min_security_level
Add a helper to directly set the RXRPC_MIN_SECURITY_LEVEL sockopt from
kernel space without going through a fake uaccess.

Thanks to David Howells for the documentation updates.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:46 -07:00
Christoph Hellwig
7d7207c2d5 ipv6: add ip6_sock_set_recvpktinfo
Add a helper to directly set the IPV6_RECVPKTINFO sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:46 -07:00
Christoph Hellwig
18d5ad6232 ipv6: add ip6_sock_set_addr_preferences
Add a helper to directly set the IPV6_ADD_PREFERENCES sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:46 -07:00
Christoph Hellwig
fce934949c ipv6: add ip6_sock_set_recverr
Add a helper to directly set the IPV6_RECVERR sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
9b115749ac ipv6: add ip6_sock_set_v6only
Add a helper to directly set the IPV6_V6ONLY sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
c1f9ec5776 ipv4: add ip_sock_set_pktinfo
Add a helper to directly set the IP_PKTINFO sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
2de569bda2 ipv4: add ip_sock_set_mtu_discover
Add a helper to directly set the IP_MTU_DISCOVER sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com> [rxrpc bits]
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
db45c0ef25 ipv4: add ip_sock_set_recverr
Add a helper to directly set the IP_RECVERR sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
c4e446bf5a ipv4: add ip_sock_set_freebind
Add a helper to directly set the IP_FREEBIND sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
6ebf71bab9 ipv4: add ip_sock_set_tos
Add a helper to directly set the IP_TOS sockopt from kernel space without
going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
480aeb9639 tcp: add tcp_sock_set_keepcnt
Add a helper to directly set the TCP_KEEPCNT sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
d41ecaac90 tcp: add tcp_sock_set_keepintvl
Add a helper to directly set the TCP_KEEPINTVL sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
71c48eb81c tcp: add tcp_sock_set_keepidle
Add a helper to directly set the TCP_KEEP_IDLE sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
c488aeadcb tcp: add tcp_sock_set_user_timeout
Add a helper to directly set the TCP_USER_TIMEOUT sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
557eadfcc5 tcp: add tcp_sock_set_syncnt
Add a helper to directly set the TCP_SYNCNT sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
ddd061b8da tcp: add tcp_sock_set_quickack
Add a helper to directly set the TCP_QUICKACK sockopt from kernel space
without going through a fake uaccess.  Cleanup the callers to avoid
pointless wrappers now that this is a simple function call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
12abc5ee78 tcp: add tcp_sock_set_nodelay
Add a helper to directly set the TCP_NODELAY sockopt from kernel space
without going through a fake uaccess.  Cleanup the callers to avoid
pointless wrappers now that this is a simple function call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Acked-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
db10538a4b tcp: add tcp_sock_set_cork
Add a helper to directly set the TCP_CORK sockopt from kernel space
without going through a fake uaccess.  Cleanup the callers to avoid
pointless wrappers now that this is a simple function call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
fe31a326a4 net: add sock_set_reuseport
Add a helper to directly set the SO_REUSEPORT sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
26cfabf9cd net: add sock_set_rcvbuf
Add a helper to directly set the SO_RCVBUFFORCE sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
ce3d9544ce net: add sock_set_keepalive
Add a helper to directly set the SO_KEEPALIVE sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
783da70e83 net: add sock_enable_timestamps
Add a helper to directly enable timestamps instead of setting the
SO_TIMESTAMP* sockopts from kernel space and going through a fake
uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
7594888c78 net: add sock_bindtoindex
Add a helper to directly set the SO_BINDTOIFINDEX sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
76ee0785f4 net: add sock_set_sndtimeo
Add a helper to directly set the SO_SNDTIMEO_NEW sockopt from kernel
space without going through a fake uaccess.  The interface is
simplified to only pass the seconds value, as that is the only
thing needed at the moment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
6e43496745 net: add sock_set_priority
Add a helper to directly set the SO_PRIORITY sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
c433594c07 net: add sock_no_linger
Add a helper to directly set the SO_LINGER sockopt from kernel space
with onoff set to true and a linger time of 0 without going through a
fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
b58f0e8f38 net: add sock_set_reuseaddr
Add a helper to directly set the SO_REUSEADDR sockopt from kernel space
without going through a fake uaccess.

For this the iscsi target now has to formally depend on inet to avoid
a mostly theoretical compile failure.  For actual operation it already
did depend on having ipv4 or ipv6 support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Eric Dumazet
d29245692a tcp: ipv6: support RFC 6069 (TCP-LD)
Make tcp_ld_RTO_revert() helper available to IPv6, and
implement RFC 6069 :

Quoting this RFC :

3. Connectivity Disruption Indication

   For Internet Protocol version 6 (IPv6) [RFC2460], the counterpart of
   the ICMP destination unreachable message of code 0 (net unreachable)
   and of code 1 (host unreachable) is the ICMPv6 destination
   unreachable message of code 0 (no route to destination) [RFC4443].
   As with IPv4, a router should generate an ICMPv6 destination
   unreachable message of code 0 in response to a packet that cannot be
   delivered to its destination address because it lacks a matching
   entry in its routing table.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:02:46 -07:00
David S. Miller
2200313aa0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Uninitialized when used in __nf_conntrack_update(), from
   Nathan Chancellor.

2) Comparison of unsigned expression in nf_confirm_cthelper().

3) Remove 'const' type qualifier with no effect.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 10:54:18 -07:00
Markus Theil
a7528198ad mac80211: support control port TX status reporting
Add support for TX status reporting for the control port
TX API; this will be used by hostapd when it moves to the
control port TX API.

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20200527160334.19224-1-markus.theil@tu-ilmenau.de
[fix commit message, it was referring to nl80211]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-28 09:02:14 +02:00
Christoph Hellwig
7a15b2e013 net: remove kernel_getsockopt
No users left.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:11:33 -07:00
Vladimir Oltean
2b86cb8299 net: dsa: declare lockless TX feature for slave ports
Be there a platform with the following layout:

      Regular NIC
       |
       +----> DSA master for switch port
               |
               +----> DSA master for another switch port

After changing DSA back to static lockdep class keys in commit
1a33e10e4a ("net: partially revert dynamic lockdep key changes"), this
kernel splat can be seen:

[   13.361198] ============================================
[   13.366524] WARNING: possible recursive locking detected
[   13.371851] 5.7.0-rc4-02121-gc32a05ecd7af-dirty #988 Not tainted
[   13.377874] --------------------------------------------
[   13.383201] swapper/0/0 is trying to acquire lock:
[   13.388004] ffff0000668ff298 (&dsa_slave_netdev_xmit_lock_key){+.-.}-{2:2}, at: __dev_queue_xmit+0x84c/0xbe0
[   13.397879]
[   13.397879] but task is already holding lock:
[   13.403727] ffff0000661a1698 (&dsa_slave_netdev_xmit_lock_key){+.-.}-{2:2}, at: __dev_queue_xmit+0x84c/0xbe0
[   13.413593]
[   13.413593] other info that might help us debug this:
[   13.420140]  Possible unsafe locking scenario:
[   13.420140]
[   13.426075]        CPU0
[   13.428523]        ----
[   13.430969]   lock(&dsa_slave_netdev_xmit_lock_key);
[   13.435946]   lock(&dsa_slave_netdev_xmit_lock_key);
[   13.440924]
[   13.440924]  *** DEADLOCK ***
[   13.440924]
[   13.446860]  May be due to missing lock nesting notation
[   13.446860]
[   13.453668] 6 locks held by swapper/0/0:
[   13.457598]  #0: ffff800010003de0 ((&idev->mc_ifc_timer)){+.-.}-{0:0}, at: call_timer_fn+0x0/0x400
[   13.466593]  #1: ffffd4d3fb478700 (rcu_read_lock){....}-{1:2}, at: mld_sendpack+0x0/0x560
[   13.474803]  #2: ffffd4d3fb478728 (rcu_read_lock_bh){....}-{1:2}, at: ip6_finish_output2+0x64/0xb10
[   13.483886]  #3: ffffd4d3fb478728 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x6c/0xbe0
[   13.492793]  #4: ffff0000661a1698 (&dsa_slave_netdev_xmit_lock_key){+.-.}-{2:2}, at: __dev_queue_xmit+0x84c/0xbe0
[   13.503094]  #5: ffffd4d3fb478728 (rcu_read_lock_bh){....}-{1:2}, at: __dev_queue_xmit+0x6c/0xbe0
[   13.512000]
[   13.512000] stack backtrace:
[   13.516369] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.7.0-rc4-02121-gc32a05ecd7af-dirty #988
[   13.530421] Call trace:
[   13.532871]  dump_backtrace+0x0/0x1d8
[   13.536539]  show_stack+0x24/0x30
[   13.539862]  dump_stack+0xe8/0x150
[   13.543271]  __lock_acquire+0x1030/0x1678
[   13.547290]  lock_acquire+0xf8/0x458
[   13.550873]  _raw_spin_lock+0x44/0x58
[   13.554543]  __dev_queue_xmit+0x84c/0xbe0
[   13.558562]  dev_queue_xmit+0x24/0x30
[   13.562232]  dsa_slave_xmit+0xe0/0x128
[   13.565988]  dev_hard_start_xmit+0xf4/0x448
[   13.570182]  __dev_queue_xmit+0x808/0xbe0
[   13.574200]  dev_queue_xmit+0x24/0x30
[   13.577869]  neigh_resolve_output+0x15c/0x220
[   13.582237]  ip6_finish_output2+0x244/0xb10
[   13.586430]  __ip6_finish_output+0x1dc/0x298
[   13.590709]  ip6_output+0x84/0x358
[   13.594116]  mld_sendpack+0x2bc/0x560
[   13.597786]  mld_ifc_timer_expire+0x210/0x390
[   13.602153]  call_timer_fn+0xcc/0x400
[   13.605822]  run_timer_softirq+0x588/0x6e0
[   13.609927]  __do_softirq+0x118/0x590
[   13.613597]  irq_exit+0x13c/0x148
[   13.616918]  __handle_domain_irq+0x6c/0xc0
[   13.621023]  gic_handle_irq+0x6c/0x160
[   13.624779]  el1_irq+0xbc/0x180
[   13.627927]  cpuidle_enter_state+0xb4/0x4d0
[   13.632120]  cpuidle_enter+0x3c/0x50
[   13.635703]  call_cpuidle+0x44/0x78
[   13.639199]  do_idle+0x228/0x2c8
[   13.642433]  cpu_startup_entry+0x2c/0x48
[   13.646363]  rest_init+0x1ac/0x280
[   13.649773]  arch_call_rest_init+0x14/0x1c
[   13.653878]  start_kernel+0x490/0x4bc

Lockdep keys themselves were added in commit ab92d68fc2 ("net: core:
add generic lockdep keys"), and it's very likely that this splat existed
since then, but I have no real way to check, since this stacked platform
wasn't supported by mainline back then.

>From Taehee's own words:

  This patch was considered that all stackable devices have LLTX flag.
  But the dsa doesn't have LLTX, so this splat happened.
  After this patch, dsa shares the same lockdep class key.
  On the nested dsa interface architecture, which you illustrated,
  the same lockdep class key will be used in __dev_queue_xmit() because
  dsa doesn't have LLTX.
  So that lockdep detects deadlock because the same lockdep class key is
  used recursively although actually the different locks are used.
  There are some ways to fix this problem.

  1. using NETIF_F_LLTX flag.
  If possible, using the LLTX flag is a very clear way for it.
  But I'm so sorry I don't know whether the dsa could have LLTX or not.

  2. using dynamic lockdep again.
  It means that each interface uses a separate lockdep class key.
  So, lockdep will not detect recursive locking.
  But this way has a problem that it could consume lockdep class key
  too many.
  Currently, lockdep can have 8192 lockdep class keys.
   - you can see this number with the following command.
     cat /proc/lockdep_stats
     lock-classes:                         1251 [max: 8192]
     ...
     The [max: 8192] means that the maximum number of lockdep class keys.
  If too many lockdep class keys are registered, lockdep stops to work.
  So, using a dynamic(separated) lockdep class key should be considered
  carefully.
  In addition, updating lockdep class key routine might have to be existing.
  (lockdep_register_key(), lockdep_set_class(), lockdep_unregister_key())

  3. Using lockdep subclass.
  A lockdep class key could have 8 subclasses.
  The different subclass is considered different locks by lockdep
  infrastructure.
  But "lock-classes" is not counted by subclasses.
  So, it could avoid stopping lockdep infrastructure by an overflow of
  lockdep class keys.
  This approach should also have an updating lockdep class key routine.
  (lockdep_set_subclass())

  4. Using nonvalidate lockdep class key.
  The lockdep infrastructure supports nonvalidate lockdep class key type.
  It means this lockdep is not validated by lockdep infrastructure.
  So, the splat will not happen but lockdep couldn't detect real deadlock
  case because lockdep really doesn't validate it.
  I think this should be used for really special cases.
  (lockdep_set_novalidate_class())

Further discussion here:
https://patchwork.ozlabs.org/project/netdev/patch/20200503052220.4536-2-xiyou.wangcong@gmail.com/

There appears to be no negative side-effect to declaring lockless TX for
the DSA virtual interfaces, which means they handle their own locking.
So that's what we do to make the splat go away.

Patch tested in a wide variety of cases: unicast, multicast, PTP, etc.

Fixes: ab92d68fc2 ("net: core: add generic lockdep keys")
Suggested-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:09:28 -07:00
Jonas Falkevik
50ce4c099b sctp: fix typo sctp_ulpevent_nofity_peer_addr_change
change typo in function name "nofity" to "notify"
sctp_ulpevent_nofity_peer_addr_change ->
sctp_ulpevent_notify_peer_addr_change

Signed-off-by: Jonas Falkevik <jonas.falkevik@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:08:02 -07:00
Tariq Toukan
b3ae2459f8 net/tls: Add force_resync for driver resync
This patch adds a field to the tls rx offload context which enables
drivers to force a send_resync call.

This field can be used by drivers to request a resync at the next
possible tls record. It is beneficial for hardware that provides the
resync sequence number asynchronously. In such cases, the packet that
triggered the resync does not contain the information required for a
resync. Instead, the driver requests resync for all the following
TLS record until the asynchronous notification with the resync request
TCP sequence arrives.

A following series for mlx5e ConnectX-6DX TLS RX offload support will
use this mechanism.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:07:13 -07:00
Cong Wang
759ae57f1b net_sched: get rid of unnecessary dev_qdisc_reset()
Resetting old qdisc on dev_queue->qdisc_sleeping in
dev_qdisc_reset() is redundant, because this qdisc,
even if not same with dev_queue->qdisc, is reset via
qdisc_put() right after calling dev_graft_qdisc() when
hitting refcnt 0.

This is very easy to observe with qdisc_reset() tracepoint
and stack traces.

Reported-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Tested-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:05:49 -07:00
Cong Wang
70f5096533 net_sched: avoid resetting active qdisc for multiple times
Except for sch_mq and sch_mqprio, each dev queue points to the
same root qdisc, so when we reset the dev queues with
netdev_for_each_tx_queue() we end up resetting the same instance
of the root qdisc for multiple times.

Avoid this by checking the __QDISC_STATE_DEACTIVATED bit in
each iteration, so for sch_mq/sch_mqprio, we still reset all
of them like before, for the rest, we only reset it once.

Reported-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Tested-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:05:49 -07:00
Cong Wang
f5a7833e83 net_sched: add a tracepoint for qdisc creation
With this tracepoint, we could know when qdisc's are created,
especially those default qdisc's.

Sample output:

  tc-736   [001] ...1    56.230107: qdisc_create: dev=ens3 kind=pfifo parent=1:0
  tc-736   [001] ...1    56.230113: qdisc_create: dev=ens3 kind=hfsc parent=ffff:ffff
  tc-738   [001] ...1    56.256816: qdisc_create: dev=ens3 kind=pfifo parent=1:100
  tc-739   [001] ...1    56.267584: qdisc_create: dev=ens3 kind=pfifo parent=1:200
  tc-740   [001] ...1    56.279649: qdisc_create: dev=ens3 kind=fq_codel parent=1:100
  tc-741   [001] ...1    56.289996: qdisc_create: dev=ens3 kind=pfifo_fast parent=1:200
  tc-745   [000] .N.1   111.687483: qdisc_create: dev=ens3 kind=ingress parent=ffff:fff1

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:05:49 -07:00
Cong Wang
a34dac0b90 net_sched: add tracepoints for qdisc_reset() and qdisc_destroy()
Add two tracepoints for qdisc_reset() and qdisc_destroy() to track
qdisc resetting and destroying.

Sample output:

  tc-756   [000] ...3   138.355662: qdisc_reset: dev=ens3 kind=pfifo_fast parent=ffff:ffff handle=0:0
  tc-756   [000] ...1   138.355720: qdisc_reset: dev=ens3 kind=pfifo_fast parent=ffff:ffff handle=0:0
  tc-756   [000] ...1   138.355867: qdisc_reset: dev=ens3 kind=pfifo_fast parent=ffff:ffff handle=0:0
  tc-756   [000] ...1   138.355930: qdisc_destroy: dev=ens3 kind=pfifo_fast parent=ffff:ffff handle=0:0
  tc-757   [000] ...2   143.073780: qdisc_reset: dev=ens3 kind=fq_codel parent=ffff:ffff handle=8001:0
  tc-757   [000] ...1   143.073878: qdisc_reset: dev=ens3 kind=fq_codel parent=ffff:ffff handle=8001:0
  tc-757   [000] ...1   143.074114: qdisc_reset: dev=ens3 kind=fq_codel parent=ffff:ffff handle=8001:0
  tc-757   [000] ...1   143.074228: qdisc_destroy: dev=ens3 kind=fq_codel parent=ffff:ffff handle=8001:0

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:05:49 -07:00
Cong Wang
4909daba37 net_sched: use qdisc_reset() in qdisc_destroy()
qdisc_destroy() calls ops->reset() and cleans up qdisc->gso_skb
and qdisc->skb_bad_txq, these are nearly same with qdisc_reset(),
so just call it directly, and cosolidate the code for the next
patch.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:05:49 -07:00
Eric Dumazet
a12daf13a4 tcp: rename tcp_v4_err() skb parameter
This essentially reverts 4d1a2d9ec1 ("Revert Backoff [v3]:
Rename skb to icmp_skb in tcp_v4_err()")

Now we have tcp_ld_RTO_revert() helper, we can use the usual
name for sk_buff parameter, so that tcp_v4_err() and
tcp_v6_err() use similar names.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 14:57:27 -07:00
Eric Dumazet
f745664257 tcp: add tcp_ld_RTO_revert() helper
RFC 6069 logic has been implemented for IPv4 only so far,
right in the middle of tcp_v4_err() and was error prone.

Move this code to one helper, to make tcp_v4_err() more
readable and to eventually expand RFC 6069 to IPv6 in
the future.

Also perform sock_owned_by_user() check a bit sooner.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 14:57:27 -07:00
Pablo Neira Ayuso
5b6743fb2c netfilter: nf_tables: skip flowtable hooknum and priority on device updates
On device updates, the hooknum and priority attributes are not required.
This patch makes optional these two netlink attributes.

Moreover, bail out with EOPNOTSUPP if userspace tries to update the
hooknum and priority for existing flowtables.

While at this, turn EINVAL into EOPNOTSUPP in case the hooknum is not
ingress. EINVAL is reserved for missing netlink attribute / malformed
netlink messages.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:35 +02:00
Pablo Neira Ayuso
05abe4456f netfilter: nf_tables: allow to register flowtable with no devices
A flowtable might be composed of dynamic interfaces only. Such dynamic
interfaces might show up at a later stage. This patch allows users to
register a flowtable with no devices. Once the dynamic interface becomes
available, the user adds the dynamic devices to the flowtable.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Pablo Neira Ayuso
abadb2f865 netfilter: nf_tables: delete devices from flowtable
This patch allows users to delete devices from existing flowtables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Pablo Neira Ayuso
78d9f48f7f netfilter: nf_tables: add devices to existing flowtable
This patch allows users to add devices to an existing flowtable.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Pablo Neira Ayuso
c42d8bda69 netfilter: nf_tables: pass hook list to flowtable event notifier
Update the flowtable netlink notifier to take the list of hooks as input.
This allows to reuse this function in incremental flowtable hook updates.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Pablo Neira Ayuso
389a2cbcb7 netfilter: nf_tables: add nft_flowtable_hooks_destroy()
This patch adds a helper function destroy the flowtable hooks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Pablo Neira Ayuso
f9382669cf netfilter: nf_tables: pass hook list to nft_{un,}register_flowtable_net_hooks()
This patch prepares for incremental flowtable hook updates.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Pablo Neira Ayuso
d9246a5375 netfilter: nf_tables: generalise flowtable hook parsing
Update nft_flowtable_parse_hook() to take the flowtable hook list as
parameter. This allows to reuse this function to update the hooks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Romain Bellan
cb8aa9a3af netfilter: ctnetlink: add kernel side filtering for dump
Conntrack dump does not support kernel side filtering (only get exists,
but it returns only one entry. And user has to give a full valid tuple)

It means that userspace has to implement filtering after receiving many
irrelevant entries, consuming resources (conntrack table is sometimes
very huge, much more than a routing table for example).

This patch adds filtering in kernel side. To achieve this goal, we:

 * Add a new CTA_FILTER netlink attributes, actually a flag list to
   parametize filtering
 * Convert some *nlattr_to_tuple() functions, to allow a partial parsing
   of CTA_TUPLE_ORIG and CTA_TUPLE_REPLY (so nf_conntrack_tuple it not
   fully set)

Filtering is now possible on:
 * IP SRC/DST values
 * Ports for TCP and UDP flows
 * IMCP(v6) codes types and IDs

Filtering is done as an "AND" operator. For example, when flags
PROTO_SRC_PORT, PROTO_NUM and IP_SRC are sets, only entries matching all
values are dumped.

Changes since v1:
  Set NLM_F_DUMP_FILTERED in nlm flags if entries are filtered

Changes since v2:
  Move several constants to nf_internals.h
  Move a fix on netlink values check in a separate patch
  Add a check on not-supported flags
  Return EOPNOTSUPP if CDA_FILTER is set in ctnetlink_flush_conntrack
  (not yet implemented)
  Code style issues

Changes since v3:
  Fix compilation warning reported by kbuild test robot

Changes since v4:
  Fix a regression introduced in v3 (returned EINVAL for valid netlink
  messages without CTA_MARK)

Changes since v5:
  Change definition of CTA_FILTER_F_ALL
  Fix a regression when CTA_TUPLE_ZONE is not set

Signed-off-by: Romain Bellan <romain.bellan@wifirst.fr>
Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Leon Romanovsky
8094ba0ace RDMA/cma: Provide ECE reject reason
IBTA declares "vendor option not supported" reject reason in REJ messages
if passive side doesn't want to accept proposed ECE options.

Due to the fact that ECE is managed by userspace, there is a need to let
users to provide such rejected reason.

Link: https://lore.kernel.org/r/20200526103304.196371-7-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-27 16:05:05 -03:00
Arnd Bergmann
b3b6a84c6a bridge: multicast: work around clang bug
Clang-10 and clang-11 run into a corner case of the register
allocator on 32-bit ARM, leading to excessive stack usage from
register spilling:

net/bridge/br_multicast.c:2422:6: error: stack frame size of 1472 bytes in function 'br_multicast_get_stats' [-Werror,-Wframe-larger-than=]

Work around this by marking one of the internal functions as
noinline_for_stack.

Link: https://bugs.llvm.org/show_bug.cgi?id=45802#c9
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 11:34:48 -07:00
Horatiu Vultur
20f6a05ef6 bridge: mrp: Rework the MRP netlink interface
This patch reworks the MRP netlink interface. Before, each attribute
represented a binary structure which made it hard to be extended.
Therefore update the MRP netlink interface such that each existing
attribute to be a nested attribute which contains the fields of the
binary structures.
In this way the MRP netlink interface can be extended without breaking
the backwards compatibility. It is also using strict checking for
attributes under the MRP top attribute.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 11:30:43 -07:00
Nathan Chancellor
d8e79f1dbc nexthop: Fix type of event_type in call_nexthop_notifiers
Clang warns:

net/ipv4/nexthop.c:841:30: warning: implicit conversion from enumeration
type 'enum nexthop_event_type' to different enumeration type 'enum
fib_event_type' [-Wenum-conversion]
        call_nexthop_notifiers(net, NEXTHOP_EVENT_DEL, nh);
        ~~~~~~~~~~~~~~~~~~~~~~      ^~~~~~~~~~~~~~~~~
1 warning generated.

Use the right type for event_type so that clang does not warn.

Fixes: 8590ceedb7 ("nexthop: add support for notifiers")
Link: https://github.com/ClangBuiltLinux/linux/issues/1038
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 11:22:37 -07:00
Stefano Garzarella
7e0afbdfd1 vsock: fix timeout in vsock_accept()
The accept(2) is an "input" socket interface, so we should use
SO_RCVTIMEO instead of SO_SNDTIMEO to set the timeout.

So this patch replace sock_sndtimeo() with sock_rcvtimeo() to
use the right timeout in the vsock_accept().

Fixes: d021c34405 ("VSOCK: Introduce VM Sockets")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 11:20:23 -07:00
Davide Caratti
bb2f930d6d net/sched: fix infinite loop in sch_fq_pie
this command hangs forever:

 # tc qdisc add dev eth0 root fq_pie flows 65536

 watchdog: BUG: soft lockup - CPU#1 stuck for 23s! [tc:1028]
 [...]
 CPU: 1 PID: 1028 Comm: tc Not tainted 5.7.0-rc6+ #167
 RIP: 0010:fq_pie_init+0x60e/0x8b7 [sch_fq_pie]
 Code: 4c 89 65 50 48 89 f8 48 c1 e8 03 42 80 3c 30 00 0f 85 2a 02 00 00 48 8d 7d 10 4c 89 65 58 48 89 f8 48 c1 e8 03 42 80 3c 30 00 <0f> 85 a7 01 00 00 48 8d 7d 18 48 c7 45 10 46 c3 23 00 48 89 f8 48
 RSP: 0018:ffff888138d67468 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
 RAX: 1ffff9200018d2b2 RBX: ffff888139c1c400 RCX: ffffffffffffffff
 RDX: 000000000000c5e8 RSI: ffffc900000e5000 RDI: ffffc90000c69590
 RBP: ffffc90000c69580 R08: fffffbfff79a9699 R09: fffffbfff79a9699
 R10: 0000000000000700 R11: fffffbfff79a9698 R12: ffffc90000c695d0
 R13: 0000000000000000 R14: dffffc0000000000 R15: 000000002347c5e8
 FS:  00007f01e1850e40(0000) GS:ffff88814c880000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 000000000067c340 CR3: 000000013864c000 CR4: 0000000000340ee0
 Call Trace:
  qdisc_create+0x3fd/0xeb0
  tc_modify_qdisc+0x3be/0x14a0
  rtnetlink_rcv_msg+0x5f3/0x920
  netlink_rcv_skb+0x121/0x350
  netlink_unicast+0x439/0x630
  netlink_sendmsg+0x714/0xbf0
  sock_sendmsg+0xe2/0x110
  ____sys_sendmsg+0x5b4/0x890
  ___sys_sendmsg+0xe9/0x160
  __sys_sendmsg+0xd3/0x170
  do_syscall_64+0x9a/0x370
  entry_SYSCALL_64_after_hwframe+0x44/0xa9

we can't accept 65536 as a valid number for 'nflows', because the loop on
'idx' in fq_pie_init() will never end. The extack message is correct, but
it doesn't say that 0 is not a valid number for 'flows': while at it, fix
this also. Add a tdc selftest to check correct validation of 'flows'.

CC: Ivan Vecera <ivecera@redhat.com>
Fixes: ec97ecf1eb ("net: sched: add Flow Queue PIE packet scheduler")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 11:11:18 -07:00
Fedor Tokarev
118917d696 net: sunrpc: Fix off-by-one issues in 'rpc_ntop6'
Fix off-by-one issues in 'rpc_ntop6':
 - 'snprintf' returns the number of characters which would have been
   written if enough space had been available, excluding the terminating
   null byte. Thus, a return value of 'sizeof(scopebuf)' means that the
   last character was dropped.
 - 'strcat' adds a terminating null byte to the string, thus if len ==
   buflen, the null byte is written past the end of the buffer.

Signed-off-by: Fedor Tokarev <ftokarev@gmail.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-05-27 10:08:26 -04:00
Pablo Neira Ayuso
4946ea5c12 netfilter: nf_conntrack_pptp: fix compilation warning with W=1 build
>> include/linux/netfilter/nf_conntrack_pptp.h:13:20: warning: 'const' type qualifier on return type has no effect [-Wignored-qualifiers]
extern const char *const pptp_msg_name(u_int16_t msg);
^~~~~~

Reported-by: kbuild test robot <lkp@intel.com>
Fixes: 4c559f15ef ("netfilter: nf_conntrack_pptp: prevent buffer overflows in debug code")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 13:39:08 +02:00
Pablo Neira Ayuso
94945ad2b3 netfilter: conntrack: comparison of unsigned in cthelper confirmation
net/netfilter/nf_conntrack_core.c: In function nf_confirm_cthelper:
net/netfilter/nf_conntrack_core.c:2117:15: warning: comparison of unsigned expression in < 0 is always false [-Wtype-limits]
 2117 |   if (protoff < 0 || (frag_off & htons(~0x7)) != 0)
      |               ^

ipv6_skip_exthdr() returns a signed integer.

Reported-by: Colin Ian King <colin.king@canonical.com>
Fixes: 703acd70f2 ("netfilter: nfnetlink_cthelper: unbreak userspace helper support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 13:39:06 +02:00
Nathan Chancellor
46c1e0621a netfilter: conntrack: Pass value of ctinfo to __nf_conntrack_update
Clang warns:

net/netfilter/nf_conntrack_core.c:2068:21: warning: variable 'ctinfo' is
uninitialized when used here [-Wuninitialized]
        nf_ct_set(skb, ct, ctinfo);
                           ^~~~~~
net/netfilter/nf_conntrack_core.c:2024:2: note: variable 'ctinfo' is
declared here
        enum ip_conntrack_info ctinfo;
        ^
1 warning generated.

nf_conntrack_update was split up into nf_conntrack_update and
__nf_conntrack_update, where the assignment of ctinfo is in
nf_conntrack_update but it is used in __nf_conntrack_update.

Pass the value of ctinfo from nf_conntrack_update to
__nf_conntrack_update so that uninitialized memory is not used
and everything works properly.

Fixes: ee04805ff5 ("netfilter: conntrack: make conntrack userspace helpers work again")
Link: https://github.com/ClangBuiltLinux/linux/issues/1039
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 13:38:58 +02:00
Jerry Lee
890bd0f899 libceph: ignore pool overlay and cache logic on redirects
OSD client should ignore cache/overlay flag if got redirect reply.
Otherwise, the client hangs when the cache tier is in forward mode.

[ idryomov: Redirects are effectively deprecated and no longer
  used or tested.  The original tiering modes based on redirects
  are inherently flawed because redirects can race and reorder,
  potentially resulting in data corruption.  The new proxy and
  readproxy tiering modes should be used instead of forward and
  readforward.  Still marking for stable as obviously correct,
  though. ]

Cc: stable@vger.kernel.org
URL: https://tracker.ceph.com/issues/23296
URL: https://tracker.ceph.com/issues/36406
Signed-off-by: Jerry Lee <leisurelysw24@gmail.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-05-27 12:43:35 +02:00
Johannes Berg
c112992433 mac80211: fix HT-Control field reception for management frames
If we receive management frames with an HT-Control field, we cannot
parse them properly, as we assume a fixed length management header.

Since we don't even need the HTC field (for these frames, or really
at all), just remove it at the beginning of RX.

Reported-by: Haggai Abramovsky <haggai.abramovsky@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200526143346.cf5ce70521c5.I333251a084ec4cfe67b7ef7efe2d2f1a33883931@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:03:27 +02:00
Patrick Steinhardt
a3b018febc cfg80211: fix CFG82011_CRDA_SUPPORT still mentioning internal regdb
Back with commit c8c240e284 (cfg80211: reg: remove support for
built-in regdb, 2015-10-15), support for using CFG80211_INTERNAL_REGDB
was removed in favor of loading the regulatory database as firmware
file. The documentation of CFG80211_CRDA_SUPPORT was not adjusted,
though, which is why it still mentions mentions the old way of loading
via the internal regulatory database.

Remove it so that the kernel option only mentions using the firmware
file.

Signed-off-by: Patrick Steinhardt <ps@pks.im>
Link: https://lore.kernel.org/r/c56e60207fbd0512029de8c6276ee00f73491924.1589732954.git.ps@pks.im
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:03:27 +02:00
Tamizh Chelvam
9a5f648862 nl80211: Add support to configure TID specific Tx rate configuration
This patch adds support to configure per TID Tx Rate configuration
through NL80211_TID_CONFIG_ATTR_TX_RATE* attributes. And it uses
nl80211_parse_tx_bitrate_mask api to validate the Tx rate mask.

Signed-off-by: Tamizh Chelvam <tamizhr@codeaurora.org>
Link: https://lore.kernel.org/r/1589357504-10175-1-git-send-email-tamizhr@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:03:25 +02:00
Johannes Berg
1ea02224af mac80211: allow SA-QUERY processing in userspace
As discussed with Mathy almost two years ago in
http://lore.kernel.org/r/20180806224857.14853-1-Mathy.Vanhoef@cs.kuleuven.be
we should let userspace process SA-QUERY frames if it
wants to, so that it can handle OCV (operating channel
validation) which mac80211 doesn't know how to.

Evidently I had been expecting Mathy to (re)send such a
patch, but he never did, perhaps expecting me to do it
after our discussion.

In any case, this came up now with OCV getting more
attention, so move the code around as discussed there
to let userspace handle it, and do it properly.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200526103131.1f9cf7e5b6db.Iae5b42b09ad2b1cbcbe13492002c43f0d1d51dfc@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:02:04 +02:00
Markus Theil
dca9ca2d58 nl80211: add ability to report TX status for control port TX
This adds the necessary capabilities in nl80211 to allow drivers to
assign a cookie to control port TX frames (returned via extack in
the netlink ACK message of the command) and then later report the
frame's status.

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20200508144202.7678-2-markus.theil@tu-ilmenau.de
[use extack cookie instead of explicit message, recombine patches]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:02:04 +02:00
Gustavo A. R. Silva
3c23215ba8 mac80211: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20200507185907.GA15102@embeddedor
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:02:04 +02:00
Gustavo A. R. Silva
396fba0a59 cfg80211: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20200507183909.GA12993@embeddedor
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:02:03 +02:00
Thomas Pedersen
2032f3b2f9 nl80211: support scan frequencies in KHz
If the driver advertises NL80211_EXT_FEATURE_SCAN_FREQ_KHZ
userspace can omit NL80211_ATTR_SCAN_FREQUENCIES in favor
of an NL80211_ATTR_SCAN_FREQ_KHZ. To get scan results in
KHz userspace must also set the
NL80211_SCAN_FLAG_FREQ_KHZ.

This lets nl80211 remain compatible with older userspaces
while not requring and sending redundant (and potentially
incorrect) scan frequency sets.

Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200430172554.18383-4-thomas@adapt-ip.com
[use just nla_nest_start() (not _noflag) for NL80211_ATTR_SCAN_FREQ_KHZ]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:02:03 +02:00
Thomas Pedersen
942ba88ba9 nl80211: add KHz frequency offset for most wifi commands
cfg80211 recently gained the ability to understand a
frequency offset component in KHz. Expose this in nl80211
through the new attributes NL80211_ATTR_WIPHY_FREQ_OFFSET,
NL80211_FREQUENCY_ATTR_OFFSET,
NL80211_ATTR_CENTER_FREQ1_OFFSET, and
NL80211_BSS_FREQUENCY_OFFSET.

These add support to send and receive a KHz offset
component with the following NL80211 commands:

- NL80211_CMD_FRAME
- NL80211_CMD_GET_SCAN
- NL80211_CMD_AUTHENTICATE
- NL80211_CMD_ASSOCIATE
- NL80211_CMD_CONNECT

Along with any other command which takes a chandef, ie:

- NL80211_CMD_SET_CHANNEL
- NL80211_CMD_SET_WIPHY
- NL80211_CMD_START_AP
- NL80211_CMD_RADAR_DETECT
- NL80211_CMD_NOTIFY_RADAR
- NL80211_CMD_CHANNEL_SWITCH
- NL80211_JOIN_IBSS
- NL80211_CMD_REMAIN_ON_CHANNEL
- NL80211_CMD_JOIN_OCB
- NL80211_CMD_JOIN_MESH
- NL80211_CMD_TDLS_CHANNEL_SWITCH

If the driver advertises a band containing channels with
frequency offset, it must also verify support for
frequency offset channels in its cfg80211 ops, or return
an error.

Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200430172554.18383-3-thomas@adapt-ip.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:02:02 +02:00
Thomas Pedersen
e76fede8bf cfg80211: add KHz variants of frame RX API
Drivers may wish to report the RX frequency in units of
KHz. Provide cfg80211_rx_mgmt_khz() and wrap it with
cfg80211_rx_mgmt() so exisiting drivers which can't report
KHz anyway don't need to change. Add a similar wrapper for
cfg80211_report_obss_beacon() so the frequency units stay
somewhat consistent.

This doesn't actually change the nl80211 API yet.

Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200430172554.18383-2-thomas@adapt-ip.com
[fix mac80211 calling the non-khz version of obss beacon report,
 drop trace point name changes]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:02:02 +02:00
Sergey Matyukevich
c03369558c nl80211: simplify peer specific TID configuration
Current rule for applying TID configuration for specific peer looks overly
complicated. No need to reject new TID configuration when override flag is
specified. Another call with the same TID configuration, but without
override flag, allows to apply new configuration anyway.

Use the same approach as for the 'all peers' case: if override flag is
specified, then reset existing TID configuration and immediately
apply a new one.

Signed-off-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20200424112905.26770-5-sergey.matyukevich.os@quantenna.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:02:02 +02:00
Sergey Matyukevich
33462e6823 cfg80211: add support for TID specific AMSDU configuration
This patch adds support to control per TID MSDU aggregation
using the NL80211_TID_CONFIG_ATTR_AMSDU_CTRL attribute.

Signed-off-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20200424112905.26770-4-sergey.matyukevich.os@quantenna.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:02:01 +02:00
Sergey Matyukevich
60c2ef0ef0 mac80211: fix variable names in TID config methods
Fix all variable names from 'tid' to 'tids' to avoid confusion.
Now this is not TID number, but TID mask.

Signed-off-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20200424112905.26770-3-sergey.matyukevich.os@quantenna.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:02:01 +02:00
Andrew Lunn
f2bc8ad31a net: ethtool: Allow PHY cable test TDR data to configured
Allow the user to configure where on the cable the TDR data should be
retrieved, in terms of first and last sample, and the step between
samples. Also add the ability to ask for TDR data for just one pair.

If this configuration is not provided, it defaults to 1-150m at 1m
intervals for all pairs.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>

v3:
Move the TDR configuration into a structure
Add a range check on step
Use NL_SET_ERR_MSG_ATTR() when appropriate
Move TDR configuration into a nest
Document attributes in the request

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 23:22:21 -07:00
Andrew Lunn
6b4a0fc106 net: ethtool: Add helpers for cable test TDR data
Add helpers for returning raw TDR helpers in netlink messages.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 23:22:20 -07:00
Andrew Lunn
1a644de29f net: ethtool: Add generic parts of cable test TDR
Add the generic parts of the code used to trigger a cable test and
return raw TDR data. Any PHY driver which support this must implement
the new driver op.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>

v2
Update nxp-tja11xx for API change.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 23:21:48 -07:00
Chris Packham
9a233d329e net: sctp: Fix spelling in Kconfig help
Change 'handeled' to 'handled' in the Kconfig help for SCTP.

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 20:32:06 -07:00
Florian Westphal
4e637c70b5 mptcp: attempt coalescing when moving skbs to mptcp rx queue
We can try to coalesce skbs we take from the subflows rx queue with the
tail of the mptcp rx queue.

If successful, the skb head can be discarded early.

We can also free the skb extensions, we do not access them after this.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 20:29:32 -07:00
Dmitry Vyukov
09d0310f07 net/smc: mark smc_pnet_policy as const
Netlink policies are generally declared as const.
This is safer and prevents potential bugs.

Signed-off-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 20:19:13 -07:00
Paolo Abeni
0a82e230c6 mptcp: avoid NULL-ptr derefence on fallback
In the MPTCP receive path we must cope with TCP fallback
on blocking recvmsg(). Currently in such code path we detect
the fallback condition, but we don't fetch the struct socket
required for fallback.

The above allowed syzkaller to trigger a NULL pointer
dereference:

general protection fault, probably for non-canonical address 0xdffffc0000000004: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000020-0x0000000000000027]
CPU: 1 PID: 7226 Comm: syz-executor523 Not tainted 5.7.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:sock_recvmsg_nosec net/socket.c:886 [inline]
RIP: 0010:sock_recvmsg+0x92/0x110 net/socket.c:904
Code: 5b 41 5c 41 5d 41 5e 41 5f 5d c3 44 89 6c 24 04 e8 53 18 1d fb 4d 8d 6f 20 4c 89 e8 48 c1 e8 03 48 b9 00 00 00 00 00 fc ff df <80> 3c 08 00 74 08 4c 89 ef e8 20 12 5b fb bd a0 00 00 00 49 03 6d
RSP: 0018:ffffc90001077b98 EFLAGS: 00010202
RAX: 0000000000000004 RBX: ffffc90001077dc0 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: ffffffff86565e59 R09: ffffed10115afeaa
R10: ffffed10115afeaa R11: 0000000000000000 R12: 1ffff9200020efbc
R13: 0000000000000020 R14: ffffc90001077de0 R15: 0000000000000000
FS:  00007fc6a3abe700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00000000004d0050 CR3: 00000000969f0000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 mptcp_recvmsg+0x18d5/0x19b0 net/mptcp/protocol.c:891
 inet_recvmsg+0xf6/0x1d0 net/ipv4/af_inet.c:838
 sock_recvmsg_nosec net/socket.c:886 [inline]
 sock_recvmsg net/socket.c:904 [inline]
 __sys_recvfrom+0x2f3/0x470 net/socket.c:2057
 __do_sys_recvfrom net/socket.c:2075 [inline]
 __se_sys_recvfrom net/socket.c:2071 [inline]
 __x64_sys_recvfrom+0xda/0xf0 net/socket.c:2071
 do_syscall_64+0xf3/0x1b0 arch/x86/entry/common.c:295
 entry_SYSCALL_64_after_hwframe+0x49/0xb3

Address the issue initializing the struct socket reference
before entering the fallback code.

Reported-and-tested-by: syzbot+c6bfc3db991edc918432@syzkaller.appspotmail.com
Suggested-by: Ondrej Mosnacek <omosnace@redhat.com>
Fixes: 8ab183deb2 ("mptcp: cope with later TCP fallback")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 20:18:24 -07:00
David S. Miller
745bd6f44c One batch of changes, containing:
* hwsim improvements from Jouni and myself, to be able to
    test more scenarios easily
  * some more HE (802.11ax) support
  * some initial S1G (sub 1 GHz) work for fractional MHz channels
  * some (action) frame registration updates to help DPP support
  * along with other various improvements/fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl7L1A8ACgkQB8qZga/f
 l8T5RQ/9EUxFqBs0cWojwyad5nkesyl51eOnbvSCJJF14W93s2oMeikCynTPe8Vg
 km36041QZqGbwmU0yWC9Lmm4y3ja5qQGI+QW+vT6tutGQx6FgK5TzUfYXqiFZqf6
 asqkvHpH4VqmbG1KEp0PZjIpW/OVK96pbvtXVnkrcMmjl2JjbRtAhyZQVNtt9ufJ
 6wqKf8e6iYqMIInMFPLX+rl7UEknxDKVcqPbMMJmY8/iM1z9Elkg3rkRSMehC+mE
 8cznZ6BsjAGCbMiA8K9fUo15lcMfZCJ1hAPzkD4TsJtMEJ0gYDo5jDB8TIpr5uoL
 95OnlF8jokJIsO+1g4CyaNSQsmFIuDo84vW8LtGRu9qzTP0UwelxhjZLgE3xlP6b
 W+z5HomxfWkYhJhaNywLP3B1VPtJwX8dL/wpECOWHzNKXG7Rb6GqzUwaCRFb6Jjo
 TmFJ5wLoEZHhsXYO2dvcyTzCUCXviXvfq60a56IyCJN8wDqmcubePv0+NOHUmj3c
 E71NTYymM3j9agdSpXdCFLBXA1OgyIydeSNHuBlaPA4sK6tr4ikUtbOrABjYTaQz
 2BB5fHEi8gs4EiHbSXqLFBot3JHljKJPsSN0wAgzQffN+6Kts9FG6HcrLsL+duDg
 lRdAzRrunE85S0QhsxeVIX216rX4W08sl0B1rJR+dTMX9ByblAk=
 =MVBJ
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-net-next-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
One batch of changes, containing:
 * hwsim improvements from Jouni and myself, to be able to
   test more scenarios easily
 * some more HE (802.11ax) support
 * some initial S1G (sub 1 GHz) work for fractional MHz channels
 * some (action) frame registration updates to help DPP support
 * along with other various improvements/fixes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 20:17:35 -07:00
David Ahern
1fd1c768f3 ipv4: nexthop version of fib_info_nh_uses_dev
Similar to the last path, need to fix fib_info_nh_uses_dev for
external nexthops to avoid referencing multiple nh_grp structs.
Move the device check in fib_info_nh_uses_dev to a helper and
create a nexthop version that is called if the fib_info uses an
external nexthop.

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:06:07 -07:00
David Ahern
af7888ad9e ipv4: Refactor nhc evaluation in fib_table_lookup
FIB lookups can return an entry that references an external nexthop.
While walking the nexthop struct we do not want to make multiple calls
into the nexthop code which can result in 2 different structs getting
accessed - one returning the number of paths the rest of the loop
seeing a different nh_grp struct. If the nexthop group shrunk, the
result is an attempt to access a fib_nh_common that does not exist for
the new nh_grp struct but did for the old one.

To fix that move the device evaluation code to a helper that can be
used for inline fib_nh path as well as external nexthops.

Update the existing check for fi->nh in fib_table_lookup to call a
new helper, nexthop_get_nhc_lookup, which walks the external nexthop
with a single rcu dereference.

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:06:07 -07:00
Nikolay Aleksandrov
90f33bffa3 nexthops: don't modify published nexthop groups
We must avoid modifying published nexthop groups while they might be
in use, otherwise we might see NULL ptr dereferences. In order to do
that we allocate 2 nexthoup group structures upon nexthop creation
and swap between them when we have to delete an entry. The reason is
that we can't fail nexthop group removal, so we can't handle allocation
failure thus we move the extra allocation on creation where we can
safely fail and return ENOMEM.

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:06:06 -07:00
David Ahern
ac21753a5c nexthops: Move code from remove_nexthop_from_groups to remove_nh_grp_entry
Move nh_grp dereference and check for removing nexthop group due to
all members gone into remove_nh_grp_entry.

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:06:06 -07:00
Guillaume Nault
61aec25a6d cls_flower: Support filtering on multiple MPLS Label Stack Entries
With struct flow_dissector_key_mpls now recording the first
FLOW_DIS_MPLS_MAX labels, we can extend Flower to filter on any of
these LSEs independently.

In order to avoid creating new netlink attributes for every possible
depth, let's define a new TCA_FLOWER_KEY_MPLS_OPTS nested attribute
that contains the list of LSEs to match. Each LSE is represented by
another attribute, TCA_FLOWER_KEY_MPLS_OPTS_LSE, which then contains
the attributes representing the depth and the MPLS fields to match at
this depth (label, TTL, etc.).

For each MPLS field, the mask is always set to all-ones, as this is
what the original API did. We could allow user configurable masks in
the future if there is demand for more flexibility.

The new API also allows to only specify an LSE depth. In that case,
Flower only verifies that the MPLS label stack depth is greater or
equal to the provided depth (that is, an LSE exists at this depth).

Filters that only match on one (or more) fields of the first LSE are
dumped using the old netlink attributes, to avoid confusing user space
programs that don't understand the new API.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 15:22:58 -07:00
Guillaume Nault
58cff782cc flow_dissector: Parse multiple MPLS Label Stack Entries
The current MPLS dissector only parses the first MPLS Label Stack
Entry (second LSE can be parsed too, but only to set a key_id).

This patch adds the possibility to parse several LSEs by making
__skb_flow_dissect_mpls() return FLOW_DISSECT_RET_PROTO_AGAIN as long
as the Bottom Of Stack bit hasn't been seen, up to a maximum of
FLOW_DIS_MPLS_MAX entries.

FLOW_DIS_MPLS_MAX is arbitrarily set to 7. This should be enough for
many practical purposes, without wasting too much space.

To record the parsed values, flow_dissector_key_mpls is modified to
store an array of stack entries, instead of just the values of the
first one. A bit field, "used_lses", is also added to keep track of
the LSEs that have been set. The objective is to avoid defining a
new FLOW_DISSECTOR_KEY_MPLS_XX for each level of the MPLS stack.

TC flower is adapted for the new struct flow_dissector_key_mpls layout.
Matching on several MPLS Label Stack Entries will be added in the next
patch.

The NFP and MLX5 drivers are also adapted: nfp_flower_compile_mac() and
mlx5's parse_tunnel() now verify that the rule only uses the first LSE
and fail if it doesn't.

Finally, the behaviour of the FLOW_DISSECTOR_KEY_MPLS_ENTROPY key is
slightly modified. Instead of recording the first Entropy Label, it
now records the last one. This shouldn't have any consequences since
there doesn't seem to have any user of FLOW_DISSECTOR_KEY_MPLS_ENTROPY
in the tree. We'd probably better do a hash of all parsed MPLS labels
instead (excluding reserved labels) anyway. That'd give better entropy
and would probably also simplify the code. But that's not the purpose
of this patch, so I'm keeping that as a future possible improvement.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 15:22:58 -07:00
David S. Miller
fb8ddaa915 This cleanup patchset includes the following patches:
- Fix revert dynamic lockdep key changes for batman-adv,
    by Sven Eckelmann
 
  - use rcu_replace_pointer() where appropriate, by Antonio Quartulli
 
  - Revert "disable ethtool link speed detection when auto negotiation
    off", by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl7M5/oWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoRbjD/9+7kbUOYkMEz1x0izlXdXLSlMK
 EVU0iaJAecRxwN1+o7l1XstGIn5+Y77WsjaB9KKaOHOqWddZN8bQ1/b1lpp4s47n
 GR650eZcuPjXgSPVmiq86iztRNspOAokuCzFUkVh+AVeLMe/yonSxMpA8QFoYZMD
 KLxqxlG1xlPqvI+AULLaXSKWlt2tPNWBdpRwxL7TjupmblBMMwVxvi4G7J2Hxest
 PV//Sakm250kZnQXtoTT3vftDwBwoTFNOZGk9j/2gPrkSWx6l1CAOxp0KKP+sHkU
 ZlJtqabv0YCFPh17moOCB1Fy/RAvwAgwps1wb3+sK+YSsKg6R5VyAE9eajQx9+nL
 ypxLSFMs4JWSE+7qw9g6TnoQa1QsXIKmYNqyy5CwGfLg4F42rzA2obSYVA/I93ny
 KG00tqJM5FOaybXH59fKh5pti+PevWtPc1L5N0QCM6RX99dw79QSThx9wI7OeKfy
 xLTClx6y0Y2dZC80+zJUQgxDCFnA8+cvNznXNL/TYmf3kqrgUeC1G0prAdBuDfsd
 1AKwnVMVsVN4qUmxIrFFQHCtxu39GU2tU5RZy6WIAkpvN9zIYDvlaDOTnxZgfPpw
 59FmX/+0Gj9mh9FUWkJkyuZlkqFaztIkRVb10W1kXfALAIgp30pfSYGsDi0+0Y5J
 g4/cmMGRYNCN5RcAIg==
 =vDHL
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20200526' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This cleanup patchset includes the following patches:

 - Fix revert dynamic lockdep key changes for batman-adv,
   by Sven Eckelmann

 - use rcu_replace_pointer() where appropriate, by Antonio Quartulli

 - Revert "disable ethtool link speed detection when auto negotiation
   off", by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 15:19:29 -07:00
Tuong Lien
0a3e060f34 tipc: add test for Nagle algorithm effectiveness
When streaming in Nagle mode, we try to bundle small messages from user
as many as possible if there is one outstanding buffer, i.e. not ACK-ed
by the receiving side, which helps boost up the overall throughput. So,
the algorithm's effectiveness really depends on when Nagle ACK comes or
what the specific network latency (RTT) is, compared to the user's
message sending rate.

In a bad case, the user's sending rate is low or the network latency is
small, there will not be many bundles, so making a Nagle ACK or waiting
for it is not meaningful.
For example: a user sends its messages every 100ms and the RTT is 50ms,
then for each messages, we require one Nagle ACK but then there is only
one user message sent without any bundles.

In a better case, even if we have a few bundles (e.g. the RTT = 300ms),
but now the user sends messages in medium size, then there will not be
any difference at all, that says 3 x 1000-byte data messages if bundled
will still result in 3 bundles with MTU = 1500.

When Nagle is ineffective, the delay in user message sending is clearly
wasted instead of sending directly.

Besides, adding Nagle ACKs will consume some processor load on both the
sending and receiving sides.

This commit adds a test on the effectiveness of the Nagle algorithm for
an individual connection in the network on which it actually runs.
Particularly, upon receipt of a Nagle ACK we will compare the number of
bundles in the backlog queue to the number of user messages which would
be sent directly without Nagle. If the ratio is good (e.g. >= 2), Nagle
mode will be kept for further message sending. Otherwise, we will leave
Nagle and put a 'penalty' on the connection, so it will have to spend
more 'one-way' messages before being able to re-enter Nagle.

In addition, the 'ack-required' bit is only set when really needed that
the number of Nagle ACKs will be reduced during Nagle mode.

Testing with benchmark showed that with the patch, there was not much
difference in throughput for small messages since the tool continuously
sends messages without a break, so Nagle would still take in effect.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 15:16:52 -07:00
Tuong Lien
03b6fefd9b tipc: add support for broadcast rcv stats dumping
This commit enables dumping the statistics of a broadcast-receiver link
like the traditional 'broadcast-link' one (which is for broadcast-
sender). The link dumping can be triggered via netlink (e.g. the
iproute2/tipc tool) by the link flag - 'TIPC_NLA_LINK_BROADCAST' as the
indicator.

The name of a broadcast-receiver link of a specific peer will be in the
format: 'broadcast-link:<peer-id>'.

For example:

Link <broadcast-link:1001002>
  Window:50 packets
  RX packets:7841 fragments:2408/440 bundles:0/0
  TX packets:0 fragments:0/0 bundles:0/0
  RX naks:0 defs:124 dups:0
  TX naks:21 acks:0 retrans:0
  Congestion link:0  Send queue max:0 avg:0

In addition, the broadcast-receiver link statistics can be reset in the
usual way via netlink by specifying that link name in command.

Note: the 'tipc_link_name_ext()' is removed because the link name can
now be retrieved simply via the 'l->name'.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 15:16:52 -07:00
Tuong Lien
a91d55d162 tipc: enable broadcast retrans via unicast
In some environment, broadcast traffic is suppressed at high rate (i.e.
a kind of bandwidth limit setting). When it is applied, TIPC broadcast
can still run successfully. However, when it comes to a high load, some
packets will be dropped first and TIPC tries to retransmit them but the
packet retransmission is intentionally broadcast too, so making things
worse and not helpful at all.

This commit enables the broadcast retransmission via unicast which only
retransmits packets to the specific peer that has really reported a gap
i.e. not broadcasting to all nodes in the cluster, so will prevent from
being suppressed, and also reduce some overheads on the other peers due
to duplicates, finally improve the overall TIPC broadcast performance.

Note: the functionality can be turned on/off via the sysctl file:

echo 1 > /proc/sys/net/tipc/bc_retruni
echo 0 > /proc/sys/net/tipc/bc_retruni

Default is '0', i.e. the broadcast retransmission still works as usual.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 15:16:52 -07:00
Tuong Lien
c6ed7a5cc2 tipc: add back link trace events
In the previous commit ("tipc: add Gap ACK blocks support for broadcast
link"), we have removed the following link trace events due to the code
changes:

- tipc_link_bc_ack
- tipc_link_retrans

This commit adds them back along with some minor changes to adapt to
the new code.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 15:16:52 -07:00
Tuong Lien
d7626b5acf tipc: introduce Gap ACK blocks for broadcast link
As achieved through commit 9195948fbf ("tipc: improve TIPC throughput
by Gap ACK blocks"), we apply the same mechanism for the broadcast link
as well. The 'Gap ACK blocks' data field in a 'PROTOCOL/STATE_MSG' will
consist of two parts built for both the broadcast and unicast types:

 31                       16 15                        0
+-------------+-------------+-------------+-------------+
|  bgack_cnt  |  ugack_cnt  |            len            |
+-------------+-------------+-------------+-------------+  -
|            gap            |            ack            |   |
+-------------+-------------+-------------+-------------+    > bc gacks
:                           :                           :   |
+-------------+-------------+-------------+-------------+  -
|            gap            |            ack            |   |
+-------------+-------------+-------------+-------------+    > uc gacks
:                           :                           :   |
+-------------+-------------+-------------+-------------+  -

which is "automatically" backward-compatible.

We also increase the max number of Gap ACK blocks to 128, allowing upto
64 blocks per type (total buffer size = 516 bytes).

Besides, the 'tipc_link_advance_transmq()' function is refactored which
is applicable for both the unicast and broadcast cases now, so some old
functions can be removed and the code is optimized.

With the patch, TIPC broadcast is more robust regardless of packet loss
or disorder, latency, ... in the underlying network. Its performance is
boost up significantly.
For example, experiment with a 5% packet loss rate results:

$ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
real    0m 42.46s
user    0m 1.16s
sys     0m 17.67s

Without the patch:

$ time tipc-pipe --mc --rdm --data_size 123 --data_num 1500000
real    8m 27.94s
user    0m 0.55s
sys     0m 2.38s

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 15:16:52 -07:00
Eric Dumazet
239174945d tcp: tcp_v4_err() icmp skb is named icmp_skb
I missed the fact that tcp_v4_err() differs from tcp_v6_err().

After commit 4d1a2d9ec1 ("Rename skb to icmp_skb in tcp_v4_err()")
the skb argument has been renamed to icmp_skb only in one function.

I will in a future patch reconciliate these functions to avoid
this kind of confusion.

Fixes: 45af29ca76 ("tcp: allow traceroute -Mtcp for unpriv users")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 15:12:45 -07:00
Sven Eckelmann
9ad346c905 batman-adv: Revert "disable ethtool link speed detection when auto negotiation off"
The commit 8c46fcd783 ("batman-adv: disable ethtool link speed detection
when auto negotiation off") disabled the usage of ethtool's link_ksetting
when auto negotation was enabled due to invalid values when used with
tun/tap virtual net_devices. According to the patch, automatic measurements
should be used for these kind of interfaces.

But there are major flaws with this argumentation:

* automatic measurements are not implemented
* auto negotiation has nothing to do with the validity of the retrieved
  values

The first point has to be fixed by a longer patch series. The "validity"
part of the second point must be addressed in the same patch series by
dropping the usage of ethtool's link_ksetting (thus always doing automatic
measurements over ethernet).

Drop the patch again to have more default values for various net_device
types/configurations. The user can still overwrite them using the
batadv_hardif's BATADV_ATTR_THROUGHPUT_OVERRIDE.

Reported-by: Matthias Schiffer <mschiffer@universe-factory.net>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-05-26 09:23:33 +02:00
David S. Miller
963bdfc75c Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Set VLAN tag in tcp reset/icmp unreachable packets to reject
   connections in the bridge family, from Michael Braun.

2) Incorrect subcounter flag update in ipset, from Phil Sutter.

3) Possible buffer overflow in the pptp conntrack helper, based
   on patch from Dan Carpenter.

4) Restore userspace conntrack helper hook logic that broke after
   hook consolidation rework.

5) Unbreak userspace conntrack helper registration via
   nfnetlink_cthelper.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-25 18:28:42 -07:00
David S. Miller
1a6da4fcdd A few changes:
* fix a debugfs vs. wiphy rename crash
  * fix an invalid HE spec definition
  * fix a mesh timer crash
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl7L0skACgkQB8qZga/f
 l8RV0hAAnJaaF7hnBm3KuTgFWYdCUEc5IbaYZnD6TUM5xIX5IRrP4HtIrxL9K0sc
 h8AypNpvPU3OrDZOwoywMjD1LbgRo+91QK+3uo+ObUaGLpTytfeQVWu7x/yl7s+m
 SCbLBn9ahOipv3ODrR5JZ0PnNmbV7D9TSdtXHElQBpFd1KVJZMHeCvAzi6dZjT6H
 wVJ0JMHfQtl8prEBIqDFN8IbxYsMqYDBBoScqD0LOMg7TFGgSRlSLazE9pXPV7uT
 Q6wgSPmABa1C0lXI0TZcAT5Vkz5+9NpqC+lUtkBV4Eyrse5b8WNTeTWhvuybKMTf
 wdlDOoJg8CSWAbaMq5E8txzoimOZyqi9YKtWg2fbC4mtuM9Ur9JH+iO5oy9LoTkG
 DjR2dPEg3XQvczFJLlL/VmFp3c8amsEGd2DD00mIm9U1y9EDy/3GkoMQmDndBE3T
 /tvUDJkrH0pnntIIvn4kiDKMG47BV5Xm1wPfLKkwOY2K4z25Ze+D6ikzrFhenkqv
 s32J9D0m2jc4UygJcy+zdfqlNvckhrvrbhl+o0YaVHnqHpOYJpYkyq4nt+WdumBe
 fmEtUaPE4gC5PYiQPCz5Lnf4WtoC5fsj4jiRFBaJgotGEyZqBYN4zRhRXmIXOxRV
 /lJXHX3Uu9qY3RhSNWD/HDcz+p9D+tYTI+h4Sx9ffZiYKMolAPA=
 =+uUG
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2020-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
A few changes:
 * fix a debugfs vs. wiphy rename crash
 * fix an invalid HE spec definition
 * fix a mesh timer crash
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-25 18:18:26 -07:00
Horatiu Vultur
617504c67e bridge: mrp: Fix out-of-bounds read in br_mrp_parse
The issue was reported by syzbot. When the function br_mrp_parse was
called with a valid net_bridge_port, the net_bridge was an invalid
pointer. Therefore the check br->stp_enabled could pass/fail
depending where it was pointing in memory.
The fix consists of setting the net_bridge pointer if the port is a
valid pointer.

Reported-by: syzbot+9c6f0f1f8e32223df9a4@syzkaller.appspotmail.com
Fixes: 6536993371 ("bridge: mrp: Integrate MRP into the bridge")
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-25 18:09:26 -07:00
Eric Dumazet
45af29ca76 tcp: allow traceroute -Mtcp for unpriv users
Unpriv users can use traceroute over plain UDP sockets, but not TCP ones.

$ traceroute -Mtcp 8.8.8.8
You do not have enough privileges to use this traceroute method.

$ traceroute -n -Mudp 8.8.8.8
traceroute to 8.8.8.8 (8.8.8.8), 30 hops max, 60 byte packets
 1  192.168.86.1  3.631 ms  3.512 ms  3.405 ms
 2  10.1.10.1  4.183 ms  4.125 ms  4.072 ms
 3  96.120.88.125  20.621 ms  19.462 ms  20.553 ms
 4  96.110.177.65  24.271 ms  25.351 ms  25.250 ms
 5  69.139.199.197  44.492 ms  43.075 ms  44.346 ms
 6  68.86.143.93  27.969 ms  25.184 ms  25.092 ms
 7  96.112.146.18  25.323 ms 96.112.146.22  25.583 ms 96.112.146.26  24.502 ms
 8  72.14.239.204  24.405 ms 74.125.37.224  16.326 ms  17.194 ms
 9  209.85.251.9  18.154 ms 209.85.247.55  14.449 ms 209.85.251.9  26.296 ms^C

We can easily support traceroute over TCP, by queueing an error message
into socket error queue.

Note that applications need to set IP_RECVERR/IPV6_RECVERR option to
enable this feature, and that the error message is only queued
while in SYN_SNT state.

socket(AF_INET6, SOCK_STREAM, IPPROTO_IP) = 3
setsockopt(3, SOL_IPV6, IPV6_RECVERR, [1], 4) = 0
setsockopt(3, SOL_SOCKET, SO_TIMESTAMP_OLD, [1], 4) = 0
setsockopt(3, SOL_IPV6, IPV6_UNICAST_HOPS, [5], 4) = 0
connect(3, {sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
        inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0}, 28) = -1 EHOSTUNREACH (No route to host)
recvmsg(3, {msg_name={sa_family=AF_INET6, sin6_port=htons(8787), sin6_flowinfo=htonl(0),
        inet_pton(AF_INET6, "2002:a05:6608:297::", &sin6_addr), sin6_scope_id=0},
        msg_namelen=1024->28, msg_iov=[{iov_base="`\r\337\320\0004\6\1&\7\370\260\200\231\16\27\0\0\0\0\0\0\0\0 \2\n\5f\10\2\227"..., iov_len=1024}],
        msg_iovlen=1, msg_control=[{cmsg_len=32, cmsg_level=SOL_SOCKET, cmsg_type=SO_TIMESTAMP_OLD, cmsg_data={tv_sec=1590340680, tv_usec=272424}},
                                   {cmsg_len=60, cmsg_level=SOL_IPV6, cmsg_type=IPV6_RECVERR}],
        msg_controllen=96, msg_flags=MSG_ERRQUEUE}, MSG_ERRQUEUE) = 144

Suggested-by: Maciej Żenczykowski <maze@google.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-25 17:54:06 -07:00
Dan Carpenter
6a1015b0b4 ipv4: potential underflow in compat_ip_setsockopt()
The value of "n" is capped at 0x1ffffff but it checked for negative
values.  I don't think this causes a problem but I'm not certain and
it's harmless to prevent it.

Fixes: 2e04172875 ("ipv4: do compat setsockopt for MCAST_MSFILTER directly")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-25 17:51:28 -07:00
Vinay Kumar Yadav
0cada33241 net/tls: fix race condition causing kernel panic
tls_sw_recvmsg() and tls_decrypt_done() can be run concurrently.
// tls_sw_recvmsg()
	if (atomic_read(&ctx->decrypt_pending))
		crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
	else
		reinit_completion(&ctx->async_wait.completion);

//tls_decrypt_done()
  	pending = atomic_dec_return(&ctx->decrypt_pending);

  	if (!pending && READ_ONCE(ctx->async_notify))
  		complete(&ctx->async_wait.completion);

Consider the scenario tls_decrypt_done() is about to run complete()

	if (!pending && READ_ONCE(ctx->async_notify))

and tls_sw_recvmsg() reads decrypt_pending == 0, does reinit_completion(),
then tls_decrypt_done() runs complete(). This sequence of execution
results in wrong completion. Consequently, for next decrypt request,
it will not wait for completion, eventually on connection close, crypto
resources freed, there is no way to handle pending decrypt response.

This race condition can be avoided by having atomic_read() mutually
exclusive with atomic_dec_return(),complete().Intoduced spin lock to
ensure the mutual exclution.

Addressed similar problem in tx direction.

v1->v2:
- More readable commit message.
- Corrected the lock to fix new race scenario.
- Removed barrier which is not needed now.

Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Signed-off-by: Vinay Kumar Yadav <vinay.yadav@chelsio.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-25 17:41:40 -07:00
Björn Töpel
b16a87d0ae xsk: Add overflow check for u64 division, stored into u32
The npgs member of struct xdp_umem is an u32 entity, and stores the
number of pages the UMEM consumes. The calculation of npgs

  npgs = size / PAGE_SIZE

can overflow.

To avoid overflow scenarios, the division is now first stored in a
u64, and the result is verified to fit into 32b.

An alternative would be storing the npgs as a u64, however, this
wastes memory and is an unrealisticly large packet area.

Fixes: c0c77d8fb7 ("xsk: add user memory registration support sockopt")
Reported-by: "Minh Bùi Quang" <minhquangbui99@gmail.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/bpf/CACtPs=GGvV-_Yj6rbpzTVnopgi5nhMoCcTkSkYrJHGQHJWFZMQ@mail.gmail.com/
Link: https://lore.kernel.org/bpf/20200525080400.13195-1-bjorn.topel@gmail.com
2020-05-26 00:06:00 +02:00
Pablo Neira Ayuso
703acd70f2 netfilter: nfnetlink_cthelper: unbreak userspace helper support
Restore helper data size initialization and fix memcopy of the helper
data size.

Fixes: 157ffffeb5 ("netfilter: nfnetlink_cthelper: reject too large userspace allocation requests")
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-25 20:39:14 +02:00
Pablo Neira Ayuso
ee04805ff5 netfilter: conntrack: make conntrack userspace helpers work again
Florian Westphal says:

"Problem is that after the helper hook was merged back into the confirm
one, the queueing itself occurs from the confirm hook, i.e. we queue
from the last netfilter callback in the hook-list.

Therefore, on return, the packet bypasses the confirm action and the
connection is never committed to the main conntrack table.

To fix this there are several ways:
1. revert the 'Fixes' commit and have a extra helper hook again.
   Works, but has the drawback of adding another indirect call for
   everyone.

2. Special case this: split the hooks only when userspace helper
   gets added, so queueing occurs at a lower priority again,
   and normal enqueue reinject would eventually call the last hook.

3. Extend the existing nf_queue ct update hook to allow a forced
   confirmation (plus run the seqadj code).

This goes for 3)."

Fixes: 827318feb6 ("netfilter: conntrack: remove helper hook again")
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-25 20:39:14 +02:00
Pablo Neira Ayuso
4c559f15ef netfilter: nf_conntrack_pptp: prevent buffer overflows in debug code
Dan Carpenter says: "Smatch complains that the value for "cmd" comes
from the network and can't be trusted."

Add pptp_msg_name() helper function that checks for the array boundary.

Fixes: f09943fefe ("[NETFILTER]: nf_conntrack/nf_nat: add PPTP helper port")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-25 20:39:14 +02:00
Phil Sutter
a164b95ad6 netfilter: ipset: Fix subcounter update skip
If IPSET_FLAG_SKIP_SUBCOUNTER_UPDATE is set, user requested to not
update counters in sub sets. Therefore IPSET_FLAG_SKIP_COUNTER_UPDATE
must be set, not unset.

Fixes: 6e01781d1c ("netfilter: ipset: set match: add support to match the counters")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-25 20:39:14 +02:00
Michael Braun
e9c284ec4b netfilter: nft_reject_bridge: enable reject with bridge vlan
Currently, using the bridge reject target with tagged packets
results in untagged packets being sent back.

Fix this by mirroring the vlan id as well.

Fixes: 85f5b3086a ("netfilter: bridge: add reject support")
Signed-off-by: Michael Braun <michael-dev@fami-braun.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-25 20:39:05 +02:00
Valdis Kl ē tnieks
9f64fbdb77 bpfilter: document build requirements for bpfilter_umh
It's not intuitively obvious that bpfilter_umh is a statically linked binary.
Mention the toolchain requirement in the Kconfig help, so people
have an easier time figuring out what's needed.

Signed-off-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-05-26 00:03:16 +09:00
Xin Long
ed17b8d377 xfrm: fix a warning in xfrm_policy_insert_list
This waring can be triggered simply by:

  # ip xfrm policy update src 192.168.1.1/24 dst 192.168.1.2/24 dir in \
    priority 1 mark 0 mask 0x10  #[1]
  # ip xfrm policy update src 192.168.1.1/24 dst 192.168.1.2/24 dir in \
    priority 2 mark 0 mask 0x1   #[2]
  # ip xfrm policy update src 192.168.1.1/24 dst 192.168.1.2/24 dir in \
    priority 2 mark 0 mask 0x10  #[3]

Then dmesg shows:

  [ ] WARNING: CPU: 1 PID: 7265 at net/xfrm/xfrm_policy.c:1548
  [ ] RIP: 0010:xfrm_policy_insert_list+0x2f2/0x1030
  [ ] Call Trace:
  [ ]  xfrm_policy_inexact_insert+0x85/0xe50
  [ ]  xfrm_policy_insert+0x4ba/0x680
  [ ]  xfrm_add_policy+0x246/0x4d0
  [ ]  xfrm_user_rcv_msg+0x331/0x5c0
  [ ]  netlink_rcv_skb+0x121/0x350
  [ ]  xfrm_netlink_rcv+0x66/0x80
  [ ]  netlink_unicast+0x439/0x630
  [ ]  netlink_sendmsg+0x714/0xbf0
  [ ]  sock_sendmsg+0xe2/0x110

The issue was introduced by Commit 7cb8a93968 ("xfrm: Allow inserting
policies with matching mark and different priorities"). After that, the
policies [1] and [2] would be able to be added with different priorities.

However, policy [3] will actually match both [1] and [2]. Policy [1]
was matched due to the 1st 'return true' in xfrm_policy_mark_match(),
and policy [2] was matched due to the 2nd 'return true' in there. It
caused WARN_ON() in xfrm_policy_insert_list().

This patch is to fix it by only (the same value and priority) as the
same policy in xfrm_policy_mark_match().

Thanks to Yuehaibing, we could make this fix better.

v1->v2:
  - check policy->mark.v == pol->mark.v only without mask.

Fixes: 7cb8a93968 ("xfrm: Allow inserting policies with matching mark and different priorities")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-25 16:04:12 +02:00
Johannes Berg
0bbab5f030 cfg80211: fix debugfs rename crash
Removing the "if (IS_ERR(dir)) dir = NULL;" check only works
if we adjust the remaining code to not rely on it being NULL.
Check IS_ERR_OR_NULL() before attempting to dereference it.

I'm not actually entirely sure this fixes the syzbot crash as
the kernel config indicates that they do have DEBUG_FS in the
kernel, but this is what I found when looking there.

Cc: stable@vger.kernel.org
Fixes: d82574a8e5 ("cfg80211: no need to check return value of debugfs_create functions")
Reported-by: syzbot+fd5332e429401bf42d18@syzkaller.appspotmail.com
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20200525113816.fc4da3ec3d4b.Ica63a110679819eaa9fb3bc1b7437d96b1fd187d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-25 13:12:32 +02:00
Linus Lüssing
e2d4a80f93 mac80211: mesh: fix discovery timer re-arming issue / crash
On a non-forwarding 802.11s link between two fairly busy
neighboring nodes (iperf with -P 16 at ~850MBit/s TCP;
1733.3 MBit/s VHT-MCS 9 80MHz short GI VHT-NSS 4), so with
frequent PREQ retries, usually after around 30-40 seconds the
following crash would occur:

[ 1110.822428] Unable to handle kernel read from unreadable memory at virtual address 00000000
[ 1110.830786] Mem abort info:
[ 1110.833573]   Exception class = IABT (current EL), IL = 32 bits
[ 1110.839494]   SET = 0, FnV = 0
[ 1110.842546]   EA = 0, S1PTW = 0
[ 1110.845678] user pgtable: 4k pages, 48-bit VAs, pgd = ffff800076386000
[ 1110.852204] [0000000000000000] *pgd=00000000f6322003, *pud=00000000f62de003, *pmd=0000000000000000
[ 1110.861167] Internal error: Oops: 86000004 [#1] PREEMPT SMP
[ 1110.866730] Modules linked in: pppoe ppp_async batman_adv ath10k_pci ath10k_core ath pppox ppp_generic nf_conntrack_ipv6 mac80211 iptable_nat ipt_REJECT ipt_MASQUERADE cfg80211 xt_time xt_tcpudp xt_state xt_nat xt_multiport xt_mark xt_mac xt_limit xt_conntrack xt_comment xt_TCPMSS xt_REDIRECT xt_LOG xt_FLOWOFFLOAD slhc nf_reject_ipv4 nf_nat_redirect nf_nat_masquerade_ipv4 nf_conntrack_ipv4 nf_nat_ipv4 nf_nat nf_log_ipv4 nf_flow_table_hw nf_flow_table nf_defrag_ipv6 nf_defrag_ipv4 nf_conntrack_rtcache nf_conntrack iptable_mangle iptable_filter ip_tables crc_ccitt compat nf_log_ipv6 nf_log_common ip6table_mangle ip6table_filter ip6_tables ip6t_REJECT x_tables nf_reject_ipv6 usb_storage xhci_plat_hcd xhci_pci xhci_hcd dwc3 usbcore usb_common
[ 1110.932190] Process swapper/3 (pid: 0, stack limit = 0xffff0000090c8000)
[ 1110.938884] CPU: 3 PID: 0 Comm: swapper/3 Not tainted 4.14.162 #0
[ 1110.944965] Hardware name: LS1043A RGW Board (DT)
[ 1110.949658] task: ffff8000787a81c0 task.stack: ffff0000090c8000
[ 1110.955568] PC is at 0x0
[ 1110.958097] LR is at call_timer_fn.isra.27+0x24/0x78
[ 1110.963055] pc : [<0000000000000000>] lr : [<ffff0000080ff29c>] pstate: 00400145
[ 1110.970440] sp : ffff00000801be10
[ 1110.973744] x29: ffff00000801be10 x28: ffff000008bf7018
[ 1110.979047] x27: ffff000008bf87c8 x26: ffff000008c160c0
[ 1110.984352] x25: 0000000000000000 x24: 0000000000000000
[ 1110.989657] x23: dead000000000200 x22: 0000000000000000
[ 1110.994959] x21: 0000000000000000 x20: 0000000000000101
[ 1111.000262] x19: ffff8000787a81c0 x18: 0000000000000000
[ 1111.005565] x17: ffff0000089167b0 x16: 0000000000000058
[ 1111.010868] x15: ffff0000089167b0 x14: 0000000000000000
[ 1111.016172] x13: ffff000008916788 x12: 0000000000000040
[ 1111.021475] x11: ffff80007fda9af0 x10: 0000000000000001
[ 1111.026777] x9 : ffff00000801bea0 x8 : 0000000000000004
[ 1111.032080] x7 : 0000000000000000 x6 : ffff80007fda9aa8
[ 1111.037383] x5 : ffff00000801bea0 x4 : 0000000000000010
[ 1111.042685] x3 : ffff00000801be98 x2 : 0000000000000614
[ 1111.047988] x1 : 0000000000000000 x0 : 0000000000000000
[ 1111.053290] Call trace:
[ 1111.055728] Exception stack(0xffff00000801bcd0 to 0xffff00000801be10)
[ 1111.062158] bcc0:                                   0000000000000000 0000000000000000
[ 1111.069978] bce0: 0000000000000614 ffff00000801be98 0000000000000010 ffff00000801bea0
[ 1111.077798] bd00: ffff80007fda9aa8 0000000000000000 0000000000000004 ffff00000801bea0
[ 1111.085618] bd20: 0000000000000001 ffff80007fda9af0 0000000000000040 ffff000008916788
[ 1111.093437] bd40: 0000000000000000 ffff0000089167b0 0000000000000058 ffff0000089167b0
[ 1111.101256] bd60: 0000000000000000 ffff8000787a81c0 0000000000000101 0000000000000000
[ 1111.109075] bd80: 0000000000000000 dead000000000200 0000000000000000 0000000000000000
[ 1111.116895] bda0: ffff000008c160c0 ffff000008bf87c8 ffff000008bf7018 ffff00000801be10
[ 1111.124715] bdc0: ffff0000080ff29c ffff00000801be10 0000000000000000 0000000000400145
[ 1111.132534] bde0: ffff8000787a81c0 ffff00000801bde8 0000ffffffffffff 000001029eb19be8
[ 1111.140353] be00: ffff00000801be10 0000000000000000
[ 1111.145220] [<          (null)>]           (null)
[ 1111.149917] [<ffff0000080ff77c>] run_timer_softirq+0x184/0x398
[ 1111.155741] [<ffff000008081938>] __do_softirq+0x100/0x1fc
[ 1111.161130] [<ffff0000080a2e28>] irq_exit+0x80/0xd8
[ 1111.166002] [<ffff0000080ea708>] __handle_domain_irq+0x88/0xb0
[ 1111.171825] [<ffff000008081678>] gic_handle_irq+0x68/0xb0
[ 1111.177213] Exception stack(0xffff0000090cbe30 to 0xffff0000090cbf70)
[ 1111.183642] be20:                                   0000000000000020 0000000000000000
[ 1111.191461] be40: 0000000000000001 0000000000000000 00008000771af000 0000000000000000
[ 1111.199281] be60: ffff000008c95180 0000000000000000 ffff000008c19360 ffff0000090cbef0
[ 1111.207101] be80: 0000000000000810 0000000000000400 0000000000000098 ffff000000000000
[ 1111.214920] bea0: 0000000000000001 ffff0000089167b0 0000000000000000 ffff0000089167b0
[ 1111.222740] bec0: 0000000000000000 ffff000008c198e8 ffff000008bf7018 ffff000008c19000
[ 1111.230559] bee0: 0000000000000000 0000000000000000 ffff8000787a81c0 ffff000008018000
[ 1111.238380] bf00: ffff00000801c000 ffff00000913ba34 ffff8000787a81c0 ffff0000090cbf70
[ 1111.246199] bf20: ffff0000080857cc ffff0000090cbf70 ffff0000080857d0 0000000000400145
[ 1111.254020] bf40: ffff000008018000 ffff00000801c000 ffffffffffffffff ffff0000080fa574
[ 1111.261838] bf60: ffff0000090cbf70 ffff0000080857d0
[ 1111.266706] [<ffff0000080832e8>] el1_irq+0xe8/0x18c
[ 1111.271576] [<ffff0000080857d0>] arch_cpu_idle+0x10/0x18
[ 1111.276880] [<ffff0000080d7de4>] do_idle+0xec/0x1b8
[ 1111.281748] [<ffff0000080d8020>] cpu_startup_entry+0x20/0x28
[ 1111.287399] [<ffff00000808f81c>] secondary_start_kernel+0x104/0x110
[ 1111.293662] Code: bad PC value
[ 1111.296710] ---[ end trace 555b6ca4363c3edd ]---
[ 1111.301318] Kernel panic - not syncing: Fatal exception in interrupt
[ 1111.307661] SMP: stopping secondary CPUs
[ 1111.311574] Kernel Offset: disabled
[ 1111.315053] CPU features: 0x0002000
[ 1111.318530] Memory Limit: none
[ 1111.321575] Rebooting in 3 seconds..

With some added debug output / delays we were able to push the crash from
the timer callback runner into the callback function and by that shedding
some light on which object holding the timer gets corrupted:

[  401.720899] Unable to handle kernel read from unreadable memory at virtual address 00000868
[...]
[  402.335836] [<ffff0000088fafa4>] _raw_spin_lock_bh+0x14/0x48
[  402.341548] [<ffff000000dbe684>] mesh_path_timer+0x10c/0x248 [mac80211]
[  402.348154] [<ffff0000080ff29c>] call_timer_fn.isra.27+0x24/0x78
[  402.354150] [<ffff0000080ff77c>] run_timer_softirq+0x184/0x398
[  402.359974] [<ffff000008081938>] __do_softirq+0x100/0x1fc
[  402.365362] [<ffff0000080a2e28>] irq_exit+0x80/0xd8
[  402.370231] [<ffff0000080ea708>] __handle_domain_irq+0x88/0xb0
[  402.376053] [<ffff000008081678>] gic_handle_irq+0x68/0xb0

The issue happens due to the following sequence of events:

1) mesh_path_start_discovery():
-> spin_unlock_bh(&mpath->state_lock) before mesh_path_sel_frame_tx()

2) mesh_path_free_rcu()
-> del_timer_sync(&mpath->timer)
   [...]
-> kfree_rcu(mpath)

3) mesh_path_start_discovery():
-> mod_timer(&mpath->timer, ...)
   [...]
-> rcu_read_unlock()

4) mesh_path_free_rcu()'s kfree_rcu():
-> kfree(mpath)

5) mesh_path_timer() starts after timeout, using freed mpath object

So a use-after-free issue due to a timer re-arming bug caused by an
early spin-unlocking.

This patch fixes this issue by re-checking if mpath is about to be
free'd and if so bails out of re-arming the timer.

Cc: stable@vger.kernel.org
Fixes: 050ac52cbe ("mac80211: code for on-demand Hybrid Wireless Mesh Protocol")
Cc: Simon Wunderlich <sw@simonwunderlich.de>
Signed-off-by: Linus Lüssing <ll@simonwunderlich.de>
Link: https://lore.kernel.org/r/20200522170413.14973-1-linus.luessing@c0d3.blue
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-25 10:31:16 +02:00
David S. Miller
13209a8f73 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The MSCC bug fix in 'net' had to be slightly adjusted because the
register accesses are done slightly differently in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-24 13:47:27 -07:00
Linus Torvalds
caffb99b69 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix RCU warnings in ipv6 multicast router code, from Madhuparna
    Bhowmik.

 2) Nexthop attributes aren't being checked properly because of
    mis-initialized iterator, from David Ahern.

 3) Revert iop_idents_reserve() change as it caused performance
    regressions and was just working around what is really a UBSAN bug
    in the compiler. From Yuqi Jin.

 4) Read MAC address properly from ROM in bmac driver (double iteration
    proceeds past end of address array), from Jeremy Kerr.

 5) Add Microsoft Surface device IDs to r8152, from Marc Payne.

 6) Prevent reference to freed SKB in __netif_receive_skb_core(), from
    Boris Sukholitko.

 7) Fix ACK discard behavior in rxrpc, from David Howells.

 8) Preserve flow hash across packet scrubbing in wireguard, from Jason
    A. Donenfeld.

 9) Cap option length properly for SO_BINDTODEVICE in AX25, from Eric
    Dumazet.

10) Fix encryption error checking in kTLS code, from Vadim Fedorenko.

11) Missing BPF prog ref release in flow dissector, from Jakub Sitnicki.

12) dst_cache must be used with BH disabled in tipc, from Eric Dumazet.

13) Fix use after free in mlxsw driver, from Jiri Pirko.

14) Order kTLS key destruction properly in mlx5 driver, from Tariq
    Toukan.

15) Check devm_platform_ioremap_resource() return value properly in
    several drivers, from Tiezhu Yang.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (71 commits)
  net: smsc911x: Fix runtime PM imbalance on error
  net/mlx4_core: fix a memory leak bug.
  net: ethernet: ti: cpsw: fix ASSERT_RTNL() warning during suspend
  net: phy: mscc: fix initialization of the MACsec protocol mode
  net: stmmac: don't attach interface until resume finishes
  net: Fix return value about devm_platform_ioremap_resource()
  net/mlx5: Fix error flow in case of function_setup failure
  net/mlx5e: CT: Correctly get flow rule
  net/mlx5e: Update netdev txq on completions during closure
  net/mlx5: Annotate mutex destroy for root ns
  net/mlx5: Don't maintain a case of del_sw_func being null
  net/mlx5: Fix cleaning unmanaged flow tables
  net/mlx5: Fix memory leak in mlx5_events_init
  net/mlx5e: Fix inner tirs handling
  net/mlx5e: kTLS, Destroy key object after destroying the TIS
  net/mlx5e: Fix allowed tc redirect merged eswitch offload cases
  net/mlx5: Avoid processing commands before cmdif is ready
  net/mlx5: Fix a race when moving command interface to events mode
  net/mlx5: Add command entry handling completion
  rxrpc: Fix a memory leak in rxkad_verify_response()
  ...
2020-05-23 17:16:18 -07:00
Heiner Kallweit
316107119f ethtool: propagate get_coalesce return value
get_coalesce returns 0 or ERRNO, but the return value isn't checked.
The returned coalesce data may be invalid if an ERRNO is set,
therefore better check and propagate the return value.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-23 16:57:00 -07:00
Bartosz Golaszewski
cd16627fc0 net: devres: provide devm_register_netdev()
Provide devm_register_netdev() - a device resource managed variant
of register_netdev(). This new helper will only work for net_device
structs that are also already managed by devres.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-23 16:56:17 -07:00
Bartosz Golaszewski
f75063abc3 net: devres: define a separate devres structure for devm_alloc_etherdev()
Not using a proxy structure to store struct net_device doesn't save
anything in terms of compiled code size or memory usage but significantly
decreases the readability of the code with all the pointer casting.

Define struct net_device_devres and use it in devm_alloc_etherdev_mqs().

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-23 16:56:17 -07:00
Bartosz Golaszewski
cb8a14b205 net: move devres helpers into a separate source file
There's currently only a single devres helper in net/ - devm variant
of alloc_etherdev. Let's move it to net/devres.c with the intention of
assing a second one: devm_register_netdev(). This new routine will need
to know the address of the release function of devm_alloc_etherdev() so
that it can verify (using devres_find()) that the struct net_device
that's being passed to it is also resource managed.

Signed-off-by: Bartosz Golaszewski <bgolaszewski@baylibre.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-23 16:56:17 -07:00
Randy Dunlap
07a7f30819 net: psample: fix build error when CONFIG_INET is not enabled
Fix psample build error when CONFIG_INET is not set/enabled by
bracketing the tunnel code in #ifdef CONFIG_NET / #endif.

../net/psample/psample.c: In function ‘__psample_ip_tun_to_nlattr’:
../net/psample/psample.c:216:25: error: implicit declaration of function ‘ip_tunnel_info_opts’; did you mean ‘ip_tunnel_info_opts_set’? [-Werror=implicit-function-declaration]

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Yotam Gigi <yotam.gi@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-23 16:36:05 -07:00
David S. Miller
a152b85984 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-05-23

The following pull-request contains BPF updates for your *net-next* tree.

We've added 50 non-merge commits during the last 8 day(s) which contain
a total of 109 files changed, 2776 insertions(+), 2887 deletions(-).

The main changes are:

1) Add a new AF_XDP buffer allocation API to the core in order to help
   lowering the bar for drivers adopting AF_XDP support. i40e, ice, ixgbe
   as well as mlx5 have been moved over to the new API and also gained a
   small improvement in performance, from Björn Töpel and Magnus Karlsson.

2) Add getpeername()/getsockname() attach types for BPF sock_addr programs
   in order to allow for e.g. reverse translation of load-balancer backend
   to service address/port tuple from a connected peer, from Daniel Borkmann.

3) Improve the BPF verifier is_branch_taken() logic to evaluate pointers
   being non-NULL, e.g. if after an initial test another non-NULL test on
   that pointer follows in a given path, then it can be pruned right away,
   from John Fastabend.

4) Larger rework of BPF sockmap selftests to make output easier to understand
   and to reduce overall runtime as well as adding new BPF kTLS selftests
   that run in combination with sockmap, also from John Fastabend.

5) Batch of misc updates to BPF selftests including fixing up test_align
   to match verifier output again and moving it under test_progs, allowing
   bpf_iter selftest to compile on machines with older vmlinux.h, and
   updating config options for lirc and v6 segment routing helpers, from
   Stanislav Fomichev, Andrii Nakryiko and Alan Maguire.

6) Conversion of BPF tracing samples outdated internal BPF loader to use
   libbpf API instead, from Daniel T. Lee.

7) Follow-up to BPF kernel test infrastructure in order to fix a flake in
   the XDP selftests, from Jesper Dangaard Brouer.

8) Minor improvements to libbpf's internal hashmap implementation, from
   Ian Rogers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 18:30:34 -07:00
Qiushi Wu
f45d01f4f3 rxrpc: Fix a memory leak in rxkad_verify_response()
A ticket was not released after a call of the function
"rxkad_decrypt_ticket" failed. Thus replace the jump target
"temporary_error_free_resp" by "temporary_error_free_ticket".

Fixes: 8c2f826dc3 ("rxrpc: Don't put crypto buffers on the stack")
Signed-off-by: Qiushi Wu <wu000273@umn.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Markus Elfring <Markus.Elfring@web.de>
2020-05-23 00:35:46 +01:00
Horatiu Vultur
4fb13499d3 bridge: mrp: Restore port state when deleting MRP instance
When a MRP instance is deleted, then restore the port according to the
bridge state. If the bridge is up then the ports will be in forwarding
state otherwise will be in disabled state.

Fixes: 9a9f26e8f7 ("bridge: mrp: Connect MRP API with the switchdev API")
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 16:17:15 -07:00
Horatiu Vultur
7aa38018be bridge: mrp: Add br_mrp_unique_ifindex function
It is not allow to have the same net bridge port part of multiple MRP
rings. Therefore add a check if the port is used already in a different
MRP. In that case return failure.

Fixes: 9a9f26e8f7 ("bridge: mrp: Connect MRP API with the switchdev API")
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 16:17:15 -07:00
David S. Miller
4629ed2e48 RxRPC fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7FQH4ACgkQ+7dXa6fL
 C2u05BAAlilXCYGZU23/yQmMxxkmcG+jW+oDV9ySNDl6iJT+OtKxHEGReDD4D2f0
 rhaBivgGcOnZy89AGjNrSVROGlwSBXl7ArJcjsfsx8AuNzxHUHQKWlW/k8n87qEt
 NCTze7f65IT6NowgYAFgJn5kIpY/9iKuNiCf6NGL3Z35wqxPvwNs6AQSGM495uvB
 el/ddkr8QzzjI9Ejsgzj94x4DAOjk4T4WzfWMAgyr1OEqz6vKNKkCwSKPySOsQAK
 72JRaGhWA9rfAOkA7nAZpnjHdfFYnkFBOVQzmswOJYRYe3D/QY5D9PUlGIQ5OSjL
 yV5YOi/+AUrSif79NfEYXga0r/NFJMFqBg2zo/eiSrhfZZFZMDcagnGhzpGjbYF1
 IaeIu4q/MQOQybi8m1GJhvFfPOhdKRn731jlsUvEoxK0TonSu/u64eus+qelQxOd
 uiIcu/kLxfPZSznUd8cXZ+Pffce0uBIRWq0nRQZ703TyHY+/gYo7ZGHr/FZNKaK4
 lRNP4Nu3goLQCI40R7y7USnpX+kWfd4mYC9zl+VBSXG1JymYbOezXYrNATBCqpo6
 9VoYtqDdo8ESksFUBqM7fGRDZ20nah6KdRGmnrPU+rpODHZEZmNN7D/rayU31wua
 VIbVw1WluvSVnQ8+b1BwJwvhJQ4CazyGcnbxDqx1zd07EjUiiJI=
 =jJvy
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-fixes-20200520' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fix retransmission timeout and ACK discard

Here are a couple of fixes and an extra tracepoint for AF_RXRPC:

 (1) Calculate the RTO pretty much as TCP does, rather than making
     something up, including an initial 4s timeout (which causes return
     probes from the fileserver to fail if a packet goes missing), and add
     backoff.

 (2) Fix the discarding of out-of-order received ACKs.  We mustn't let the
     hard-ACK point regress, nor do we want to do unnecessary
     retransmission because the soft-ACK list regresses.  This is not
     trivial, however, due to some loose wording in various old protocol
     specs, the ACK field that should be used for this sometimes has the
     wrong information in it.

 (3) Add a tracepoint to log a discarded ACK.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 15:53:30 -07:00
Edward Cree
060b6381ef net: flow_offload: simplify hw stats check handling
Make FLOW_ACTION_HW_STATS_DONT_CARE be all bits, rather than none, so that
 drivers and __flow_action_hw_stats_check can use simple bitwise checks.

Pre-fill all actions with DONT_CARE in flow_rule_alloc(), rather than
 relying on implicit semantics of zero from kzalloc, so that callers which
 don't configure action stats themselves (i.e. netfilter) get the correct
 behaviour by default.

Only the kernel's internal API semantics change; the TC uAPI is unaffected.

v4: move DONT_CARE setting to flow_rule_alloc() for robustness and simplicity.

v3: set DONT_CARE in nft and ct offload.

v2: rebased on net-next, removed RFC tags.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 15:52:08 -07:00
Vadim Fedorenko
1515aa70c0 mpls: Add support for IPv6 tunnels
Add support for IPv6 tunnel devices in AF_MPLS.

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 15:49:31 -07:00
Vadim Fedorenko
f200e98d97 ip6_tunnel: add generic MPLS receive support
Add support for MPLS in receive side.

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 15:49:31 -07:00
Vadim Fedorenko
f234efac2c tunnel6: support for IPPROTO_MPLS
This patch is just preparation for MPLS support in ip6_tunnel

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 15:49:30 -07:00
Vadim Fedorenko
6c11fbf97e ip6_tunnel: add MPLS transmit support
Add ETH_P_MPLS_UC as supported protocol.

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 15:49:30 -07:00
Vadim Fedorenko
e7bb18e6c8 ip6_tunnel: simplify transmit path
Merge ip{4,6}ip6_tnl_xmit functions into one universal
ipxip6_tnl_xmit in preparation for adding MPLS support.

Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 15:49:30 -07:00
Jere Leppänen
d3e8e4c118 sctp: Start shutdown on association restart if in SHUTDOWN-SENT state and socket is closed
Commit bdf6fa52f0 ("sctp: handle association restarts when the
socket is closed.") starts shutdown when an association is restarted,
if in SHUTDOWN-PENDING state and the socket is closed. However, the
rationale stated in that commit applies also when in SHUTDOWN-SENT
state - we don't want to move an association to ESTABLISHED state when
the socket has been closed, because that results in an association
that is unreachable from user space.

The problem scenario:

1.  Client crashes and/or restarts.

2.  Server (using one-to-one socket) calls close(). SHUTDOWN is lost.

3.  Client reconnects using the same addresses and ports.

4.  Server's association is restarted. The association and the socket
    move to ESTABLISHED state, even though the server process has
    closed its descriptor.

Also, after step 4 when the server process exits, some resources are
leaked in an attempt to release the underlying inet sock structure in
ESTABLISHED state:

    IPv4: Attempt to release TCP socket in state 1 00000000377288c7

Fix by acting the same way as in SHUTDOWN-PENDING state. That is, if
an association is restarted in SHUTDOWN-SENT state and the socket is
closed, then start shutdown and don't move the association or the
socket to ESTABLISHED state.

Fixes: bdf6fa52f0 ("sctp: handle association restarts when the socket is closed.")
Signed-off-by: Jere Leppänen <jere.leppanen@nokia.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 15:47:59 -07:00
Eric Dumazet
1378817486 tipc: block BH before using dst_cache
dst_cache_get() documents it must be used with BH disabled.

sysbot reported :

BUG: using smp_processor_id() in preemptible [00000000] code: /21697
caller is dst_cache_get+0x3a/0xb0 net/core/dst_cache.c:68
CPU: 0 PID: 21697 Comm:  Not tainted 5.7.0-rc6-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x188/0x20d lib/dump_stack.c:118
 check_preemption_disabled lib/smp_processor_id.c:47 [inline]
 debug_smp_processor_id.cold+0x88/0x9b lib/smp_processor_id.c:57
 dst_cache_get+0x3a/0xb0 net/core/dst_cache.c:68
 tipc_udp_xmit.isra.0+0xb9/0xad0 net/tipc/udp_media.c:164
 tipc_udp_send_msg+0x3e6/0x490 net/tipc/udp_media.c:244
 tipc_bearer_xmit_skb+0x1de/0x3f0 net/tipc/bearer.c:526
 tipc_enable_bearer+0xb2f/0xd60 net/tipc/bearer.c:331
 __tipc_nl_bearer_enable+0x2bf/0x390 net/tipc/bearer.c:995
 tipc_nl_bearer_enable+0x1e/0x30 net/tipc/bearer.c:1003
 genl_family_rcv_msg_doit net/netlink/genetlink.c:673 [inline]
 genl_family_rcv_msg net/netlink/genetlink.c:718 [inline]
 genl_rcv_msg+0x627/0xdf0 net/netlink/genetlink.c:735
 netlink_rcv_skb+0x15a/0x410 net/netlink/af_netlink.c:2469
 genl_rcv+0x24/0x40 net/netlink/genetlink.c:746
 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
 netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329
 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 ____sys_sendmsg+0x6bf/0x7e0 net/socket.c:2362
 ___sys_sendmsg+0x100/0x170 net/socket.c:2416
 __sys_sendmsg+0xec/0x1b0 net/socket.c:2449
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x45ca29

Fixes: e9c1a79321 ("tipc: add dst_cache support for udp media")
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 15:39:00 -07:00
David S. Miller
d3b968bc2d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-05-22

The following pull-request contains BPF updates for your *net* tree.

We've added 3 non-merge commits during the last 3 day(s) which contain
a total of 5 files changed, 69 insertions(+), 11 deletions(-).

The main changes are:

1) Fix to reject mmap()'ing read-only array maps as writable since BPF verifier
   relies on such map content to be frozen, from Andrii Nakryiko.

2) Fix breaking audit from secid_to_secctx() LSM hook by avoiding to use
   call_int_hook() since this hook is not stackable, from KP Singh.

3) Fix BPF flow dissector program ref leak on netns cleanup, from Jakub Sitnicki.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 14:35:35 -07:00
Todd Malsbary
bd6972226f mptcp: use untruncated hash in ADD_ADDR HMAC
There is some ambiguity in the RFC as to whether the ADD_ADDR HMAC is
the rightmost 64 bits of the entire hash or of the leftmost 160 bits
of the hash.  The intention, as clarified with the author of the RFC,
is the entire hash.

This change returns the entire hash from
mptcp_crypto_hmac_sha (instead of only the first 160 bits), and moves
any truncation/selection operation on the hash to the caller.

Fixes: 12555a2d97 ("mptcp: use rightmost 64 bits in ADD_ADDR HMAC")
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Todd Malsbary <todd.malsbary@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 14:21:24 -07:00
Roopa Prabhu
8590ceedb7 nexthop: add support for notifiers
This patch adds nexthop add/del notifiers. To be used by
vxlan driver in a later patch. Could possibly be used by
switchdev drivers in the future.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 14:00:38 -07:00
Roopa Prabhu
1274e1cc42 vxlan: ecmp support for mac fdb entries
Todays vxlan mac fdb entries can point to multiple remote
ips (rdsts) with the sole purpose of replicating
broadcast-multicast and unknown unicast packets to those remote ips.

E-VPN multihoming [1,2,3] requires bridged vxlan traffic to be
load balanced to remote switches (vteps) belonging to the
same multi-homed ethernet segment (E-VPN multihoming is analogous
to multi-homed LAG implementations, but with the inter-switch
peerlink replaced with a vxlan tunnel). In other words it needs
support for mac ecmp. Furthermore, for faster convergence, E-VPN
multihoming needs the ability to update fdb ecmp nexthops independent
of the fdb entries.

New route nexthop API is perfect for this usecase.
This patch extends the vxlan fdb code to take a nexthop id
pointing to an ecmp nexthop group.

Changes include:
- New NDA_NH_ID attribute for fdbs
- Use the newly added fdb nexthop groups
- makes vxlan rdsts and nexthop handling code mutually
  exclusive
- since this is a new use-case and the requirement is for ecmp
nexthop groups, the fdb add and update path checks that the
nexthop is really an ecmp nexthop group. This check can be relaxed
in the future, if we want to introduce replication fdb nexthop groups
and allow its use in lieu of current rdst lists.
- fdb update requests with nexthop id's only allowed for existing
fdb's that have nexthop id's
- learning will not override an existing fdb entry with nexthop
group
- I have wrapped the switchdev offload code around the presence of
rdst

[1] E-VPN RFC https://tools.ietf.org/html/rfc7432
[2] E-VPN with vxlan https://tools.ietf.org/html/rfc8365
[3] http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf

Includes a null check fix in vxlan_xmit from Nikolay

v2 - Fixed build issue:
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 14:00:38 -07:00
Roopa Prabhu
38428d6871 nexthop: support for fdb ecmp nexthops
This patch introduces ecmp nexthops and nexthop groups
for mac fdb entries. In subsequent patches this is used
by the vxlan driver fdb entries. The use case is
E-VPN multihoming [1,2,3] which requires bridged vxlan traffic
to be load balanced to remote switches (vteps) belonging to
the same multi-homed ethernet segment (This is analogous to
a multi-homed LAG but over vxlan).

Changes include new nexthop flag NHA_FDB for nexthops
referenced by fdb entries. These nexthops only have ip.
This patch includes appropriate checks to avoid routes
referencing such nexthops.

example:
$ip nexthop add id 12 via 172.16.1.2 fdb
$ip nexthop add id 13 via 172.16.1.3 fdb
$ip nexthop add id 102 group 12/13 fdb

$bridge fdb add 02:02:00:00:00:13 dev vxlan1000 nhid 101 self

[1] E-VPN https://tools.ietf.org/html/rfc7432
[2] E-VPN VxLAN: https://tools.ietf.org/html/rfc8365
[3] LPC talk with mention of nexthop groups for L2 ecmp
http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf

v4 - fixed uninitialized variable reported by kernel test robot
Reported-by: kernel test robot <rong.a.chen@intel.com>

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 14:00:38 -07:00
Antonio Quartulli
cf78bb0bbc batman-adv: use rcu_replace_pointer() where appropriate
In commit a63fc6b75c ("rcu: Upgrade rcu_swap_protected() to
rcu_replace_pointer()") a new helper macro named rcu_replace_pointer() was
introduced to simplify code requiring to switch an rcu pointer to a new
value while extracting the old one.

Use rcu_replace_pointer() where appropriate to make code slimer.

Signed-off-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-05-22 14:19:24 +02:00
Sven Eckelmann
2092c910e2 batman-adv: Revert "Drop lockdep.h include for soft-interface.c"
The commit 1a33e10e4a ("net: partially revert dynamic lockdep key
changes") reverts the commit ab92d68fc2 ("net: core: add generic lockdep
keys"). But it forgot to also revert the commit 5759af0682 ("batman-adv:
Drop lockdep.h include for soft-interface.c") which depends on the latter.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-05-22 14:19:23 +02:00
Jakub Sitnicki
5cf65922bb flow_dissector: Drop BPF flow dissector prog ref on netns cleanup
When attaching a flow dissector program to a network namespace with
bpf(BPF_PROG_ATTACH, ...) we grab a reference to bpf_prog.

If netns gets destroyed while a flow dissector is still attached, and there
are no other references to the prog, we leak the reference and the program
remains loaded.

Leak can be reproduced by running flow dissector tests from selftests/bpf:

  # bpftool prog list
  # ./test_flow_dissector.sh
  ...
  selftests: test_flow_dissector [PASS]
  # bpftool prog list
  4: flow_dissector  name _dissect  tag e314084d332a5338  gpl
          loaded_at 2020-05-20T18:50:53+0200  uid 0
          xlated 552B  jited 355B  memlock 4096B  map_ids 3,4
          btf_id 4
  #

Fix it by detaching the flow dissector program when netns is going away.

Fixes: d58e468b11 ("flow_dissector: implements flow dissector BPF hook")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Link: https://lore.kernel.org/bpf/20200521083435.560256-1-jakub@cloudflare.com
2020-05-21 17:52:45 -07:00
Björn Töpel
26062b185e xsk: Explicitly inline functions and move definitions
In order to reduce the number of function calls, the struct
xsk_buff_pool definition is moved to xsk_buff_pool.h. The functions
xp_get_dma(), xp_dma_sync_for_cpu(), xp_dma_sync_for_device(),
xp_validate_desc() and various helper functions are explicitly
inlined.

Further, move xp_get_handle() and xp_release() to xsk.c, to allow for
the compiler to perform inlining.

rfc->v1: Make sure xp_validate_desc() is inlined for Tx perf. (Maxim)

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-15-bjorn.topel@gmail.com
2020-05-21 17:31:27 -07:00
Björn Töpel
82c41671ca xdp: Simplify xdp_return_{frame, frame_rx_napi, buff}
The xdp_return_{frame,frame_rx_napi,buff} function are never used,
except in xdp_convert_zc_to_xdp_frame(), by the MEM_TYPE_XSK_BUFF_POOL
memory type.

To simplify and reduce code, change so that
xdp_convert_zc_to_xdp_frame() calls xsk_buff_free() directly since the
type is know, and remove MEM_TYPE_XSK_BUFF_POOL from the switch
statement in __xdp_return() function.

Suggested-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-14-bjorn.topel@gmail.com
2020-05-21 17:31:27 -07:00
Björn Töpel
0807892ecb xsk: Remove MEM_TYPE_ZERO_COPY and corresponding code
There are no users of MEM_TYPE_ZERO_COPY. Remove all corresponding
code, including the "handle" member of struct xdp_buff.

rfc->v1: Fixed spelling in commit message. (Björn)

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-13-bjorn.topel@gmail.com
2020-05-21 17:31:27 -07:00
Björn Töpel
2b43470add xsk: Introduce AF_XDP buffer allocation API
In order to simplify AF_XDP zero-copy enablement for NIC driver
developers, a new AF_XDP buffer allocation API is added. The
implementation is based on a single core (single producer/consumer)
buffer pool for the AF_XDP UMEM.

A buffer is allocated using the xsk_buff_alloc() function, and
returned using xsk_buff_free(). If a buffer is disassociated with the
pool, e.g. when a buffer is passed to an AF_XDP socket, a buffer is
said to be released. Currently, the release function is only used by
the AF_XDP internals and not visible to the driver.

Drivers using this API should register the XDP memory model with the
new MEM_TYPE_XSK_BUFF_POOL type.

The API is defined in net/xdp_sock_drv.h.

The buffer type is struct xdp_buff, and follows the lifetime of
regular xdp_buffs, i.e.  the lifetime of an xdp_buff is restricted to
a NAPI context. In other words, the API is not replacing xdp_frames.

In addition to introducing the API and implementations, the AF_XDP
core is migrated to use the new APIs.

rfc->v1: Fixed build errors/warnings for m68k and riscv. (kbuild test
         robot)
         Added headroom/chunk size getter. (Maxim/Björn)

v1->v2: Swapped SoBs. (Maxim)

v2->v3: Initialize struct xdp_buff member frame_sz. (Björn)
        Add API to query the DMA address of a frame. (Maxim)
        Do DMA sync for CPU till the end of the frame to handle
        possible growth (frame_sz). (Maxim)

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-6-bjorn.topel@gmail.com
2020-05-21 17:31:26 -07:00
Björn Töpel
89e4a376e3 xsk: Move defines only used by AF_XDP internals to xsk.h
Move the XSK_NEXT_PG_CONTIG_{MASK,SHIFT}, and
XDP_UMEM_USES_NEED_WAKEUP defines from xdp_sock.h to the AF_XDP
internal xsk.h file. Also, start using the BIT{,_ULL} macro instead of
explicit shifts.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-5-bjorn.topel@gmail.com
2020-05-21 17:31:26 -07:00
Magnus Karlsson
a71506a4fd xsk: Move driver interface to xdp_sock_drv.h
Move the AF_XDP zero-copy driver interface to its own include file
called xdp_sock_drv.h. This, hopefully, will make it more clear for
NIC driver implementors to know what functions to use for zero-copy
support.

v4->v5: Fix -Wmissing-prototypes by include header file. (Jakub)

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-4-bjorn.topel@gmail.com
2020-05-21 17:31:26 -07:00
Björn Töpel
d20a1676df xsk: Move xskmap.c to net/xdp/
The XSKMAP is partly implemented by net/xdp/xsk.c. Move xskmap.c from
kernel/bpf/ to net/xdp/, which is the logical place for AF_XDP related
code. Also, move AF_XDP struct definitions, and function declarations
only used by AF_XDP internals into net/xdp/xsk.h.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-3-bjorn.topel@gmail.com
2020-05-21 17:31:26 -07:00
Sabrina Dubroca
41b4bd986f net: don't return invalid table id error when we fall back to PF_UNSPEC
In case we can't find a ->dumpit callback for the requested
(family,type) pair, we fall back to (PF_UNSPEC,type). In effect, we're
in the same situation as if userspace had requested a PF_UNSPEC
dump. For RTM_GETROUTE, that handler is rtnl_dump_all, which calls all
the registered RTM_GETROUTE handlers.

The requested table id may or may not exist for all of those
families. commit ae677bbb44 ("net: Don't return invalid table id
error when dumping all families") fixed the problem when userspace
explicitly requests a PF_UNSPEC dump, but missed the fallback case.

For example, when we pass ipv6.disable=1 to a kernel with
CONFIG_IP_MROUTE=y and CONFIG_IP_MROUTE_MULTIPLE_TABLES=y,
the (PF_INET6, RTM_GETROUTE) handler isn't registered, so we end up in
rtnl_dump_all, and listing IPv6 routes will unexpectedly print:

  # ip -6 r
  Error: ipv4: MR table does not exist.
  Dump terminated

commit ae677bbb44 introduced the dump_all_families variable, which
gets set when userspace requests a PF_UNSPEC dump. However, we can't
simply set the family to PF_UNSPEC in rtnetlink_rcv_msg in the
fallback case to get dump_all_families == true, because some messages
types (for example RTM_GETRULE and RTM_GETNEIGH) only register the
PF_UNSPEC handler and use the family to filter in the kernel what is
dumped to userspace. We would then export more entries, that userspace
would have to filter. iproute does that, but other programs may not.

Instead, this patch removes dump_all_families and updates the
RTM_GETROUTE handlers to check if the family that is being dumped is
their own. When it's not, which covers both the intentional PF_UNSPEC
dumps (as dump_all_families did) and the fallback case, ignore the
missing table id error.

Fixes: cb167893f4 ("net: Plumb support for filtering ipv4 and ipv6 multicast route dumps")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-21 17:25:50 -07:00
Vadim Fedorenko
57ebc8f085 net: ipip: fix wrong address family in init error path
In case of error with MPLS support the code is misusing AF_INET
instead of AF_MPLS.

Fixes: 1b69e7e6c4 ("ipip: support MPLS over IPv4")
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-21 17:24:25 -07:00
Vadim Fedorenko
635d939817 net/tls: free record only on encryption error
We cannot free record on any transient error because it leads to
losing previos data. Check socket error to know whether record must
be freed or not.

Fixes: d10523d0b3 ("net/tls: free the record on encryption error")
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-21 17:20:06 -07:00
Vadim Fedorenko
a7bff11f6f net/tls: fix encryption error checking
bpf_exec_tx_verdict() can return negative value for copied
variable. In that case this value will be pushed back to caller
and the real error code will be lost. Fix it using signed type and
checking for positive value.

Fixes: d10523d0b3 ("net/tls: free the record on encryption error")
Fixes: d3b18ad31f ("tls: add bpf support to sk_msg handling")
Signed-off-by: Vadim Fedorenko <vfedorenko@novek.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-21 17:20:06 -07:00
Oleksij Rempel
8066021915 ethtool: provide UAPI for PHY Signal Quality Index (SQI)
Signal Quality Index is a mandatory value required by "OPEN Alliance
SIG" for the 100Base-T1 PHYs [1]. This indicator can be used for cable
integrity diagnostic and investigating other noise sources and
implement by at least two vendors: NXP[2] and TI[3].

[1] http://www.opensig.org/download/document/218/Advanced_PHY_features_for_automotive_Ethernet_V1.0.pdf
[2] https://www.nxp.com/docs/en/data-sheet/TJA1100.pdf
[3] https://www.ti.com/product/DP83TC811R-Q1

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-21 17:18:00 -07:00
Manivannan Sadhasivam
d28ea1fbbf net: qrtr: Fix passing invalid reference to qrtr_local_enqueue()
Once the traversal of the list is completed with list_for_each_entry(),
the iterator (node) will point to an invalid object. So passing this to
qrtr_local_enqueue() which is outside of the iterator block is erroneous
eventhough the object is not used.

So fix this by passing NULL to qrtr_local_enqueue().

Fixes: bdabad3e36 ("net: Add Qualcomm IPC router")
Reported-by: kbuild test robot <lkp@intel.com>
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-21 17:04:53 -07:00
Chris Mi
d8bed686ab net: psample: Add tunnel support
Currently, psample can only send the packet bits after decapsulation.
The tunnel information is lost. Add the tunnel support.

If the sampled packet has no tunnel info, the behavior is the same as
before. If it has, add a nested metadata field named PSAMPLE_ATTR_TUNNEL
and include the tunnel subfields if applicable.

Increase the metadata length for sampled packet with the tunnel info.
If new subfields of tunnel info should be included, update the metadata
length accordingly.

Signed-off-by: Chris Mi <chrism@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-21 17:04:07 -07:00
Michal Kubecek
7c87e32d2e ethtool: count header size in reply size estimate
As ethnl_request_ops::reply_size handlers do not include common header
size into calculated/estimated reply size, it needs to be added in
ethnl_default_doit() and ethnl_default_notify() before allocating the
message. On the other hand, strset_reply_size() should not add common
header size.

Fixes: 728480f124 ("ethtool: default handlers for GET requests")
Reported-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-21 16:59:19 -07:00
Jason Gunthorpe
eafd47fc20 Linux 5.7-rc6
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl7BzV8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGg8EH/A2pXMTxtc96RI4S
 sttEsUQqbakFS0Z/2tQPpMGr/qW2e5eHgsTX/a3SiUeZiIXk6f4lMFkMuctzBf7p
 X77cNEDwGOEdbtCXTsMcmKSde7sP2zCXsPB8xTWLyE6rnaFRgikwwkeqgkIKhp1h
 bvOQV0t9HNGvxGAM0iZeOvQAvFl4vd7nS123/MYbir9cugfQUSJRueQ4BiCiJqVE
 6cNA7/vFzDJuFGszzIrJ7HXn/IdQMMWHkvTDjgBw0GZw1mDbGFbfbZwOeTz1ojCt
 smUQ4tIFxBa/VA5zx7dOy2P2keHbSVf4VLkZRPcceT7OqVS65ETmFDp+qt5NdWM5
 vZ8+7/0=
 =CyYH
 -----END PGP SIGNATURE-----

Merge tag 'v5.7-rc6' into rdma.git for-next

Linux 5.7-rc6

Conflict in drivers/net/ethernet/mellanox/mlx5/core/steering/dr_send.c
resolved by deleting dr_cq_event, matching how netdev resolved it.

Required for dependencies in the following patches.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-21 17:08:27 -03:00
J. Bruce Fields
6670ee2ef2 Merge branch 'nfsd-5.8' of git://linux-nfs.org/~cel/cel-2.6 into for-5.8-incoming
Highlights of this series:
* Remove serialization of sending RPC/RDMA Replies
* Convert the TCP socket send path to use xdr_buf::bvecs (pre-requisite for
RPC-on-TLS)
* Fix svcrdma backchannel sendto return code
* Convert a number of dprintk call sites to use tracepoints
* Fix the "suggest braces around empty body in an 'else' statement" warning
2020-05-21 10:58:15 -04:00
Stephen Worley
d69100b8ee net: nlmsg_cancel() if put fails for nhmsg
Fixes data remnant seen when we fail to reserve space for a
nexthop group during a larger dump.

If we fail the reservation, we goto nla_put_failure and
cancel the message.

Reproduce with the following iproute2 commands:
=====================
ip link add dummy1 type dummy
ip link add dummy2 type dummy
ip link add dummy3 type dummy
ip link add dummy4 type dummy
ip link add dummy5 type dummy
ip link add dummy6 type dummy
ip link add dummy7 type dummy
ip link add dummy8 type dummy
ip link add dummy9 type dummy
ip link add dummy10 type dummy
ip link add dummy11 type dummy
ip link add dummy12 type dummy
ip link add dummy13 type dummy
ip link add dummy14 type dummy
ip link add dummy15 type dummy
ip link add dummy16 type dummy
ip link add dummy17 type dummy
ip link add dummy18 type dummy
ip link add dummy19 type dummy
ip link add dummy20 type dummy
ip link add dummy21 type dummy
ip link add dummy22 type dummy
ip link add dummy23 type dummy
ip link add dummy24 type dummy
ip link add dummy25 type dummy
ip link add dummy26 type dummy
ip link add dummy27 type dummy
ip link add dummy28 type dummy
ip link add dummy29 type dummy
ip link add dummy30 type dummy
ip link add dummy31 type dummy
ip link add dummy32 type dummy

ip link set dummy1 up
ip link set dummy2 up
ip link set dummy3 up
ip link set dummy4 up
ip link set dummy5 up
ip link set dummy6 up
ip link set dummy7 up
ip link set dummy8 up
ip link set dummy9 up
ip link set dummy10 up
ip link set dummy11 up
ip link set dummy12 up
ip link set dummy13 up
ip link set dummy14 up
ip link set dummy15 up
ip link set dummy16 up
ip link set dummy17 up
ip link set dummy18 up
ip link set dummy19 up
ip link set dummy20 up
ip link set dummy21 up
ip link set dummy22 up
ip link set dummy23 up
ip link set dummy24 up
ip link set dummy25 up
ip link set dummy26 up
ip link set dummy27 up
ip link set dummy28 up
ip link set dummy29 up
ip link set dummy30 up
ip link set dummy31 up
ip link set dummy32 up

ip link set dummy33 up
ip link set dummy34 up

ip link set vrf-red up
ip link set vrf-blue up

ip link set dummyVRFred up
ip link set dummyVRFblue up

ip ro add 1.1.1.1/32 dev dummy1
ip ro add 1.1.1.2/32 dev dummy2
ip ro add 1.1.1.3/32 dev dummy3
ip ro add 1.1.1.4/32 dev dummy4
ip ro add 1.1.1.5/32 dev dummy5
ip ro add 1.1.1.6/32 dev dummy6
ip ro add 1.1.1.7/32 dev dummy7
ip ro add 1.1.1.8/32 dev dummy8
ip ro add 1.1.1.9/32 dev dummy9
ip ro add 1.1.1.10/32 dev dummy10
ip ro add 1.1.1.11/32 dev dummy11
ip ro add 1.1.1.12/32 dev dummy12
ip ro add 1.1.1.13/32 dev dummy13
ip ro add 1.1.1.14/32 dev dummy14
ip ro add 1.1.1.15/32 dev dummy15
ip ro add 1.1.1.16/32 dev dummy16
ip ro add 1.1.1.17/32 dev dummy17
ip ro add 1.1.1.18/32 dev dummy18
ip ro add 1.1.1.19/32 dev dummy19
ip ro add 1.1.1.20/32 dev dummy20
ip ro add 1.1.1.21/32 dev dummy21
ip ro add 1.1.1.22/32 dev dummy22
ip ro add 1.1.1.23/32 dev dummy23
ip ro add 1.1.1.24/32 dev dummy24
ip ro add 1.1.1.25/32 dev dummy25
ip ro add 1.1.1.26/32 dev dummy26
ip ro add 1.1.1.27/32 dev dummy27
ip ro add 1.1.1.28/32 dev dummy28
ip ro add 1.1.1.29/32 dev dummy29
ip ro add 1.1.1.30/32 dev dummy30
ip ro add 1.1.1.31/32 dev dummy31
ip ro add 1.1.1.32/32 dev dummy32

ip next add id 1 via 1.1.1.1 dev dummy1
ip next add id 2 via 1.1.1.2 dev dummy2
ip next add id 3 via 1.1.1.3 dev dummy3
ip next add id 4 via 1.1.1.4 dev dummy4
ip next add id 5 via 1.1.1.5 dev dummy5
ip next add id 6 via 1.1.1.6 dev dummy6
ip next add id 7 via 1.1.1.7 dev dummy7
ip next add id 8 via 1.1.1.8 dev dummy8
ip next add id 9 via 1.1.1.9 dev dummy9
ip next add id 10 via 1.1.1.10 dev dummy10
ip next add id 11 via 1.1.1.11 dev dummy11
ip next add id 12 via 1.1.1.12 dev dummy12
ip next add id 13 via 1.1.1.13 dev dummy13
ip next add id 14 via 1.1.1.14 dev dummy14
ip next add id 15 via 1.1.1.15 dev dummy15
ip next add id 16 via 1.1.1.16 dev dummy16
ip next add id 17 via 1.1.1.17 dev dummy17
ip next add id 18 via 1.1.1.18 dev dummy18
ip next add id 19 via 1.1.1.19 dev dummy19
ip next add id 20 via 1.1.1.20 dev dummy20
ip next add id 21 via 1.1.1.21 dev dummy21
ip next add id 22 via 1.1.1.22 dev dummy22
ip next add id 23 via 1.1.1.23 dev dummy23
ip next add id 24 via 1.1.1.24 dev dummy24
ip next add id 25 via 1.1.1.25 dev dummy25
ip next add id 26 via 1.1.1.26 dev dummy26
ip next add id 27 via 1.1.1.27 dev dummy27
ip next add id 28 via 1.1.1.28 dev dummy28
ip next add id 29 via 1.1.1.29 dev dummy29
ip next add id 30 via 1.1.1.30 dev dummy30
ip next add id 31 via 1.1.1.31 dev dummy31
ip next add id 32 via 1.1.1.32 dev dummy32

i=100

while [ $i -le 200 ]
do
ip next add id $i group 1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19

	echo $i

	((i++))

done

ip next add id 999 group 1/2/3/4/5/6

ip next ls

========================

Fixes: ab84be7e54 ("net: Initial nexthop code")
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-20 21:00:40 -07:00
Eric Dumazet
687775cec0 ax25: fix setsockopt(SO_BINDTODEVICE)
syzbot was able to trigger this trace [1], probably by using
a zero optlen.

While we are at it, cap optlen to IFNAMSIZ - 1 instead of IFNAMSIZ.

[1]
BUG: KMSAN: uninit-value in strnlen+0xf9/0x170 lib/string.c:569
CPU: 0 PID: 8807 Comm: syz-executor483 Not tainted 5.7.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x220 lib/dump_stack.c:118
 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:121
 __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
 strnlen+0xf9/0x170 lib/string.c:569
 dev_name_hash net/core/dev.c:207 [inline]
 netdev_name_node_lookup net/core/dev.c:277 [inline]
 __dev_get_by_name+0x75/0x2b0 net/core/dev.c:778
 ax25_setsockopt+0xfa3/0x1170 net/ax25/af_ax25.c:654
 __compat_sys_setsockopt+0x4ed/0x910 net/compat.c:403
 __do_compat_sys_setsockopt net/compat.c:413 [inline]
 __se_compat_sys_setsockopt+0xdd/0x100 net/compat.c:410
 __ia32_compat_sys_setsockopt+0x62/0x80 net/compat.c:410
 do_syscall_32_irqs_on arch/x86/entry/common.c:339 [inline]
 do_fast_syscall_32+0x3bf/0x6d0 arch/x86/entry/common.c:398
 entry_SYSENTER_compat+0x68/0x77 arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7f57dd9
Code: 90 e8 0b 00 00 00 f3 90 0f ae e8 eb f9 8d 74 26 00 89 3c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000ffae8c1c EFLAGS: 00000217 ORIG_RAX: 000000000000016e
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000000101
RDX: 0000000000000019 RSI: 0000000020000000 RDI: 0000000000000004
RBP: 0000000000000012 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Local variable ----devname@ax25_setsockopt created at:
 ax25_setsockopt+0xe6/0x1170 net/ax25/af_ax25.c:536
 ax25_setsockopt+0xe6/0x1170 net/ax25/af_ax25.c:536

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-20 20:59:07 -07:00
Al Viro
0edecc020b atm: switch do_atmif_sioc() to direct use of atm_dev_ioctl()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:36 -04:00
Al Viro
8cacb41659 atm: lift copyin from atm_dev_ioctl()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:35 -04:00
Al Viro
36085049bc atm: switch do_atm_iobuf() to direct use of atm_getnames()
... and sod the compat_alloc_user_space() with its complications

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:35 -04:00
Al Viro
a3929484af atm: move copyin from atm_getnames() into the caller
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:34 -04:00
Al Viro
8c2348e36a atm: separate ATM_GETNAMES handling from the rest of atm_dev_ioctl()
atm_dev_ioctl() does copyin in two different ways - one for
ATM_GETNAMES, another for everything else.  Start with separating
the former into a new helper (atm_getnames()).  The next step
will be to lift the copyin into the callers.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:33 -04:00
Al Viro
38c53ca3c1 batadv_socket_read(): get rid of pointless access_ok()
address is passed only to copy_to_user()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:33 -04:00
Al Viro
bbced07d99 get rid of compat_mc_setsockopt()
not used anymore

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:32 -04:00
Al Viro
b212c322c8 handle the group_source_req options directly
Native ->setsockopt() handling of these options (MCAST_..._SOURCE_GROUP
and MCAST_{,UN}BLOCK_SOURCE) consists of copyin + call of a helper that
does the actual work.  The only change needed for ->compat_setsockopt()
is a slightly different copyin - the helpers can be reused as-is.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:32 -04:00
Al Viro
fcfa0b09d3 ipv6: take handling of group_source_req options into a helper
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:31 -04:00
Al Viro
2bbf8c1ead ipv4: take handling of group_source_req options into a helper
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:31 -04:00
Al Viro
2f984f11fd ipv[46]: do compat setsockopt for MCAST_{JOIN,LEAVE}_GROUP directly
direct parallel to the way these two are handled in the native
->setsockopt() instances - the helpers that do the real work
are already separated and can be reused as-is in this case.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:30 -04:00
Al Viro
168a2cca81 ipv6: do compat setsockopt for MCAST_MSFILTER directly
similar to the ipv4 counterpart of that patch - the same
trick used to align the tail array properly.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:29 -04:00
Al Viro
d59eb177c8 ip6_mc_msfilter(): pass the address list separately
that way we'll be able to reuse it for compat case

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:29 -04:00
Al Viro
2e04172875 ipv4: do compat setsockopt for MCAST_MSFILTER directly
Parallel to what the native setsockopt() does, except that unlike
the native setsockopt() we do not use memdup_user() - we want
the sockaddr_storage fields properly aligned, so we allocate
4 bytes more and copy compat_group_filter at the offset 4,
which yields the proper alignments.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:28 -04:00
Al Viro
e986d4dabc set_mcast_msfilter(): take the guts of setsockopt(MCAST_MSFILTER) into a helper
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:28 -04:00
Al Viro
0dfe6581a7 get rid of compat_mc_getsockopt()
now we can do MCAST_MSFILTER in compat ->getsockopt() without
playing silly buggers with copying things back and forth.
We can form a native struct group_filter (sans the variable-length
tail) on stack, pass that + pointer to the tail of original request
to the helper doing the bulk of the work, then do the rest of
copyout - same as the native getsockopt() does.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:27 -04:00
Al Viro
931ca7ab7f ip*_mc_gsfget(): lift copyout of struct group_filter into callers
pass the userland pointer to the array in its tail, so that part
gets copied out by our functions; copyout of everything else is
done in the callers.  Rationale: reuse for compat; the array
is the same in native and compat, the layout of parts before it
is different for compat.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:27 -04:00
Al Viro
e9c375fb5e compat_ip{,v6}_setsockopt(): enumerate MCAST_... options explicitly
We want to check if optname is among the MCAST_... ones; do that as
an explicit switch.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:26 -04:00
Al Viro
63287de66d lift compat definitions of mcast [sg]etsockopt requests into net/compat.h
We want to get rid of compat_mc_[sg]etsockopt() and to have that stuff
handled without compat_alloc_user_space(), extra copying through
userland, etc.  To do that we'll need ipv4 and ipv6 instances of
->compat_[sg]etsockopt() to manipulate the 32bit variants of mcast
requests, so we need to move the definitions of those out of net/compat.c
and into a public header.

This patch just does a mechanical move to include/net/compat.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:25 -04:00
Chuck Lever
8954c5c212 SUNRPC: Clean up request deferral tracepoints
- Rename these so they are easy to enable and search for as a set
- Move the tracepoints to get a more accurate sense of control flow
- Tracepoints should not fire on xprt shutdown
- Display memory address in case data structure had been corrupted
- Abandon dprintk in these paths

I haven't ever gotten one of these tracepoints to trigger. I wonder
if we should simply remove them.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-20 17:30:44 -04:00
Chuck Lever
fff1ebb269 SUNRPC: Restructure svc_udp_recvfrom()
Clean up. At this point, we are not ready yet to support bio_vecs in
the UDP transport implementation. However, we can clean up
svc_udp_recvfrom() to match the tracing and straight-lining recently
changes made in svc_tcp_recvfrom().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-20 17:30:24 -04:00
Chuck Lever
ca07eda33e SUNRPC: Refactor svc_recvfrom()
This function is not currently "generic" so remove the documenting
comment and rename it appropriately. Its internals are converted to
use bio_vecs for reading from the transport socket.

In existing typical sunrpc uses of bio_vecs, the bio_vec array is
allocated dynamically. Here, instead, an array of bio_vecs is added
to svc_rqst. The lifetime of this array can be greater than one call
to xpo_recvfrom():

- Multiple calls to xpo_recvfrom() might be needed to read an RPC
  message completely.

- At some later point, rq_arg.bvecs will point to this array and it
  will carry the received message into svc_process().

I also expect that a future optimization will remove either the
rq_vec or rq_pages array in favor of rq_bvec, thus conserving the
size of struct svc_rqst.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-20 17:30:12 -04:00
John Hubbard
f78cdbd75a rds: fix crash in rds_info_getsockopt()
The conversion to pin_user_pages() had a bug: it overlooked
the case of allocation of pages failing. Fix that by restoring
an equivalent check.

Reported-by: syzbot+118ac0af4ac7f785a45b@syzkaller.appspotmail.com
Fixes: dbfe7d7437 ("rds: convert get_user_pages() --> pin_user_pages()")

Cc: David S. Miller <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: rds-devel@oss.oracle.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-20 14:08:06 -07:00
Chuck Lever
cca557a5a6 SUNRPC: Clean up svc_release_skb() functions
Rename these functions using the convention used for other xpo
method entry points.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-20 15:44:18 -04:00
Chuck Lever
6be8c59491 SUNRPC: Refactor recvfrom path dealing with incomplete TCP receives
Clean up: move exception processing out of the main path.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-20 15:43:35 -04:00
Chuck Lever
7dae1dd726 SUNRPC: Replace dprintk() call sites in TCP receive path
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-20 15:30:51 -04:00
Christoph Hellwig
b8d9e7f241 fs: make the pipe_buf_operations ->confirm operation optional
Just return 0 for success if it is not present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 12:11:26 -04:00
Christoph Hellwig
76887c2567 fs: make the pipe_buf_operations ->steal operation optional
Just return 1 for failure if it is not present.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 12:11:26 -04:00
Luiz Augusto von Dentz
755dfcbca8 Bluetooth: Fix assuming EIR flags can result in SSP authentication
EIR flags should just hint if SSP may be supported but we shall verify
this with use of the actual features as the SSP bits may be disabled in
the lower layers which would result in legacy authentication to be
used.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-20 16:33:43 +02:00
David Howells
441fdee1ea rxrpc: Fix ack discard
The Rx protocol has a "previousPacket" field in it that is not handled in
the same way by all protocol implementations.  Sometimes it contains the
serial number of the last DATA packet received, sometimes the sequence
number of the last DATA packet received and sometimes the highest sequence
number so far received.

AF_RXRPC is using this to weed out ACKs that are out of date (it's possible
for ACK packets to get reordered on the wire), but this does not work with
OpenAFS which will just stick the sequence number of the last packet seen
into previousPacket.

The issue being seen is that big AFS FS.StoreData RPC (eg. of ~256MiB) are
timing out when partly sent.  A trace was captured, with an additional
tracepoint to show ACKs being discarded in rxrpc_input_ack().  Here's an
excerpt showing the problem.

 52873.203230: rxrpc_tx_data: c=000004ae DATA ed1a3584:00000002 0002449c q=00024499 fl=09

A DATA packet with sequence number 00024499 has been transmitted (the "q="
field).

 ...
 52873.243296: rxrpc_rx_ack: c=000004ae 00012a2b DLY r=00024499 f=00024497 p=00024496 n=0
 52873.243376: rxrpc_rx_ack: c=000004ae 00012a2c IDL r=0002449b f=00024499 p=00024498 n=0
 52873.243383: rxrpc_rx_ack: c=000004ae 00012a2d OOS r=0002449d f=00024499 p=0002449a n=2

The Out-Of-Sequence ACK indicates that the server didn't see DATA sequence
number 00024499, but did see seq 0002449a (previousPacket, shown as "p=",
skipped the number, but firstPacket, "f=", which shows the bottom of the
window is set at that point).

 52873.252663: rxrpc_retransmit: c=000004ae q=24499 a=02 xp=14581537
 52873.252664: rxrpc_tx_data: c=000004ae DATA ed1a3584:00000002 000244bc q=00024499 fl=0b *RETRANS*

The packet has been retransmitted.  Retransmission recurs until the peer
says it got the packet.

 52873.271013: rxrpc_rx_ack: c=000004ae 00012a31 OOS r=000244a1 f=00024499 p=0002449e n=6

More OOS ACKs indicate that the other packets that are already in the
transmission pipeline are being received.  The specific-ACK list is up to 6
ACKs and NAKs.

 ...
 52873.284792: rxrpc_rx_ack: c=000004ae 00012a49 OOS r=000244b9 f=00024499 p=000244b6 n=30
 52873.284802: rxrpc_retransmit: c=000004ae q=24499 a=0a xp=63505500
 52873.284804: rxrpc_tx_data: c=000004ae DATA ed1a3584:00000002 000244c2 q=00024499 fl=0b *RETRANS*
 52873.287468: rxrpc_rx_ack: c=000004ae 00012a4a OOS r=000244ba f=00024499 p=000244b7 n=31
 52873.287478: rxrpc_rx_ack: c=000004ae 00012a4b OOS r=000244bb f=00024499 p=000244b8 n=32

At this point, the server's receive window is full (n=32) with presumably 1
NAK'd packet and 31 ACK'd packets.  We can't transmit any more packets.

 52873.287488: rxrpc_retransmit: c=000004ae q=24499 a=0a xp=61327980
 52873.287489: rxrpc_tx_data: c=000004ae DATA ed1a3584:00000002 000244c3 q=00024499 fl=0b *RETRANS*
 52873.293850: rxrpc_rx_ack: c=000004ae 00012a4c DLY r=000244bc f=000244a0 p=00024499 n=25

And now we've received an ACK indicating that a DATA retransmission was
received.  7 packets have been processed (the occupied part of the window
moved, as indicated by f= and n=).

 52873.293853: rxrpc_rx_discard_ack: c=000004ae r=00012a4c 000244a0<00024499 00024499<000244b8

However, the DLY ACK gets discarded because its previousPacket has gone
backwards (from p=000244b8, in the ACK at 52873.287478 to p=00024499 in the
ACK at 52873.293850).

We then end up in a continuous cycle of retransmit/discard.  kafs fails to
update its window because it's discarding the ACKs and can't transmit an
extra packet that would clear the issue because the window is full.
OpenAFS doesn't change the previousPacket value in the ACKs because no new
DATA packets are received with a different previousPacket number.

Fix this by altering the discard check to only discard an ACK based on
previousPacket if there was no advance in the firstPacket.  This allows us
to transmit a new packet which will cause previousPacket to advance in the
next ACK.

The check, however, needs to allow for the possibility that previousPacket
may actually have had the serial number placed in it instead - in which
case it will go outside the window and we should ignore it.

Fixes: 1a2391c30c ("rxrpc: Fix detection of out of order acks")
Reported-by: Dave Botsch <botsch@cnf.cornell.edu>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-20 15:33:41 +01:00
David Howells
d1f129470e rxrpc: Trace discarded ACKs
Add a tracepoint to track received ACKs that are discarded due to being
outside of the Tx window.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-20 15:33:41 +01:00
Luiz Augusto von Dentz
3ca44c16b0 Bluetooth: Consolidate encryption handling in hci_encrypt_cfm
This makes hci_encrypt_cfm calls hci_connect_cfm in case the connection
state is BT_CONFIG so callers don't have to check the state.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-20 16:30:33 +02:00
Eric Dumazet
4f65e2f483 net: unexport skb_gro_receive()
skb_gro_receive() used to be used by SCTP, it is no longer the case.

skb_gro_receive_list() is in the same category : never used from modules.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 19:55:36 -07:00
Neil Horman
20a785aa52 sctp: Don't add the shutdown timer if its already been added
This BUG halt was reported a while back, but the patch somehow got
missed:

PID: 2879   TASK: c16adaa0  CPU: 1   COMMAND: "sctpn"
 #0 [f418dd28] crash_kexec at c04a7d8c
 #1 [f418dd7c] oops_end at c0863e02
 #2 [f418dd90] do_invalid_op at c040aaca
 #3 [f418de28] error_code (via invalid_op) at c08631a5
    EAX: f34baac0  EBX: 00000090  ECX: f418deb0  EDX: f5542950  EBP: 00000000
    DS:  007b      ESI: f34ba800  ES:  007b      EDI: f418dea0  GS:  00e0
    CS:  0060      EIP: c046fa5e  ERR: ffffffff  EFLAGS: 00010286
 #4 [f418de5c] add_timer at c046fa5e
 #5 [f418de68] sctp_do_sm at f8db8c77 [sctp]
 #6 [f418df30] sctp_primitive_SHUTDOWN at f8dcc1b5 [sctp]
 #7 [f418df48] inet_shutdown at c080baf9
 #8 [f418df5c] sys_shutdown at c079eedf
 #9 [f418df70] sys_socketcall at c079fe88
    EAX: ffffffda  EBX: 0000000d  ECX: bfceea90  EDX: 0937af98
    DS:  007b      ESI: 0000000c  ES:  007b      EDI: b7150ae4
    SS:  007b      ESP: bfceea7c  EBP: bfceeaa8  GS:  0033
    CS:  0073      EIP: b775c424  ERR: 00000066  EFLAGS: 00000282

It appears that the side effect that starts the shutdown timer was processed
multiple times, which can happen as multiple paths can trigger it.  This of
course leads to the BUG halt in add_timer getting called.

Fix seems pretty straightforward, just check before the timer is added if its
already been started.  If it has mod the timer instead to min(current
expiration, new expiration)

Its been tested but not confirmed to fix the problem, as the issue has only
occured in production environments where test kernels are enjoined from being
installed.  It appears to be a sane fix to me though.  Also, recentely,
Jere found a reproducer posted on list to confirm that this resolves the
issues

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: jere.leppanen@nokia.com
CC: marcelo.leitner@gmail.com
CC: netdev@vger.kernel.org
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 15:46:52 -07:00
Christoph Hellwig
8e3db0bbb2 ipv6: use ->ndo_tunnel_ctl in addrconf_set_dstaddr
Use the new ->ndo_tunnel_ctl instead of overriding the address limit
and using ->ndo_do_ioctl just to do a pointless user copy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 15:45:12 -07:00
Christoph Hellwig
68ad6886dd ipv6: streamline addrconf_set_dstaddr
Factor out a addrconf_set_sit_dstaddr helper for the actual work if we
found a SIT device, and only hold the rtnl lock around the device lookup
and that new helper, as there is no point in holding it over a
copy_from_user call.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 15:45:12 -07:00
Christoph Hellwig
f098846044 ipv6: stub out even more of addrconf_set_dstaddr if SIT is disabled
There is no point in copying the structure from userspace or looking up
a device if SIT support is not disabled and we'll eventually return
-ENODEV anyway.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 15:45:12 -07:00
Christoph Hellwig
f60fe2df93 sit: impement ->ndo_tunnel_ctl
Implement the ->ndo_tunnel_ctl method, and use ip_tunnel_ioctl to
handle userspace requests for the SIOCGETTUNNEL, SIOCADDTUNNEL,
SIOCCHGTUNNEL and SIOCDELTUNNEL ioctls.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 15:45:12 -07:00
Christoph Hellwig
fd5d687b76 sit: refactor ipip6_tunnel_ioctl
Split the ioctl handler into one function per command instead of having
a all the logic sit in one giant switch statement.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 15:45:12 -07:00
Christoph Hellwig
c7e3670516 impr: use ->ndo_tunnel_ctl in ipmr_new_tunnel
Use the new ->ndo_tunnel_ctl instead of overriding the address limit
and using ->ndo_do_ioctl just to do a pointless user copy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 15:45:12 -07:00
Christoph Hellwig
607259a695 net: add a new ndo_tunnel_ioctl method
This method is used to properly allow kernel callers of the IPv4 route
management ioctls.  The exsting ip_tunnel_ioctl helper is renamed to
ip_tunnel_ctl to better reflect that it doesn't directly implement ioctls
touching user memory, and is used for the guts of ndo_tunnel_ctl
implementations. A new ip_tunnel_ioctl helper is added that can be wired
up directly to the ndo_do_ioctl method and takes care of the copy to and
from userspace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 15:45:11 -07:00
Christoph Hellwig
c1fd1182c4 ipv4: consolidate the VIFF_TUNNEL handling in ipmr_new_tunnel
Also move the dev_set_allmulti call and the error handling into the
ioctl helper.  This allows reusing already looked up tunnel_dev pointer
and the set up argument structure for the deletion in the error handler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 15:45:11 -07:00
Christoph Hellwig
c384b8a70c ipv4: streamline ipmr_new_tunnel
Reduce a few level of indentation to simplify the function.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 15:45:11 -07:00
Boris Sukholitko
c0bbbdc32f __netif_receive_skb_core: pass skb by reference
__netif_receive_skb_core may change the skb pointer passed into it (e.g.
in rx_handler). The original skb may be freed as a result of this
operation.

The callers of __netif_receive_skb_core may further process original skb
by using pt_prev pointer returned by __netif_receive_skb_core thus
leading to unpleasant effects.

The solution is to pass skb by reference into __netif_receive_skb_core.

v2: Added Fixes tag and comment regarding ppt_prev and skb invariant.

Fixes: 88eb1944e1 ("net: core: propagate SKB lists through packet_type lookup")
Signed-off-by: Boris Sukholitko <boris.sukholitko@broadcom.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 15:38:00 -07:00
Martin KaFai Lau
88d7fcfa3b net: inet_csk: Fix so_reuseport bind-address cache in tb->fast*
The commit 637bc8bbe6 ("inet: reset tb->fastreuseport when adding a reuseport sk")
added a bind-address cache in tb->fast*.  The tb->fast* caches the address
of a sk which has successfully been binded with SO_REUSEPORT ON.  The idea
is to avoid the expensive conflict search in inet_csk_bind_conflict().

There is an issue with wildcard matching where sk_reuseport_match() should
have returned false but it is currently returning true.  It ends up
hiding bind conflict.  For example,

bind("[::1]:443"); /* without SO_REUSEPORT. Succeed. */
bind("[::2]:443"); /* with    SO_REUSEPORT. Succeed. */
bind("[::]:443");  /* with    SO_REUSEPORT. Still Succeed where it shouldn't */

The last bind("[::]:443") with SO_REUSEPORT on should have failed because
it should have a conflict with the very first bind("[::1]:443") which
has SO_REUSEPORT off.  However, the address "[::2]" is cached in
tb->fast* in the second bind. In the last bind, the sk_reuseport_match()
returns true because the binding sk's wildcard addr "[::]" matches with
the "[::2]" cached in tb->fast*.

The correct bind conflict is reported by removing the second
bind such that tb->fast* cache is not involved and forces the
bind("[::]:443") to go through the inet_csk_bind_conflict():

bind("[::1]:443"); /* without SO_REUSEPORT. Succeed. */
bind("[::]:443");  /* with    SO_REUSEPORT. -EADDRINUSE */

The expected behavior for sk_reuseport_match() is, it should only allow
the "cached" tb->fast* address to be used as a wildcard match but not
the address of the binding sk.  To do that, the current
"bool match_wildcard" arg is split into
"bool match_sk1_wildcard" and "bool match_sk2_wildcard".

This change only affects the sk_reuseport_match() which is only
used by inet_csk (e.g. TCP).
The other use cases are calling inet_rcv_saddr_equal() and
this patch makes it pass the same "match_wildcard" arg twice to
the "ipv[46]_rcv_saddr_equal(..., match_wildcard, match_wildcard)".

Cc: Josef Bacik <jbacik@fb.com>
Fixes: 637bc8bbe6 ("inet: reset tb->fastreuseport when adding a reuseport sk")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 15:36:21 -07:00
Julian Wiedmann
e9a36ca5f6 net/af_iucv: clean up function prototypes
Remove a bunch of forward declarations (trivially shifting code around
where needed), and make a few functions static.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 12:50:14 -07:00
Julian Wiedmann
dca1262f97 net/af_iucv: remove a redundant zero initialization
txmsg is declared as {0}, no need to clear individual fields later on.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 12:50:14 -07:00
Julian Wiedmann
0d1c7664ed net/af_iucv: replace open-coded U16_MAX
Improve the readability of a range check.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 12:50:14 -07:00
Julian Wiedmann
585bc22095 net/af_iucv: remove pm support
commit 394216275c ("s390: remove broken hibernate / power management support")
removed support for ARCH_HIBERNATION_POSSIBLE from s390.

So drop the unused pm ops from the s390-only af_iucv socket code.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 12:50:14 -07:00
Julian Wiedmann
4b32f86bf1 net/iucv: remove pm support
commit 394216275c ("s390: remove broken hibernate / power management support")
removed support for ARCH_HIBERNATION_POSSIBLE from s390.

So drop the unused pm ops from the s390-only iucv bus driver.

CC: Hendrik Brueckner <brueckner@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 12:50:14 -07:00
Todd Malsbary
12555a2d97 mptcp: use rightmost 64 bits in ADD_ADDR HMAC
This changes the HMAC used in the ADD_ADDR option from the leftmost 64
bits to the rightmost 64 bits as described in RFC 8684, section 3.4.1.

This issue was discovered while adding support to packetdrill for the
ADD_ADDR v1 option.

Fixes: 3df523ab58 ("mptcp: Add ADD_ADDR handling")
Signed-off-by: Todd Malsbary <todd.malsbary@linux.intel.com>
Acked-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 12:35:51 -07:00
Daniel Borkmann
1b66d25361 bpf: Add get{peer, sock}name attach types for sock_addr
As stated in 983695fa67 ("bpf: fix unconnected udp hooks"), the objective
for the existing cgroup connect/sendmsg/recvmsg/bind BPF hooks is to be
transparent to applications. In Cilium we make use of these hooks [0] in
order to enable E-W load balancing for existing Kubernetes service types
for all Cilium managed nodes in the cluster. Those backends can be local
or remote. The main advantage of this approach is that it operates as close
as possible to the socket, and therefore allows to avoid packet-based NAT
given in connect/sendmsg/recvmsg hooks we only need to xlate sock addresses.

This also allows to expose NodePort services on loopback addresses in the
host namespace, for example. As another advantage, this also efficiently
blocks bind requests for applications in the host namespace for exposed
ports. However, one missing item is that we also need to perform reverse
xlation for inet{,6}_getname() hooks such that we can return the service
IP/port tuple back to the application instead of the remote peer address.

The vast majority of applications does not bother about getpeername(), but
in a few occasions we've seen breakage when validating the peer's address
since it returns unexpectedly the backend tuple instead of the service one.
Therefore, this trivial patch allows to customise and adds a getpeername()
as well as getsockname() BPF cgroup hook for both IPv4 and IPv6 in order
to address this situation.

Simple example:

  # ./cilium/cilium service list
  ID   Frontend     Service Type   Backend
  1    1.2.3.4:80   ClusterIP      1 => 10.0.0.10:80

Before; curl's verbose output example, no getpeername() reverse xlation:

  # curl --verbose 1.2.3.4
  * Rebuilt URL to: 1.2.3.4/
  *   Trying 1.2.3.4...
  * TCP_NODELAY set
  * Connected to 1.2.3.4 (10.0.0.10) port 80 (#0)
  > GET / HTTP/1.1
  > Host: 1.2.3.4
  > User-Agent: curl/7.58.0
  > Accept: */*
  [...]

After; with getpeername() reverse xlation:

  # curl --verbose 1.2.3.4
  * Rebuilt URL to: 1.2.3.4/
  *   Trying 1.2.3.4...
  * TCP_NODELAY set
  * Connected to 1.2.3.4 (1.2.3.4) port 80 (#0)
  > GET / HTTP/1.1
  >  Host: 1.2.3.4
  > User-Agent: curl/7.58.0
  > Accept: */*
  [...]

Originally, I had both under a BPF_CGROUP_INET{4,6}_GETNAME type and exposed
peer to the context similar as in inet{,6}_getname() fashion, but API-wise
this is suboptimal as it always enforces programs having to test for ctx->peer
which can easily be missed, hence BPF_CGROUP_INET{4,6}_GET{PEER,SOCK}NAME split.
Similarly, the checked return code is on tnum_range(1, 1), but if a use case
comes up in future, it can easily be changed to return an error code instead.
Helper and ctx member access is the same as with connect/sendmsg/etc hooks.

  [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Andrey Ignatov <rdna@fb.com>
Link: https://lore.kernel.org/bpf/61a479d759b2482ae3efb45546490bacd796a220.1589841594.git.daniel@iogearbox.net
2020-05-19 11:32:04 -07:00
Jesper Dangaard Brouer
d800bad67d bpf: Fix too large copy from user in bpf_test_init
Commit bc56c919fc ("bpf: Add xdp.frame_sz in bpf_prog_test_run_xdp().")
recently changed bpf_prog_test_run_xdp() to use larger frames for XDP in
order to test tail growing frames (via bpf_xdp_adjust_tail) and to have
memory backing frame better resemble drivers.

The commit contains a bug, as it tries to copy the max data size from
userspace, instead of the size provided by userspace.  This cause XDP
unit tests to fail sporadically with EFAULT, an unfortunate behavior.
The fix is to only copy the size specified by userspace.

Fixes: bc56c919fc ("bpf: Add xdp.frame_sz in bpf_prog_test_run_xdp().")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/158980712729.256597.6115007718472928659.stgit@firesoul
2020-05-19 17:56:34 +02:00
Alexey Gladkov
9d78edeaec proc: proc_pid_ns takes super_block as an argument
syzbot found that

  touch /proc/testfile

causes NULL pointer dereference at tomoyo_get_local_path()
because inode of the dentry is NULL.

Before c59f415a7c, Tomoyo received pid_ns from proc's s_fs_info
directly. Since proc_pid_ns() can only work with inode, using it in
the tomoyo_get_local_path() was wrong.

To avoid creating more functions for getting proc_ns, change the
argument type of the proc_pid_ns() function. Then, Tomoyo can use
the existing super_block to get pid_ns.

Link: https://lkml.kernel.org/r/0000000000002f0c7505a5b0e04c@google.com
Link: https://lkml.kernel.org/r/20200518180738.2939611-1-gladkov.alexey@gmail.com
Reported-by: syzbot+c1af344512918c61362c@syzkaller.appspotmail.com
Fixes: c59f415a7c ("Use proc_pid_ns() to get pid_namespace from the proc superblock")
Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com>
Signed-off-by: Eric W. Biederman <ebiederm@xmission.com>
2020-05-19 07:07:50 -05:00
Christoph Hellwig
dc13c8761c ipv4,appletalk: move SIOCADDRT and SIOCDELRT handling into ->compat_ioctl
To prepare removing the global routing_ioctl hack start lifting the code
into the ipv4 and appletalk ->compat_ioctl handlers.  Unlike the existing
handler we don't bother copying in the name - there are no compat issues for
char arrays.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-18 17:35:02 -07:00
Christoph Hellwig
a500492354 appletalk: factor out a atrtr_ioctl_addrt helper
Add a helper than can be shared with the upcoming compat ioctl handler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-18 17:35:02 -07:00
Christoph Hellwig
3986912f6a ipv6: move SIOCADDRT and SIOCDELRT handling into ->compat_ioctl
To prepare removing the global routing_ioctl hack start lifting the code
into a newly added ipv6 ->compat_ioctl handler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-18 17:35:02 -07:00
Christoph Hellwig
7c1552da90 ipv6: lift copy_from_user out of ipv6_route_ioctl
Prepare for better compat ioctl handling by moving the user copy out
of ipv6_route_ioctl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-18 17:35:02 -07:00
Chuck Lever
a5cda73e49 SUNRPC: Restructure svc_tcp_recv_record()
Refactor: svc_recvfrom() is going to be converted to read into
bio_vecs in a moment. Unhook the only other caller,
svc_tcp_recv_record(), which just wants to read the 4-byte stream
record marker into a kvec.

While we're in the area, streamline this helper by straight-lining
the hot path, replace dprintk call sites with tracepoints, and
reduce the number of atomic bit operations in this path.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:23 -04:00
Chuck Lever
02648908d1 SUNRPC: Rename svc_sock::sk_reclen
Clean up. I find the name of the svc_sock::sk_reclen field
confusing, so I've changed it to better reflect its function. This
field is not read directly to get the record length. Rather, it is
a buffer containing a record marker that needs to be decoded.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:23 -04:00
Chuck Lever
b4af59328c SUNRPC: Trace server-side rpcbind registration events
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:22 -04:00
Chuck Lever
a0469f46fa SUNRPC: Replace dprintk call sites in TCP state change callouts
Report TCP socket state changes and accept failures via
tracepoints, replacing dprintk() call sites.

No tracepoint is added in svc_tcp_listen_data_ready. There's
no information available there that isn't also reported by the
svcsock_new_socket and the accept failure tracepoints.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:22 -04:00
Chuck Lever
998024dee1 SUNRPC: Add more svcsock tracepoints
In addition to tracing recently-updated socket sendto events, this
commit adds a trace event class that can be used for additional
svcsock-related tracepoints in subsequent patches.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:22 -04:00
Chuck Lever
d998882b4b SUNRPC: Remove "#include <trace/events/skb.h>"
Clean up: Commit 850cbaddb5 ("udp: use it's own memory accounting
schema") removed the last skb-related tracepoint from svcsock.c, so
it is no longer necessary to include trace/events/skb.h.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:22 -04:00
Chuck Lever
11bbb0f76e SUNRPC: Trace a few more generic svc_xprt events
In lieu of dprintks or tracepoints in each individual transport
implementation, introduce tracepoints in the generic part of the RPC
layer. These typically fire for connection lifetime events, so
shouldn't contribute a lot of noise.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:22 -04:00
Chuck Lever
4b8f380e46 SUNRPC: Tracepoint to record errors in svc_xpo_create()
Capture transport creation failures.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:22 -04:00
Chuck Lever
e979a173a0 svcrdma: Add tracepoints to report ->xpo_accept failures
Failure to accept a connection is typically due to a problem
specific to a transport type. Also, ->xpo_accept returns NULL
on error rather than reporting a specific problem.

So, add failure-specific tracepoints in svc_rdma_accept().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:22 -04:00
Chuck Lever
decc13f7eb svcrdma: Displayed remote IP address should match stored address
Clean up: After commit 1e091c3bbf ("svcrdma: Ignore source port
when computing DRC hash"), the IP address stored in xpt_remote
always has a port number of zero. Thus, there's no need to display
the port number when displaying the IP address of a remote NFS/RDMA
client.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:22 -04:00
Chuck Lever
27ce629444 svcrdma: Rename tracepoints that record header decoding errors
Clean up: Use a consistent naming convention so that these trace
points can be enabled quickly via a glob.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:21 -04:00
Chuck Lever
f5046b8f43 svcrdma: Remove backchannel dprintk call sites
Clean up.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:21 -04:00
Chuck Lever
ea740bd5f5 svcrdma: Fix backchannel return code
Way back when I was writing the RPC/RDMA server-side backchannel
code, I misread the TCP backchannel reply handler logic. When
svc_tcp_recvfrom() successfully receives a backchannel reply, it
does not return -EAGAIN. It sets XPT_DATA and returns zero.

Update svc_rdma_recvfrom() to return zero. Here, XPT_DATA doesn't
need to be set again: it is set whenever a new message is received,
behind a spin lock in a single threaded context.

Also, if handling the cb reply is not successful, the message is
simply dropped. There's no special message framing to deal with as
there is in the TCP case.

Now that the handle_bc_reply() return value is ignored, I've removed
the dprintk call sites in the error exit of handle_bc_reply() in
favor of trace points in other areas that already report the error
cases.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:21 -04:00
Chuck Lever
dbc17acd5d svcrdma: trace undersized Write chunks
Clean up: Replace a dprintk call site.

This is the last remaining dprintk call site in svc_rdma_rw.c, so
remove dprintk infrastructure as well.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:21 -04:00
Chuck Lever
9d20063892 svcrdma: Trace page overruns when constructing RDMA Reads
Clean up: Replace a dprintk call site with a tracepoint.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:21 -04:00
Chuck Lever
f4e53e1ce3 svcrdma: Clean up handling of get_rw_ctx errors
Clean up: Replace two dprintk call sites with a tracepoint.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:21 -04:00
Chuck Lever
2abfbe7e3a svcrdma: Clean up the tracing for rw_ctx_init errors
- De-duplicate code
- Rename the tracepoint with "_err" to allow enabling via glob
- Report the sg_cnt for the failing rw_ctx
- Fix a dumb signage issue

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:21 -04:00
Chuck Lever
ca4faf543a SUNRPC: Move xpt_mutex into socket xpo_sendto methods
It appears that the RPC/RDMA transport does not need serialization
of calls to its xpo_sendto method. Move the mutex into the socket
methods that still need that serialization.

Tail latencies are unambiguously better with this patch applied.
fio randrw 8KB 70/30 on NFSv3, smaller numbers are better:

    clat percentiles (usec):

With xpt_mutex:
r    | 99.99th=[ 8848]
w    | 99.99th=[ 9634]

Without xpt_mutex:
r    | 99.99th=[ 8586]
w    | 99.99th=[ 8979]

Serializing the construction of RPC/RDMA transport headers is not
really necessary at this point, because the Linux NFS server
implementation never changes its credit grant on a connection. If
that should change, then svc_rdma_sendto will need to serialize
access to the transport's credit grant fields.

Reported-by: kbuild test robot <lkp@intel.com>
[ cel: fix uninitialized variable warning ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-05-18 10:21:21 -04:00
Hsin-Yu Chao
56b5453a86 Bluetooth: Add SCO fallback for invalid LMP parameters error
Bluetooth PTS test case HFP/AG/ACC/BI-12-I accepts SCO connection
with invalid parameter at the first SCO request expecting AG to
attempt another SCO request with the use of "safe settings" for
given codec, base on section 5.7.1.2 of HFP 1.7 specification.

This patch addresses it by adding "Invalid LMP Parameters" (0x1e)
to the SCO fallback case. Verified with below log:

< HCI Command: Setup Synchronous Connection (0x01|0x0028) plen 17
        Handle: 256
        Transmit bandwidth: 8000
        Receive bandwidth: 8000
        Max latency: 13
        Setting: 0x0003
          Input Coding: Linear
          Input Data Format: 1's complement
          Input Sample Size: 8-bit
          # of bits padding at MSB: 0
          Air Coding Format: Transparent Data
        Retransmission effort: Optimize for link quality (0x02)
        Packet type: 0x0380
          3-EV3 may not be used
          2-EV5 may not be used
          3-EV5 may not be used
> HCI Event: Command Status (0x0f) plen 4
      Setup Synchronous Connection (0x01|0x0028) ncmd 1
        Status: Success (0x00)
> HCI Event: Number of Completed Packets (0x13) plen 5
        Num handles: 1
        Handle: 256
        Count: 1
> HCI Event: Max Slots Change (0x1b) plen 3
        Handle: 256
        Max slots: 1
> HCI Event: Synchronous Connect Complete (0x2c) plen 17
        Status: Invalid LMP Parameters / Invalid LL Parameters (0x1e)
        Handle: 0
        Address: 00:1B:DC:F2:21:59 (OUI 00-1B-DC)
        Link type: eSCO (0x02)
        Transmission interval: 0x00
        Retransmission window: 0x02
        RX packet length: 0
        TX packet length: 0
        Air mode: Transparent (0x03)
< HCI Command: Setup Synchronous Connection (0x01|0x0028) plen 17
        Handle: 256
        Transmit bandwidth: 8000
        Receive bandwidth: 8000
        Max latency: 8
        Setting: 0x0003
          Input Coding: Linear
          Input Data Format: 1's complement
          Input Sample Size: 8-bit
          # of bits padding at MSB: 0
          Air Coding Format: Transparent Data
        Retransmission effort: Optimize for link quality (0x02)
        Packet type: 0x03c8
          EV3 may be used
          2-EV3 may not be used
          3-EV3 may not be used
          2-EV5 may not be used
          3-EV5 may not be used
> HCI Event: Command Status (0x0f) plen 4
      Setup Synchronous Connection (0x01|0x0028) ncmd 1
        Status: Success (0x00)
> HCI Event: Max Slots Change (0x1b) plen 3
        Handle: 256
        Max slots: 5
> HCI Event: Max Slots Change (0x1b) plen 3
        Handle: 256
        Max slots: 1
> HCI Event: Synchronous Connect Complete (0x2c) plen 17
        Status: Success (0x00)
        Handle: 257
        Address: 00:1B:DC:F2:21:59 (OUI 00-1B-DC)
        Link type: eSCO (0x02)
        Transmission interval: 0x06
        Retransmission window: 0x04
        RX packet length: 30
        TX packet length: 30
        Air mode: Transparent (0x03)

Signed-off-by: Hsin-Yu Chao <hychao@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-18 10:00:22 +02:00
Łukasz Rymanowski
49c06c9eb1 Bluetooth: Fix for GAP/SEC/SEM/BI-10-C
Security Mode 1 level 4, force us to use have key size 16 octects long.
This patch adds check for that.

This is required for the qualification test GAP/SEC/SEM/BI-10-C

Logs from test when ATT is configured with sec level BT_SECURITY_FIPS

< ACL Data TX: Handle 3585 flags 0x00 dlen 11                                                                       #28 [hci0] 3.785965
      SMP: Pairing Request (0x01) len 6
        IO capability: DisplayYesNo (0x01)
        OOB data: Authentication data not present (0x00)
        Authentication requirement: Bonding, MITM, SC, No Keypresses (0x0d)
        Max encryption key size: 16
        Initiator key distribution: EncKey Sign (0x05)
        Responder key distribution: EncKey IdKey Sign (0x07)
> ACL Data RX: Handle 3585 flags 0x02 dlen 11                                                                       #35 [hci0] 3.883020
      SMP: Pairing Response (0x02) len 6
        IO capability: DisplayYesNo (0x01)
        OOB data: Authentication data not present (0x00)
        Authentication requirement: Bonding, MITM, SC, No Keypresses (0x0d)
        Max encryption key size: 7
        Initiator key distribution: EncKey Sign (0x05)
        Responder key distribution: EncKey IdKey Sign (0x07)
< ACL Data TX: Handle 3585 flags 0x00 dlen 6                                                                        #36 [hci0] 3.883136
      SMP: Pairing Failed (0x05) len 1
        Reason: Encryption key size (0x06)

Signed-off-by: Łukasz Rymanowski <lukasz.rymanowski@codecoup.pl>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-18 09:58:53 +02:00
Xin Long
3ffb93ba32 esp4: improve xfrm4_beet_gso_segment() to be more readable
This patch is to improve the code to make xfrm4_beet_gso_segment()
more readable, and keep consistent with xfrm6_beet_gso_segment().

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-18 09:55:19 +02:00
John Hubbard
dbfe7d7437 rds: convert get_user_pages() --> pin_user_pages()
This code was using get_user_pages_fast(), in a "Case 2" scenario
(DMA/RDMA), using the categorization from [1]. That means that it's
time to convert the get_user_pages_fast() + put_page() calls to
pin_user_pages_fast() + unpin_user_pages() calls.

There is some helpful background in [2]: basically, this is a small
part of fixing a long-standing disconnect between pinning pages, and
file systems' use of those pages.

[1] Documentation/core-api/pin_user_pages.rst

[2] "Explicit pinning of user-space pages":
    https://lwn.net/Articles/807108/

Cc: David S. Miller <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: netdev@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: rds-devel@oss.oracle.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:37:45 -07:00
Florian Westphal
4930f4831b net: allow __skb_ext_alloc to sleep
mptcp calls this from the transmit side, from process context.
Allow a sleeping allocation instead of unconditional GFP_ATOMIC.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:35:34 -07:00
Florian Westphal
5c8264435d mptcp: remove inner wait loop from mptcp_sendmsg_frag
previous patches made sure we only call into this function
when these prerequisites are met, so no need to wait on the
subflow socket anymore.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/7
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:35:34 -07:00
Florian Westphal
17091708d1 mptcp: fill skb page frag cache outside of mptcp_sendmsg_frag
The mptcp_sendmsg_frag helper contains a loop that will wait on the
subflow sk.

It seems preferrable to only wait in mptcp_sendmsg() when blocking io is
requested.  mptcp_sendmsg already has such a wait loop that is used when
no subflow socket is available for transmission.

This is another preparation patch that makes sure we call
mptcp_sendmsg_frag only if the page frag cache has been refilled.

Followup patch will remove the wait loop from mptcp_sendmsg_frag().

The retransmit worker doesn't need to do this refill as it won't
transmit new mptcp-level data.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:35:34 -07:00
Florian Westphal
149f7c71e2 mptcp: fill skb extension cache outside of mptcp_sendmsg_frag
The mptcp_sendmsg_frag helper contains a loop that will wait on the
subflow sk.

It seems preferrable to only wait in mptcp_sendmsg() when blocking io is
requested.  mptcp_sendmsg already has such a wait loop that is used when
no subflow socket is available for transmission.

This is a preparation patch that makes sure we call
mptcp_sendmsg_frag only if a skb extension has been allocated.

Moreover, such allocation currently uses GFP_ATOMIC while it
could use sleeping allocation instead.

Followup patches will remove the wait loop from mptcp_sendmsg_frag()
and will allow to do a sleeping allocation for the extension.

Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:35:34 -07:00
Florian Westphal
72511aab95 mptcp: avoid blocking in tcp_sendpages
The transmit loop continues to xmit new data until an error is returned
or all data was transmitted.

For the blocking i/o case, this means that tcp_sendpages() may block on
the subflow until more space becomes available, i.e. we end up sleeping
with the mptcp socket lock held.

Instead we should check if a different subflow is ready to be used.

This restarts the subflow sk lookup when the tx operation succeeded
and the tcp subflow can't accept more data or if tcp_sendpages
indicates -EAGAIN on a blocking mptcp socket.

In that case we also need to set the NOSPACE bit to make sure we get
notified once memory becomes available.

In case all subflows are busy, the existing logic will wait until a
subflow is ready, releasing the mptcp socket lock while doing so.

The mptcp worker already sets DONTWAIT, so no need to make changes there.

v2:
 * set NOSPACE bit
 * add a comment to clarify that mptcp-sk sndbuf limits need to
   be checked as well.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:35:34 -07:00
Florian Westphal
fb529e62d3 mptcp: break and restart in case mptcp sndbuf is full
Its not enough to check for available tcp send space.

We also hold on to transmitted data for mptcp-level retransmits.
Right now we will send more and more data if the peer can ack data
at the tcp level fast enough, since that frees up tcp send buffer space.

But we also need to check that data was acked and reclaimed at the mptcp
level.

Therefore add needed check in mptcp_sendmsg, flush tcp data and
wait until more mptcp snd space becomes available if we are over the
limit.  Before we wait for more data, also make sure we start the
retransmit timer if we ran out of sndbuf space.

Otherwise there is a very small chance that we wait forever:

 * receiver is waiting for data
 * sender is blocked because mptcp socket buffer is full
 * at tcp level, all data was acked
 * mptcp-level snd_una was not updated, because last ack
   that acknowledged the last data packet carried an older
   MPTCP-ack.

Restarting the retransmit timer avoids this problem: if TCP
subflow is idle, data is retransmitted from the RTX queue.

New data will make the peer send a new, updated MPTCP-Ack.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:35:34 -07:00
Florian Westphal
a0e17064d4 mptcp: move common nospace-pattern to a helper
Paolo noticed that ssk_check_wmem() has same pattern, so add/use
common helper for both places.

Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:35:34 -07:00
Yuqi Jin
a6211caa63 net: revert "net: get rid of an signed integer overflow in ip_idents_reserve()"
Commit adb03115f4 ("net: get rid of an signed integer overflow in ip_idents_reserve()")
used atomic_cmpxchg to replace "atomic_add_return" inside the function
"ip_idents_reserve". The reason was to avoid UBSAN warning.
However, this change has caused performance degrade and in GCC-8,
fno-strict-overflow is now mapped to -fwrapv -fwrapv-pointer
and signed integer overflow is now undefined by default at all
optimization levels[1]. Moreover, it was a bug in UBSAN vs -fwrapv
/-fno-strict-overflow, so Let's revert it safely.

[1] https://gcc.gnu.org/gcc-8/changes.html

Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Arvind Sankar <nivedita@alum.mit.edu>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Jiong Wang <jiongwang@huawei.com>
Signed-off-by: Yuqi Jin <jinyuqi@huawei.com>
Signed-off-by: Shaokun Zhang <zhangshaokun@hisilicon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 12:33:51 -07:00
David Ahern
84be69b869 nexthop: Fix attribute checking for groups
For nexthop groups, attributes after NHA_GROUP_TYPE are invalid, but
nh_check_attr_group starts checking at NHA_GROUP. The group type defaults
to multipath and the NHA_GROUP_TYPE is currently optional so this has
slipped through so far. Fix the attribute checking to handle support of
new group types.

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: ASSOGBA Emery <assogba.emery@gmail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-17 10:41:24 -07:00
Masahiro Yamada
8a2cc0505c bpfilter: use 'userprogs' syntax to build bpfilter_umh
The user mode helper should be compiled for the same architecture as
the kernel.

This Makefile reused the 'hostprogs' syntax by overriding HOSTCC with CC.
Use the new syntax 'userprogs' to fix the Makefile mess.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Sam Ravnborg <sam@ravnborg.org>
2020-05-17 18:52:01 +09:00
Masahiro Yamada
b1183b6dca bpfilter: check if $(CC) can link static libc in Kconfig
On Fedora, linking static glibc requires the glibc-static RPM package,
which is not part of the glibc-devel package.

CONFIG_CC_CAN_LINK does not check the capability of static linking,
so you can enable CONFIG_BPFILTER_UMH, then fail to build:

    HOSTLD  net/bpfilter/bpfilter_umh
  /usr/bin/ld: cannot find -lc
  collect2: error: ld returned 1 exit status

Add CONFIG_CC_CAN_LINK_STATIC, and make CONFIG_BPFILTER_UMH depend
on it.

Reported-by: Valdis Kletnieks <valdis.kletnieks@vt.edu>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Alexei Starovoitov <ast@kernel.org>
2020-05-17 18:52:01 +09:00
Masahiro Yamada
9371f86ecb bpfilter: match bit size of bpfilter_umh to that of the kernel
bpfilter_umh is built for the default machine bit of the compiler,
which may not match to the bit size of the kernel.

This happens in the scenario below:

You can use biarch GCC that defaults to 64-bit for building the 32-bit
kernel. In this case, Kbuild passes -m32 to teach the compiler to
produce 32-bit kernel space objects. However, it is missing when
building bpfilter_umh. It is built as a 64-bit ELF, and then embedded
into the 32-bit kernel.

The 32-bit kernel and 64-bit umh is a bad combination.

In theory, we can have 32-bit umh running on 64-bit kernel, but we do
not have a good reason to support such a usecase.

The best is to match the bit size between them.

Pass -m32 or -m64 to the umh build command if it is found in
$(KBUILD_CFLAGS). Evaluate CC_CAN_LINK against the kernel bit-size.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-05-17 18:52:01 +09:00
Jakub Kicinski
75c36dbb1c ethtool: don't call set_channels in drivers if config didn't change
Don't call drivers if nothing changed. Netlink code already
contains this logic.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-16 13:56:30 -07:00
Jakub Kicinski
7be92514b9 ethtool: check if there is at least one channel for TX/RX in the core
Having a channel config with no ability to RX or TX traffic is
clearly wrong. Check for this in the core so the drivers don't
have to.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-16 13:56:30 -07:00
Christoph Paasch
a0c1d0eafd mptcp: Use 32-bit DATA_ACK when possible
RFC8684 allows to send 32-bit DATA_ACKs as long as the peer is not
sending 64-bit data-sequence numbers. The 64-bit DSN is only there for
extreme scenarios when a very high throughput subflow is combined with a
long-RTT subflow such that the high-throughput subflow wraps around the
32-bit sequence number space within an RTT of the high-RTT subflow.

It is thus a rare scenario and we should try to use the 32-bit DATA_ACK
instead as long as possible. It allows to reduce the TCP-option overhead
by 4 bytes, thus makes space for an additional SACK-block. It also makes
tcpdumps much easier to read when the DSN and DATA_ACK are both either
32 or 64-bit.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-16 13:51:10 -07:00
DENG Qingfang
5e5502e012 net: dsa: mt7530: fix roaming from DSA user ports
When a client moves from a DSA user port to a software port in a bridge,
it cannot reach any other clients that connected to the DSA user ports.
That is because SA learning on the CPU port is disabled, so the switch
ignores the client's frames from the CPU port and still thinks it is at
the user port.

Fix it by enabling SA learning on the CPU port.

To prevent the switch from learning from flooding frames from the CPU
port, set skb->offload_fwd_mark to 1 for unicast and broadcast frames,
and let the switch flood them instead of trapping to the CPU port.
Multicast frames still need to be trapped to the CPU port for snooping,
so set the SA_DIS bit of the MTK tag to 1 when transmitting those frames
to disable SA learning.

Fixes: b8f126a8d5 ("net-next: dsa: add dsa support for Mediatek MT7530 switch")
Signed-off-by: DENG Qingfang <dqfext@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-16 13:49:28 -07:00
Nicolas Dichtel
9efd6a3cec netns: enable to inherit devconf from current netns
The goal is to be able to inherit the initial devconf parameters from the
current netns, ie the netns where this new netns has been created.

This is useful in a containers environment where /proc/sys is read only.
For example, if a pod is created with specifics devconf parameters and has
the capability to create netns, the user expects to get the same parameters
than his 'init_net', which is not the real init_net in this case.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-16 13:46:37 -07:00
Madhuparna Bhowmik
b6dd5acde3 ipv6: Fix suspicious RCU usage warning in ip6mr
This patch fixes the following warning:

=============================
WARNING: suspicious RCU usage
5.7.0-rc4-next-20200507-syzkaller #0 Not tainted
-----------------------------
net/ipv6/ip6mr.c:124 RCU-list traversed in non-reader section!!

ipmr_new_table() returns an existing table, but there is no table at
init. Therefore the condition: either holding rtnl or the list is empty
is used.

Fixes: d1db275dd3 ("ipv6: ip6mr: support multiple tables")
Reported-by: kernel test robot <lkp@intel.com>
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-16 13:41:53 -07:00
Linus Torvalds
12bf0b632e NFS client bugfixes for Linux 5.7
Highlights include:
 
 Stable fixes:
 - nfs: fix NULL deference in nfs4_get_valid_delegation
 
 Bugfixes:
 - Fix corruption of the return value in cachefiles_read_or_alloc_pages()
 - Fix several fscache cookie issues
 - Fix a fscache queuing race that can trigger a BUG_ON
 - NFS: Fix 2 use-after-free regressions due to the RPC_TASK_CRED_NOREF flag
 - SUNRPC: Fix a use-after-free regression in rpc_free_client_work()
 - SUNRPC: Fix a race when tearing down the rpc client debugfs directory
 - SUNRPC: Signalled ASYNC tasks need to exit
 - NFSv3: fix rpc receive buffer size for MOUNT call
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl6/AkgACgkQZwvnipYK
 APICLQ/9ENY+mQmVMSbtw2VGlBphS+GLN44k70NgmjAaI9n8f3ILZyLl0/iHEPFU
 ZaCWhPWEs76olLfCWwoQFzISIUTHWVUow8mJn7gTh4mq8n9UFv8zgUxtJ3HTfEHt
 rtujG9xsK0Oa351UPJWE5yC7PvFzMohcc1vVD8WQeGJQ3sMBVHTuOuPatIv3vK6s
 8MflKTbtz/gUkcbWMCk9ljMPXr6/Ksgu9GZDnDFAYZBfFkwx//RNmq6K+z1Ru15s
 tkmPPZGNMCfKblrnUXUmPt78wxSExmWrXroSMNas2fyeOXgPL0ogNx8vfdFcFsxs
 sHpMkF97+npntNj1y5om6GrdU4SRYUFqXT7pNqV4wuGguOfFELIXrJIQuCDaKrGD
 ApEoo9UDisGCLdqs738ascZFHZiTQoy6drbpR8moalqhYkTI7Al/pPxHnozGDYsJ
 +wElaFXZX2hlPc6ih1q54RcB+D4qswDC9QudArKc9hJEKPv+SsmiVhBG/f+X+Jca
 M19UJGWZvRtY8L+0yJdG22O9Hwo0zSK917gtOZNgkwtkkKgzjj2kcNHTK2jGBEuR
 pqEIQCreUH8Le0WR9cPeJeYc2/HeCEWDHrf/q+gFRClaZMe+0Sfu3z3pBH3v0T9q
 qoZ1VvCDUhggsZyJZebcPTCL+ghqOXFJajHLlcA6BSdGJ/iXq2M=
 =lqFu
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.7-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable fixes:
   - nfs: fix NULL deference in nfs4_get_valid_delegation

  Bugfixes:
   - Fix corruption of the return value in cachefiles_read_or_alloc_pages()
   - Fix several fscache cookie issues
   - Fix a fscache queuing race that can trigger a BUG_ON
   - NFS: Fix two use-after-free regressions due to the RPC_TASK_CRED_NOREF flag
   - SUNRPC: Fix a use-after-free regression in rpc_free_client_work()
   - SUNRPC: Fix a race when tearing down the rpc client debugfs directory
   - SUNRPC: Signalled ASYNC tasks need to exit
   - NFSv3: fix rpc receive buffer size for MOUNT call"

* tag 'nfs-for-5.7-5' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFSv3: fix rpc receive buffer size for MOUNT call
  SUNRPC: 'Directory with parent 'rpc_clnt' already present!'
  NFS/pnfs: Don't use RPC_TASK_CRED_NOREF with pnfs
  NFS: Don't use RPC_TASK_CRED_NOREF with delegreturn
  SUNRPC: Signalled ASYNC tasks need to exit
  nfs: fix NULL deference in nfs4_get_valid_delegation
  SUNRPC: fix use-after-free in rpc_free_client_work()
  cachefiles: Fix race between read_waiter and read_copier involving op->to_do
  NFSv4: Fix fscache cookie aux_data to ensure change_attr is included
  NFS: Fix fscache super_cookie allocation
  NFS: Fix fscache super_cookie index_key from changing after umount
  cachefiles: Fix corruption of the return value in cachefiles_read_or_alloc_pages()
2020-05-15 14:03:13 -07:00
David S. Miller
da07f52d3c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Move the bpf verifier trace check into the new switch statement in
HEAD.

Resolve the overlapping changes in hinic, where bug fixes overlap
the addition of VF support.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 13:48:59 -07:00
Linus Torvalds
f85c1598dd Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix sk_psock reference count leak on receive, from Xiyu Yang.

 2) CONFIG_HNS should be invisible, from Geert Uytterhoeven.

 3) Don't allow locking route MTUs in ipv6, RFCs actually forbid this,
    from Maciej Żenczykowski.

 4) ipv4 route redirect backoff wasn't actually enforced, from Paolo
    Abeni.

 5) Fix netprio cgroup v2 leak, from Zefan Li.

 6) Fix infinite loop on rmmod in conntrack, from Florian Westphal.

 7) Fix tcp SO_RCVLOWAT hangs, from Eric Dumazet.

 8) Various bpf probe handling fixes, from Daniel Borkmann.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (68 commits)
  selftests: mptcp: pm: rm the right tmp file
  dpaa2-eth: properly handle buffer size restrictions
  bpf: Restrict bpf_trace_printk()'s %s usage and add %pks, %pus specifier
  bpf: Add bpf_probe_read_{user, kernel}_str() to do_refine_retval_range
  bpf: Restrict bpf_probe_read{, str}() only to archs where they work
  MAINTAINERS: Mark networking drivers as Maintained.
  ipmr: Add lockdep expression to ipmr_for_each_table macro
  ipmr: Fix RCU list debugging warning
  drivers: net: hamradio: Fix suspicious RCU usage warning in bpqether.c
  net: phy: broadcom: fix BCM54XX_SHD_SCR3_TRDDAPD value for BCM54810
  tcp: fix error recovery in tcp_zerocopy_receive()
  MAINTAINERS: Add Jakub to networking drivers.
  MAINTAINERS: another add of Karsten Graul for S390 networking
  drivers: ipa: fix typos for ipa_smp2p structure doc
  pppoe: only process PADT targeted at local interfaces
  selftests/bpf: Enforce returning 0 for fentry/fexit programs
  bpf: Enforce returning 0 for fentry/fexit progs
  net: stmmac: fix num_por initialization
  security: Fix the default value of secid_to_secctx hook
  libbpf: Fix register naming in PT_REGS s390 macros
  ...
2020-05-15 13:10:06 -07:00
Paolo Abeni
729cd6436f mptcp: cope better with MP_JOIN failure
Currently, on MP_JOIN failure we reset the child
socket, but leave the request socket untouched.

tcp_check_req will deal with it according to the
'tcp_abort_on_overflow' sysctl value - by default the
req socket will stay alive.

The above leads to inconsistent behavior on MP JOIN
failure, and bad listener overflow accounting.

This patch addresses the issue leveraging the infrastructure
just introduced to ask the TCP stack to drop the req on
failure.

The child socket is not freed anymore by subflow_syn_recv_sock(),
instead it's moved to a dead state and will be disposed by the
next sock_put done by the TCP stack, so that listener overflow
accounting is not affected by MP JOIN failure.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 12:30:13 -07:00
Paolo Abeni
2f8a397d0a inet_connection_sock: factor out destroy helper.
Move the steps to prepare an inet_connection_sock for
forced disposal inside a separate helper. No functional
changes inteded, this will just simplify the next patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 12:30:13 -07:00
Paolo Abeni
90bf45134d mptcp: add new sock flag to deal with join subflows
MP_JOIN subflows must not land into the accept queue.
Currently tcp_check_req() calls an mptcp specific helper
to detect such scenario.

Such helper leverages the subflow context to check for
MP_JOIN subflows. We need to deal also with MP JOIN
failures, even when the subflow context is not available
due allocation failure.

A possible solution would be changing the syn_recv_sock()
signature to allow returning a more descriptive action/
error code and deal with that in tcp_check_req().

Since the above need is MPTCP specific, this patch instead
uses a TCP request socket hole to add a MPTCP specific flag.
Such flag is used by the MPTCP syn_recv_sock() to tell
tcp_check_req() how to deal with the request socket.

This change is a no-op for !MPTCP build, and makes the
MPTCP code simpler. It allows also the next patch to deal
correctly with MP JOIN failure.

v1 -> v2:
 - be more conservative on drop_req initialization (Mat)

RFC -> v1:
 - move the drop_req bit inside tcp_request_sock (Eric)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 12:30:13 -07:00
David S. Miller
3430223d39 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-05-15

The following pull-request contains BPF updates for your *net-next* tree.

We've added 37 non-merge commits during the last 1 day(s) which contain
a total of 67 files changed, 741 insertions(+), 252 deletions(-).

The main changes are:

1) bpf_xdp_adjust_tail() now allows to grow the tail as well, from Jesper.

2) bpftool can probe CONFIG_HZ, from Daniel.

3) CAP_BPF is introduced to isolate user processes that use BPF infra and
   to secure BPF networking services by dropping CAP_SYS_ADMIN requirement
   in certain cases, from Alexei.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 10:43:52 -07:00
Vlad Buslov
0348451db9 net: sched: cls_flower: implement terse dump support
Implement tcf_proto_ops->terse_dump() callback for flower classifier. Only
dump handle, flags and action data in terse mode.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 10:23:11 -07:00
Vlad Buslov
ca44b738e5 net: sched: implement terse dump support in act
Extend tcf_action_dump() with boolean argument 'terse' that is used to
request terse-mode action dump. In terse mode only essential data needed to
identify particular action (action kind, cookie, etc.) and its stats is put
to resulting skb and everything else is omitted. Implement
tcf_exts_terse_dump() helper in cls API that is intended to be used to
request terse dump of all exts (actions) attached to the filter.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 10:23:11 -07:00
Vlad Buslov
f8ab1807a9 net: sched: introduce terse dump flag
Add new TCA_DUMP_FLAGS attribute and use it in cls API to request terse
filter output from classifiers with TCA_DUMP_FLAGS_TERSE flag. This option
is intended to be used to improve performance of TC filter dump when
userland only needs to obtain stats and not the whole classifier/action
data. Extend struct tcf_proto_ops with new terse_dump() callback that must
be defined by supporting classifier implementations.

Support of the options in specific classifiers and actions is
implemented in following patches in the series.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 10:23:11 -07:00
Tobias Waldekranz
2e186a2cf8 net: core: recursively find netdev by device node
The assumption that a device node is associated either with the
netdev's device, or the parent of that device, does not hold for all
drivers. E.g. Freescale's DPAA has two layers of platform devices
above the netdev. Instead, recursively walk up the tree from the
netdev, allowing any parent to match against the sought after node.

Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 10:18:49 -07:00
Alexei Starovoitov
2c78ee898d bpf: Implement CAP_BPF
Implement permissions as stated in uapi/linux/capability.h
In order to do that the verifier allow_ptr_leaks flag is split
into four flags and they are set as:
  env->allow_ptr_leaks = bpf_allow_ptr_leaks();
  env->bypass_spec_v1 = bpf_bypass_spec_v1();
  env->bypass_spec_v4 = bpf_bypass_spec_v4();
  env->bpf_capable = bpf_capable();

The first three currently equivalent to perfmon_capable(), since leaking kernel
pointers and reading kernel memory via side channel attacks is roughly
equivalent to reading kernel memory with cap_perfmon.

'bpf_capable' enables bounded loops, precision tracking, bpf to bpf calls and
other verifier features. 'allow_ptr_leaks' enable ptr leaks, ptr conversions,
subtraction of pointers. 'bypass_spec_v1' disables speculative analysis in the
verifier, run time mitigations in bpf array, and enables indirect variable
access in bpf programs. 'bypass_spec_v4' disables emission of sanitation code
by the verifier.

That means that the networking BPF program loaded with CAP_BPF + CAP_NET_ADMIN
will have speculative checks done by the verifier and other spectre mitigation
applied. Such networking BPF program will not be able to leak kernel pointers
and will not be able to access arbitrary kernel memory.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200513230355.7858-3-alexei.starovoitov@gmail.com
2020-05-15 17:29:41 +02:00
Jesper Dangaard Brouer
bc56c919fc bpf: Add xdp.frame_sz in bpf_prog_test_run_xdp().
Update the memory requirements, when adding xdp.frame_sz in BPF test_run
function bpf_prog_test_run_xdp() which e.g. is used by XDP selftests.

Specifically add the expected reserved tailroom, but also allocated a
larger memory area to reflect that XDP frames usually comes in this
format. Limit the provided packet data size to 4096 minus headroom +
tailroom, as this also reflect a common 3520 bytes MTU limit with XDP.

Note that bpf_test_init already use a memory allocation method that clears
memory.  Thus, this already guards against leaking uninit kernel memory.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158945349549.97035.15316291762482444006.stgit@firesoul
2020-05-14 21:21:56 -07:00
Jesper Dangaard Brouer
ddb47d518c xdp: Clear grow memory in bpf_xdp_adjust_tail()
Clearing memory of tail when grow happens, because it is too easy
to write a XDP_PASS program that extend the tail, which expose
this memory to users that can run tcpdump.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945349039.97035.5262100484553494.stgit@firesoul
2020-05-14 21:21:56 -07:00
Jesper Dangaard Brouer
c8741e2bfe xdp: Allow bpf_xdp_adjust_tail() to grow packet size
Finally, after all drivers have a frame size, allow BPF-helper
bpf_xdp_adjust_tail() to grow or extend packet size at frame tail.

Remember that helper/macro xdp_data_hard_end have reserved some
tailroom.  Thus, this helper makes sure that the BPF-prog don't have
access to this tailroom area.

V2: Remove one chicken check and use WARN_ONCE for other

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/158945348530.97035.12577148209134239291.stgit@firesoul
2020-05-14 21:21:56 -07:00
Jesper Dangaard Brouer
34cc0b338a xdp: Xdp_frame add member frame_sz and handle in convert_to_xdp_frame
Use hole in struct xdp_frame, when adding member frame_sz, which keeps
same sizeof struct (32 bytes)

Drivers ixgbe and sfc had bug cases where the necessary/expected
tailroom was not reserved. This can lead to some hard to catch memory
corruption issues. Having the drivers frame_sz this can be detected when
packet length/end via xdp->data_end exceed the xdp_data_hard_end
pointer, which accounts for the reserved the tailroom.

When detecting this driver issue, simply fail the conversion with NULL,
which results in feedback to driver (failing xdp_do_redirect()) causing
driver to drop packet. Given the lack of consistent XDP stats, this can
be hard to troubleshoot. And given this is a driver bug, we want to
generate some more noise in form of a WARN stack dump (to ID the driver
code that inlined convert_to_xdp_frame).

Inlining the WARN macro is problematic, because it adds an asm
instruction (on Intel CPUs ud2) what influence instruction cache
prefetching. Thus, introduce xdp_warn and macro XDP_WARN, to avoid this
and at the same time make identifying the function and line of this
inlined function easier.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945337313.97035.10015729316710496600.stgit@firesoul
2020-05-14 21:21:54 -07:00
Jesper Dangaard Brouer
a075767bbd net: XDP-generic determining XDP frame size
The SKB "head" pointer points to the data area that contains
skb_shared_info, that can be found via skb_end_pointer(). Given
xdp->data_hard_start have been established (basically pointing to
skb->head), frame size is between skb_end_pointer() and data_hard_start,
plus the size reserved to skb_shared_info.

Change the bpf_xdp_adjust_tail offset adjust of skb->len, to be a positive
offset number on grow, and negative number on shrink.  As this seems more
natural when reading the code.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945336804.97035.7164852191163722056.stgit@firesoul
2020-05-14 21:21:54 -07:00
David S. Miller
d00f26b623 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-05-14

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Merged tag 'perf-for-bpf-2020-05-06' from tip tree that includes CAP_PERFMON.

2) support for narrow loads in bpf_sock_addr progs and additional
   helpers in cg-skb progs, from Andrey.

3) bpf benchmark runner, from Andrii.

4) arm and riscv JIT optimizations, from Luke.

5) bpf iterator infrastructure, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14 20:31:21 -07:00
Andrey Ignatov
f307fa2cb4 bpf: Introduce bpf_sk_{, ancestor_}cgroup_id helpers
With having ability to lookup sockets in cgroup skb programs it becomes
useful to access cgroup id of retrieved sockets so that policies can be
implemented based on origin cgroup of such socket.

For example, a container running in a cgroup can have cgroup skb ingress
program that can lookup peer socket that is sending packets to a process
inside the container and decide whether those packets should be allowed
or denied based on cgroup id of the peer.

More specifically such ingress program can implement intra-host policy
"allow incoming packets only from this same container and not from any
other container on same host" w/o relying on source IP addresses since
quite often it can be the case that containers share same IP address on
the host.

Introduce two new helpers for this use-case: bpf_sk_cgroup_id() and
bpf_sk_ancestor_cgroup_id().

These helpers are similar to existing bpf_skb_{,ancestor_}cgroup_id
helpers with the only difference that sk is used to get cgroup id
instead of skb, and share code with them.

See documentation in UAPI for more details.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/f5884981249ce911f63e9b57ecd5d7d19154ff39.1589486450.git.rdna@fb.com
2020-05-14 18:41:07 -07:00
Andrey Ignatov
06d3e4c9f1 bpf: Allow skb_ancestor_cgroup_id helper in cgroup skb
cgroup skb programs already can use bpf_skb_cgroup_id. Allow
bpf_skb_ancestor_cgroup_id as well so that container policies can be
implemented for a container that can have sub-cgroups dynamically
created, but policies should still be implemented based on cgroup id of
container itself not on an id of a sub-cgroup.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/8874194d6041eba190356453ea9f6071edf5f658.1589486450.git.rdna@fb.com
2020-05-14 18:41:07 -07:00
Andrey Ignatov
d56c2f95ad bpf: Allow sk lookup helpers in cgroup skb
Currently sk lookup helpers are allowed in tc, xdp, sk skb, and cgroup
sock_addr programs.

But they would be useful in cgroup skb as well so that for example
cgroup skb ingress program can lookup a peer socket a packet comes from
on same host and make a decision whether to allow or deny this packet
based on the properties of that socket, e.g. cgroup that peer socket
belongs to.

Allow the following sk lookup helpers in cgroup skb:
* bpf_sk_lookup_tcp;
* bpf_sk_lookup_udp;
* bpf_sk_release;
* bpf_skc_lookup_tcp.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/f8c7ee280f1582b586629436d777b6db00597d63.1589486450.git.rdna@fb.com
2020-05-14 18:41:07 -07:00
Andrey Ignatov
7aebfa1b38 bpf: Support narrow loads from bpf_sock_addr.user_port
bpf_sock_addr.user_port supports only 4-byte load and it leads to ugly
code in BPF programs, like:

	volatile __u32 user_port = ctx->user_port;
	__u16 port = bpf_ntohs(user_port);

Since otherwise clang may optimize the load to be 2-byte and it's
rejected by verifier.

Add support for 1- and 2-byte loads same way as it's supported for other
fields in bpf_sock_addr like user_ip4, msg_src_ip4, etc.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/c1e983f4c17573032601d0b2b1f9d1274f24bc16.1589420814.git.rdna@fb.com
2020-05-14 18:30:57 -07:00
Amol Grover
7013908c2d ipmr: Add lockdep expression to ipmr_for_each_table macro
During the initialization process, ipmr_new_table() is called
to create new tables which in turn calls ipmr_get_table() which
traverses net->ipv4.mr_tables without holding the writer lock.
However, this is safe to do so as no tables exist at this time.
Hence add a suitable lockdep expression to silence the following
false-positive warning:

=============================
WARNING: suspicious RCU usage
5.7.0-rc3-next-20200428-syzkaller #0 Not tainted
-----------------------------
net/ipv4/ipmr.c:136 RCU-list traversed in non-reader section!!

ipmr_get_table+0x130/0x160 net/ipv4/ipmr.c:136
ipmr_new_table net/ipv4/ipmr.c:403 [inline]
ipmr_rules_init net/ipv4/ipmr.c:248 [inline]
ipmr_net_init+0x133/0x430 net/ipv4/ipmr.c:3089

Fixes: f0ad0860d0 ("ipv4: ipmr: support multiple tables")
Reported-by: syzbot+1519f497f2f9f08183c6@syzkaller.appspotmail.com
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14 18:01:07 -07:00
Amol Grover
a14fbcd4f1 ipmr: Fix RCU list debugging warning
ipmr_for_each_table() macro uses list_for_each_entry_rcu()
for traversing outside of an RCU read side critical section
but under the protection of rtnl_mutex. Hence, add the
corresponding lockdep expression to silence the following
false-positive warning at boot:

[    4.319347] =============================
[    4.319349] WARNING: suspicious RCU usage
[    4.319351] 5.5.4-stable #17 Tainted: G            E
[    4.319352] -----------------------------
[    4.319354] net/ipv4/ipmr.c:1757 RCU-list traversed in non-reader section!!

Fixes: f0ad0860d0 ("ipv4: ipmr: support multiple tables")
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14 18:01:07 -07:00
Jakub Kicinski
5a46b062e2 devlink: refactor end checks in devlink_nl_cmd_region_read_dumpit
Clean up after recent fixes, move address calculations
around and change the variable init, so that we can have
just one start_offset == end_offset check.

Make the check a little stricter to preserve the -EINVAL
error if requested start offset is larger than the region
itself.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14 17:36:25 -07:00
Eric Dumazet
e776af608f tcp: fix error recovery in tcp_zerocopy_receive()
If user provides wrong virtual address in TCP_ZEROCOPY_RECEIVE
operation we want to return -EINVAL error.

But depending on zc->recv_skip_hint content, we might return
-EIO error if the socket has SOCK_DONE set.

Make sure to return -EINVAL in this case.

BUG: KMSAN: uninit-value in tcp_zerocopy_receive net/ipv4/tcp.c:1833 [inline]
BUG: KMSAN: uninit-value in do_tcp_getsockopt+0x4494/0x6320 net/ipv4/tcp.c:3685
CPU: 1 PID: 625 Comm: syz-executor.0 Not tainted 5.7.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x220 lib/dump_stack.c:118
 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:121
 __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
 tcp_zerocopy_receive net/ipv4/tcp.c:1833 [inline]
 do_tcp_getsockopt+0x4494/0x6320 net/ipv4/tcp.c:3685
 tcp_getsockopt+0xf8/0x1f0 net/ipv4/tcp.c:3728
 sock_common_getsockopt+0x13f/0x180 net/core/sock.c:3131
 __sys_getsockopt+0x533/0x7b0 net/socket.c:2177
 __do_sys_getsockopt net/socket.c:2192 [inline]
 __se_sys_getsockopt+0xe1/0x100 net/socket.c:2189
 __x64_sys_getsockopt+0x62/0x80 net/socket.c:2189
 do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:297
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45c829
Code: 0d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f1deeb72c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000037
RAX: ffffffffffffffda RBX: 00000000004e01e0 RCX: 000000000045c829
RDX: 0000000000000023 RSI: 0000000000000006 RDI: 0000000000000009
RBP: 000000000078bf00 R08: 0000000020000200 R09: 0000000000000000
R10: 00000000200001c0 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000000001d8 R14: 00000000004d3038 R15: 00007f1deeb736d4

Local variable ----zc@do_tcp_getsockopt created at:
 do_tcp_getsockopt+0x1a74/0x6320 net/ipv4/tcp.c:3670
 do_tcp_getsockopt+0x1a74/0x6320 net/ipv4/tcp.c:3670

Fixes: 05255b823a ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14 15:12:08 -07:00
J. Bruce Fields
933496e9cc SUNRPC: 'Directory with parent 'rpc_clnt' already present!'
Each rpc_client has a cl_clid which is allocated from a global ida, and
a debugfs directory which is named after cl_clid.

We're releasing the cl_clid before we free the debugfs directory named
after it.  As soon as the cl_clid is released, that value is available
for another newly created client.

That leaves a window where another client may attempt to create a new
debugfs directory with the same name as the not-yet-deleted debugfs
directory from the dying client.  Symptoms are log messages like

	Directory 4 with parent 'rpc_clnt' already present!

Fixes: 7c4310ff56 "SUNRPC: defer slow parts of rpc_free_client() to a workqueue."
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-05-14 16:31:09 -04:00
David S. Miller
1b54f4fa4d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Fix gcc-10 compilation warning in nf_conntrack, from Arnd Bergmann.

2) Add NF_FLOW_HW_PENDING to avoid races between stats and deletion
   commands, from Paul Blakey.

3) Remove WQ_MEM_RECLAIM from the offload workqueue, from Roi Dayan.

4) Infinite loop when removing nf_conntrack module, from Florian Westphal.

5) Set NF_FLOW_TEARDOWN bit on expiration to avoid races when refreshing
   the timeout from the software path.

6) Missing nft_set_elem_expired() check in the rbtree, from Phil Sutter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14 13:15:02 -07:00
Xin Long
56b1b7c667 esp6: calculate transport_header correctly when sel.family != AF_INET6
In esp6_init_state() for beet mode when x->sel.family != AF_INET6:

  x->props.header_len = sizeof(struct ip_esp_hdr) +
     crypto_aead_ivsize(aead) + IPV4_BEET_PHMAXLEN +
     (sizeof(struct ipv6hdr) - sizeof(struct iphdr))

In xfrm6_beet_gso_segment() skb->transport_header is supposed to move
to the end of the ph header for IPPROTO_BEETPH, so if x->sel.family !=
AF_INET6 and it's IPPROTO_BEETPH, it should do:

   skb->transport_header -=
      (sizeof(struct ipv6hdr) - sizeof(struct iphdr));
   skb->transport_header += ph->hdrlen * 8;

And IPV4_BEET_PHMAXLEN is only reserved for PH header, so if
x->sel.family != AF_INET6 and it's not IPPROTO_BEETPH, it should do:

   skb->transport_header -=
      (sizeof(struct ipv6hdr) - sizeof(struct iphdr));
   skb->transport_header -= IPV4_BEET_PHMAXLEN;

Thanks Sabrina for looking deep into this issue.

Fixes: 7f9e40eb18 ("esp6: add gso_segment for esp6 beet mode")
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-14 12:13:05 +02:00
Christoph Hellwig
1b2f08df0a ipv6: set msg_control_is_user in do_ipv6_getsockopt
While do_ipv6_getsockopt does not call the high-level recvmsg helper,
the msghdr eventually ends up being passed to put_cmsg anyway, and thus
needs msg_control_is_user set to the proper value.

Fixes: 1f466e1f15 ("net: cleanly handle kernel vs user buffers for ->msg_control")
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-13 13:11:45 -07:00
Tuong Lien
88690b1079 tipc: fix failed service subscription deletion
When a service subscription is expired or canceled by user, it needs to
be deleted from the subscription list, so that new subscriptions can be
registered (max = 65535 per net). However, there are two issues in code
that can cause such an unused subscription to persist:

1) The 'tipc_conn_delete_sub()' has a loop on the subscription list but
it makes a break shortly when the 1st subscription differs from the one
specified, so the subscription will not be deleted.

2) In case a subscription is canceled, the code to remove the
'TIPC_SUB_CANCEL' flag from the subscription filter does not work if it
is a local subscription (i.e. the little endian isn't involved). So, it
will be no matches when looking for the subscription to delete later.

The subscription(s) will be removed eventually when the user terminates
its topology connection but that could be a long time later. Meanwhile,
the number of available subscriptions may be exhausted.

This commit fixes the two issues above, so as needed a subscription can
be deleted correctly.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-13 12:33:19 -07:00
Tuong Lien
0771d7df81 tipc: fix memory leak in service subscripting
Upon receipt of a service subscription request from user via a topology
connection, one 'sub' object will be allocated in kernel, so it will be
able to send an event of the service if any to the user correspondingly
then. Also, in case of any failure, the connection will be shutdown and
all the pertaining 'sub' objects will be freed.

However, there is a race condition as follows resulting in memory leak:

       receive-work       connection        send-work
              |                |                |
        sub-1 |<------//-------|                |
        sub-2 |<------//-------|                |
              |                |<---------------| evt for sub-x
        sub-3 |<------//-------|                |
              :                :                :
              :                :                :
              |       /--------|                |
              |       |        * peer closed    |
              |       |        |                |
              |       |        |<-------X-------| evt for sub-y
              |       |        |<===============|
        sub-n |<------/        X    shutdown    |
    -> orphan |                                 |

That is, the 'receive-work' may get the last subscription request while
the 'send-work' is shutting down the connection due to peer close.

We had a 'lock' on the connection, so the two actions cannot be carried
out simultaneously. If the last subscription is allocated e.g. 'sub-n',
before the 'send-work' closes the connection, there will be no issue at
all, the 'sub' objects will be freed. In contrast the last subscription
will become orphan since the connection was closed, and we released all
references.

This commit fixes the issue by simply adding one test if the connection
remains in 'connected' state right after we obtain the connection lock,
then a subscription object can be created as usual, otherwise we ignore
it.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Reported-by: Thang Ngo <thang.h.ngo@dektech.com.au>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-13 12:33:18 -07:00
Tuong Lien
c726858945 tipc: fix large latency in smart Nagle streaming
Currently when a connection is in Nagle mode, we set the 'ack_required'
bit in the last sending buffer and wait for the corresponding ACK prior
to pushing more data. However, on the receiving side, the ACK is issued
only when application really  reads the whole data. Even if part of the
last buffer is received, we will not do the ACK as required. This might
cause an unnecessary delay since the receiver does not always fetch the
message as fast as the sender, resulting in a large latency in the user
message sending, which is: [one RTT + the receiver processing time].

The commit makes Nagle ACK as soon as possible i.e. when a message with
the 'ack_required' arrives in the receiving side's stack even before it
is processed or put in the socket receive queue...
This way, we can limit the streaming latency to one RTT as committed in
Nagle mode.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-13 12:33:18 -07:00
Christoph Hellwig
6e8a4f9dda net: ignore sock_from_file errors in __scm_install_fd
The code had historically been ignoring these errors, and my recent
refactoring changed that, which broke ssh in some setups.

Fixes: 2618d530dd ("net/scm: cleanup scm_detach_fds")
Reported-by: Ido Schimmel <idosch@idosch.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-13 12:30:54 -07:00
Yonghong Song
3c32cc1bce bpf: Enable bpf_iter targets registering ctx argument types
Commit b121b341e5 ("bpf: Add PTR_TO_BTF_ID_OR_NULL
support") adds a field btf_id_or_null_non0_off to
bpf_prog->aux structure to indicate that the
first ctx argument is PTR_TO_BTF_ID reg_type and
all others are PTR_TO_BTF_ID_OR_NULL.
This approach does not really scale if we have
other different reg types in the future, e.g.,
a pointer to a buffer.

This patch enables bpf_iter targets registering ctx argument
reg types which may be different from the default one.
For example, for pointers to structures, the default reg_type
is PTR_TO_BTF_ID for tracing program. The target can register
a particular pointer type as PTR_TO_BTF_ID_OR_NULL which can
be used by the verifier to enforce accesses.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513180221.2949882-1-yhs@fb.com
2020-05-13 12:30:50 -07:00
Yonghong Song
ab2ee4fcb9 bpf: Change func bpf_iter_unreg_target() signature
Change func bpf_iter_unreg_target() parameter from target
name to target reg_info, similar to bpf_iter_reg_target().

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513180220.2949737-1-yhs@fb.com
2020-05-13 12:30:50 -07:00
Yonghong Song
15172a46fa bpf: net: Refactor bpf_iter target registration
Currently bpf_iter_reg_target takes parameters from target
and allocates memory to save them. This is really not
necessary, esp. in the future we may grow information
passed from targets to bpf_iter manager.

The patch refactors the code so target reg_info
becomes static and bpf_iter manager can just take
a reference to it.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200513180219.2949605-1-yhs@fb.com
2020-05-13 12:30:50 -07:00
David S. Miller
6cd35888a0 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2020-05-13

Here's a second attempt at a bluetooth-next pull request which
supercedes the one dated 2020-05-09. This should have the issues
discovered by Jakub fixed.

 - Add support for Intel Typhoon Peak device (8087:0032)
 - Add device tree bindings for Realtek RTL8723BS device
 - Add device tree bindings for Qualcomm QCA9377 device
 - Add support for experimental features configuration through mgmt
 - Add driver hook to prevent wake from suspend
 - Add support for waiting for L2CAP disconnection response
 - Multiple fixes & cleanups to the btbcm driver
 - Add support for LE scatternet topology for selected devices
 - A few other smaller fixes & cleanups

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-13 12:20:12 -07:00
Archie Pusaka
5b440676c1 Bluetooth: L2CAP: add support for waiting disconnection resp
Whenever we disconnect a L2CAP connection, we would immediately
report a disconnection event (EPOLLHUP) to the upper layer, without
waiting for the response of the other device.

This patch offers an option to wait until we receive a disconnection
response before reporting disconnection event, by using the "how"
parameter in l2cap_sock_shutdown(). Therefore, upper layer can opt
to wait for disconnection response by shutdown(sock, SHUT_WR).

This can be used to enforce proper disconnection order in HID,
where the disconnection of the interrupt channel must be complete
before attempting to disconnect the control channel.

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-13 10:03:51 +02:00
Sonny Sasaka
adf1d69264 Bluetooth: Handle Inquiry Cancel error after Inquiry Complete
After sending Inquiry Cancel command to the controller, it is possible
that Inquiry Complete event comes before Inquiry Cancel command complete
event. In this case the Inquiry Cancel command will have status of
Command Disallowed since there is no Inquiry session to be cancelled.
This case should not be treated as error, otherwise we can reach an
inconsistent state.

Example of a btmon trace when this happened:

< HCI Command: Inquiry Cancel (0x01|0x0002) plen 0
> HCI Event: Inquiry Complete (0x01) plen 1
        Status: Success (0x00)
> HCI Event: Command Complete (0x0e) plen 4
      Inquiry Cancel (0x01|0x0002) ncmd 1
        Status: Command Disallowed (0x0c)

Signed-off-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-13 09:35:17 +02:00
Abhishek Pandit-Subedi
81dafad53c Bluetooth: Add hook for driver to prevent wake from suspend
Let drivers have a hook to disable configuring scanning during suspend.
Drivers should use the device_may_wakeup function call to determine
whether hci should be configured for wakeup.

For example, an implementation for btusb may look like the following:

  bool btusb_prevent_wake(struct hci_dev *hdev)
  {
        struct btusb_data *data = hci_get_drvdata(hdev);
        return !device_may_wakeup(&data->udev->dev);
  }

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-13 09:12:04 +02:00
Abhishek Pandit-Subedi
0d2c9825e4 Bluetooth: Rename BT_SUSPEND_COMPLETE
Renamed BT_SUSPEND_COMPLETE to BT_SUSPEND_CONFIGURE_WAKE since it sets
up the event filter and whitelist for wake-up.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-13 09:12:04 +02:00
Abhishek Pandit-Subedi
91779665c1 Bluetooth: Modify LE window and interval for suspend
When a device is suspended, it doesn't need to be as responsive to
connection events. Increase the interval to 640ms (creating a duty cycle
of roughly 1.75%) so that passive scanning uses much less power (vs
previous duty cycle of 18.75%). The new window + interval combination
has been tested to work with HID devices (which are currently the only
devices capable of wake up).

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-13 08:50:56 +02:00
Abhishek Pandit-Subedi
aaebf8e608 Bluetooth: Fix incorrect type for window and interval
The types for window and interval should be uint16, not uint8.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-13 08:50:56 +02:00
Paolo Abeni
eead1c2ea2 netlabel: cope with NULL catmap
The cipso and calipso code can set the MLS_CAT attribute on
successful parsing, even if the corresponding catmap has
not been allocated, as per current configuration and external
input.

Later, selinux code tries to access the catmap if the MLS_CAT flag
is present via netlbl_catmap_getlong(). That may cause null ptr
dereference while processing incoming network traffic.

Address the issue setting the MLS_CAT flag only if the catmap is
really allocated. Additionally let netlbl_catmap_getlong() cope
with NULL catmap.

Reported-by: Matthew Sheets <matthew.sheets@gd-ms.com>
Fixes: 4b8feff251 ("netlabel: fix the horribly broken catmap functions")
Fixes: ceba1832b1 ("calipso: Set the calipso socket label to match the secattr.")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-12 18:12:40 -07:00
Vladimir Oltean
fb9f2e9286 net: dsa: tag_sja1105: appease sparse checks for ethertype accessors
A comparison between a value from the packet and an integer constant
value needs to be done by converting the value from the packet from
net->host, or the constant from host->net. Not the other way around.
Even though it makes no practical difference, correct that.

Fixes: 38b5beeae7 ("net: dsa: sja1105: prepare tagger for handling DSA tags and VLAN simultaneously")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-12 18:02:42 -07:00
William Tu
51fa960d3b erspan: Check IFLA_GRE_ERSPAN_VER is set.
Add a check to make sure the IFLA_GRE_ERSPAN_VER is provided by users.

Fixes: f989d546a2 ("erspan: Add type I version 0 support.")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: William Tu <u9012063@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-12 13:11:41 -07:00
Vladimir Oltean
84eeb5d460 net: dsa: tag_sja1105: implement sub-VLAN decoding
Create a subvlan_map as part of each port's tagger private structure.
This keeps reverse mappings of bridge-to-dsa_8021q VLAN retagging rules.

Note that as of this patch, this piece of code is never engaged, due to
the fact that the driver hasn't installed any retagging rule, so we'll
always see packets with a subvlan code of 0 (untagged).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-12 13:08:08 -07:00
Vladimir Oltean
3eaae1d05f net: dsa: tag_8021q: support up to 8 VLANs per port using sub-VLANs
For switches that support VLAN retagging, such as sja1105, we extend
dsa_8021q by encoding a "sub-VLAN" into the remaining 3 free bits in the
dsa_8021q tag.

A sub-VLAN is nothing more than a number in the range 0-7, which serves
as an index into a per-port driver lookup table. The sub-VLAN value of
zero means that traffic is untagged (this is also backwards-compatible
with dsa_8021q without retagging).

The switch should be configured to retag VLAN-tagged traffic that gets
transmitted towards the CPU port (and towards the CPU only). Example:

bridge vlan add dev sw1p0 vid 100

The switch retags frames received on port 0, going to the CPU, and
having VID 100, to the VID of 1104 (0x0450). In dsa_8021q language:

 | 11  | 10  |  9  |  8  |  7  |  6  |  5  |  4  |  3  |  2  |  1  |  0  |
 +-----------+-----+-----------------+-----------+-----------------------+
 |    DIR    | SVL |    SWITCH_ID    |  SUBVLAN  |          PORT         |
 +-----------+-----+-----------------+-----------+-----------------------+

0x0450 means:
 - DIR = 0b01: this is an RX VLAN
 - SUBVLAN = 0b001: this is subvlan #1
 - SWITCH_ID = 0b001: this is switch 1 (see the name "sw1p0")
 - PORT = 0b0000: this is port 0 (see the name "sw1p0")

The driver also remembers the "1 -> 100" mapping. In the hotpath, if the
sub-VLAN from the tag encodes a non-untagged frame, this mapping is used
to create a VLAN hwaccel tag, with the value of 100.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-12 13:08:08 -07:00
Vladimir Oltean
38b5beeae7 net: dsa: sja1105: prepare tagger for handling DSA tags and VLAN simultaneously
In VLAN-unaware mode, sja1105 uses VLAN tags with a custom TPID of
0xdadb. While in the yet-to-be introduced best_effort_vlan_filtering
mode, it needs to work with normal VLAN TPID values.

A complication arises when we must transmit a VLAN-tagged packet to the
switch when it's in VLAN-aware mode. We need to construct a packet with
2 VLAN tags, and the switch will use the outer header for routing and
pop it on egress. But sadly, here the 2 hardware generations don't
behave the same:

- E/T switches won't pop an ETH_P_8021AD tag on egress, it seems
  (packets will remain double-tagged).
- P/Q/R/S switches will drop a packet with 2 ETH_P_8021Q tags (it looks
  like it tries to prevent VLAN hopping).

But looks like the reverse is also true:

- E/T switches have no problem popping the outer tag from packets with
  2 ETH_P_8021Q tags.
- P/Q/R/S will have no problem popping a single tag even if that is
  ETH_P_8021AD.

So it is clear that if we want the hardware to work with dsa_8021q
tagging in VLAN-aware mode, we need to send different TPIDs depending on
revision. Keep that information in priv->info->qinq_tpid.

The per-port tagger structure will hold an xmit_tpid value that depends
not only upon the qinq_tpid, but also upon the VLAN awareness state
itself (in case we must transmit using 0xdadb).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-12 13:08:08 -07:00
Vladimir Oltean
ec5ae61076 net: dsa: sja1105: save/restore VLANs using a delta commit method
Managing the VLAN table that is present in hardware will become very
difficult once we add a third operating state
(best_effort_vlan_filtering). That is because correct cleanup (not too
little, not too much) becomes virtually impossible, when VLANs can be
added from the bridge layer, from dsa_8021q for basic tagging, for
cross-chip bridging, as well as retagging rules for sub-VLANs and
cross-chip sub-VLANs. So we need to rethink VLAN interaction with the
switch in a more scalable way.

In preparation for that, use the priv->expect_dsa_8021q boolean to
classify any VLAN request received through .port_vlan_add or
.port_vlan_del towards either one of 2 internal lists: bridge VLANs and
dsa_8021q VLANs.

Then, implement a central sja1105_build_vlan_table method that creates a
VLAN configuration from scratch based on the 2 lists of VLANs kept by
the driver, and based on the VLAN awareness state. Currently, if we are
VLAN-unaware, install the dsa_8021q VLANs, otherwise the bridge VLANs.

Then, implement a delta commit procedure that identifies which VLANs
from this new configuration are actually different from the config
previously committed to hardware. We apply the delta through the dynamic
configuration interface (we don't reset the switch). The result is that
the hardware should see the exact sequence of operations as before this
patch.

This also helps remove the "br" argument passed to
dsa_8021q_crosschip_bridge_join, which it was only using to figure out
whether it should commit the configuration back to us or not, based on
the VLAN awareness state of the bridge. We can simplify that, by always
allowing those VLANs inside of our dsa_8021q_vlans list, and committing
those to hardware when necessary.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-12 13:08:08 -07:00
Vladimir Oltean
1f66b0f0ae net: dsa: tag_8021q: introduce a vid_is_dsa_8021q helper
This function returns a boolean denoting whether the VLAN passed as
argument is part of the 1024-3071 range that the dsa_8021q tagging
scheme uses.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-12 13:08:07 -07:00
Russell King
54a0ed0df4 net: dsa: provide an option for drivers to always receive bridge VLANs
DSA assumes that a bridge which has vlan filtering disabled is not
vlan aware, and ignores all vlan configuration. However, the kernel
software bridge code allows configuration in this state.

This causes the kernel's idea of the bridge vlan state and the
hardware state to disagree, so "bridge vlan show" indicates a correct
configuration but the hardware lacks all configuration. Even worse,
enabling vlan filtering on a DSA bridge immediately blocks all traffic
which, given the output of "bridge vlan show", is very confusing.

Provide an option that drivers can set to indicate they want to receive
vlan configuration even when vlan filtering is disabled. At the very
least, this is safe for Marvell DSA bridges, which do not look up
ingress traffic in the VTU if the port is in 8021Q disabled state. It is
also safe for the Ocelot switch family. Whether this change is suitable
for all DSA bridges is not known.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-12 13:08:07 -07:00
Eric Dumazet
24adbc1676 tcp: fix SO_RCVLOWAT hangs with fat skbs
We autotune rcvbuf whenever SO_RCVLOWAT is set to account for 100%
overhead in tcp_set_rcvlowat()

This works well when skb->len/skb->truesize ratio is bigger than 0.5

But if we receive packets with small MSS, we can end up in a situation
where not enough bytes are available in the receive queue to satisfy
RCVLOWAT setting.
As our sk_rcvbuf limit is hit, we send zero windows in ACK packets,
preventing remote peer from sending more data.

Even autotuning does not help, because it only triggers at the time
user process drains the queue. If no EPOLLIN is generated, this
can not happen.

Note poll() has a similar issue, after commit
c7004482e8 ("tcp: Respect SO_RCVLOWAT in tcp_poll().")

Fixes: 03f45c883c ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-12 12:49:47 -07:00
Christoph Paasch
64d950ae0b mptcp: Initialize map_seq upon subflow establishment
When the other MPTCP-peer uses 32-bit data-sequence numbers, we rely on
map_seq to indicate how to expand to a 64-bit data-sequence number in
expand_seq() when receiving data.

For new subflows, this field is not initialized, thus results in an
"invalid" mapping being discarded.

Fix this by initializing map_seq upon subflow establishment time.

Fixes: f296234c98 ("mptcp: Add handling of incoming MP_JOIN requests")
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-12 12:08:22 -07:00
Phil Sutter
340eaff651 netfilter: nft_set_rbtree: Add missing expired checks
Expired intervals would still match and be dumped to user space until
garbage collection wiped them out. Make sure they stop matching and
disappear (from users' perspective) as soon as they expire.

Fixes: 8d8540c4f5 ("netfilter: nft_set_rbtree: add timeout support")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-12 13:19:34 +02:00
Pablo Neira Ayuso
9ed81c8e0d netfilter: flowtable: set NF_FLOW_TEARDOWN flag on entry expiration
If the flow timer expires, the gc sets on the NF_FLOW_TEARDOWN flag.
Otherwise, the flowtable software path might race to refresh the
timeout, leaving the state machine in inconsistent state.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Reported-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-12 13:19:08 +02:00
Xiongfeng Wang
746c6237ec sunrpc: add missing newline when printing parameter 'pool_mode' by sysfs
When I cat parameter '/sys/module/sunrpc/parameters/pool_mode', it
displays as follows. It is better to add a newline for easy reading.

[root@hulk-202 ~]# cat /sys/module/sunrpc/parameters/pool_mode
global[root@hulk-202 ~]#

Signed-off-by: Xiongfeng Wang <wangxiongfeng2@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-05-11 20:20:40 -04:00
Christoph Hellwig
1f466e1f15 net: cleanly handle kernel vs user buffers for ->msg_control
The msg_control field in struct msghdr can either contain a user
pointer when used with the recvmsg system call, or a kernel pointer
when used with sendmsg.  To complicate things further kernel_recvmsg
can stuff a kernel pointer in and then use set_fs to make the uaccess
helpers accept it.

Replace it with a union of a kernel pointer msg_control field, and
a user pointer msg_control_user one, and allow kernel_recvmsg operate
on a proper kernel pointer using a bitfield to override the normal
choice of a user pointer for recvmsg.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-11 16:59:16 -07:00
Christoph Hellwig
2618d530dd net/scm: cleanup scm_detach_fds
Factor out two helpes to keep the code tidy.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-11 16:59:16 -07:00
Christoph Hellwig
0462b6bdb6 net: add a CMSG_USER_DATA macro
Add a variant of CMSG_DATA that operates on user pointer to avoid
sparse warnings about casting to/from user pointers.  Also fix up
CMSG_DATA to rely on the gcc extension that allows void pointer
arithmetics to cut down on the amount of casts.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-11 16:59:16 -07:00
Florian Fainelli
097f024454 net: dsa: tag_sja1105: Constify dsa_device_ops
sja1105_netdev_ops should be const since that is what the DSA layer
expects.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-11 16:50:45 -07:00
Florian Fainelli
2fa3888bb7 net: dsa: ocelot: Constify dsa_device_ops
ocelot_netdev_ops should be const since that is what the DSA layer
expects.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-11 16:50:45 -07:00
Linus Torvalds
152036d137 Fixes:
- Resolve a data integrity problem with NFSD that I inadvertently
 introduced last year. The change I made makes the NFS server's
 duplicate reply cache ineffective when krb5i or krb5p are in use,
 thus allowing the replay of non-idempotent NFS requests such as
 RENAME, SETATTR, or even WRITEs.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJerCuyAAoJEDNqszNvZn+XxvAQAJmUW5412OO7mkI2IW5PDP71
 ZnBAuTs4UpLBgp1VpS3ai0LYnOX9o8WLqolzGuFxGfK69ZZdh7U7fzX2aEytoTSP
 KkW3dNo+NzRppWOhMBEfMBLnAu22YF+F689RvwEqd0C1AgGugaFfzlF1ECrJVpA7
 g1WVhTi0ihfArhzSWTWO4LiuwjRd5TNF8gEci2j3DuHn1Hp6BagbKOv0rFdgK99X
 BbK8IaEalBUjtpGAPgRU/WY/WznzhgARVeOX7Rh/P/zFdFB1G1M4kycaadBk6uaU
 SHbdWBwDsYatDNuhZUI3Wv2g+DQ5LJRrjNNesLRot+kC3XD12sBCMsSI3owoz7Jt
 u0s48YmOJO8uWi4kDenR9XV8bAaDmX7R/+XGZm1lethNrpBKat9EIrqSHNvqAXZ4
 b3cC8/A/aCcOrWXtZnWqvJdqjx2EgL6DbcpaFheaPEekRofuiyOaAbXdlJQvzcwY
 Sv4EC4ymABpQRg0si+Sya5Int7bZ9ryLZTSCMiLA+L1TnoW26XjMlGAaRqYi7Tx7
 Qg4Bt400IIDE0FlE/76vE7b7YWQj7GfErA6moIyDio5AInRU9sHDFyB8iCfdpKxh
 ajNl1NuEO/FSoXOGQvOo1uHD0vKvNVK21T6vQsRCT1f6JXtpiwTn6eLX4Wn9YLdI
 iKqg2YXfdCbJnAuoxzGi
 =hT3x
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.7-rc-2' of git://git.linux-nfs.org/projects/cel/cel-2.6

Pull nfsd fixes from Chuck Lever:
 "Resolve a data integrity problem with NFSD that I inadvertently
  introduced last year.

  The change I made makes the NFS server's duplicate reply cache
  ineffective when krb5i or krb5p are in use, thus allowing the replay
  of non-idempotent NFS requests such as RENAME, SETATTR, or even
  WRITEs"

* tag 'nfsd-5.7-rc-2' of git://git.linux-nfs.org/projects/cel/cel-2.6:
  SUNRPC: Revert 241b1f419f ("SUNRPC: Remove xdr_buf_trim()")
  SUNRPC: Fix GSS privacy computation of auth->au_ralign
  SUNRPC: Add "@len" parameter to gss_unwrap()
2020-05-11 12:04:52 -07:00
Chuck Lever
ce99aa62e1 SUNRPC: Signalled ASYNC tasks need to exit
Ensure that signalled ASYNC rpc_tasks exit immediately instead of
spinning until a timeout (or forever).

To avoid checking for the signal flag on every scheduler iteration,
the check is instead introduced in the client's finite state
machine.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Fixes: ae67bd3821 ("SUNRPC: Fix up task signalling")
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-05-11 14:06:50 -04:00
Florian Westphal
54ab49fde9 netfilter: conntrack: fix infinite loop on rmmod
'rmmod nf_conntrack' can hang forever, because the netns exit
gets stuck in nf_conntrack_cleanup_net_list():

i_see_dead_people:
 busy = 0;
 list_for_each_entry(net, net_exit_list, exit_list) {
  nf_ct_iterate_cleanup(kill_all, net, 0, 0);
  if (atomic_read(&net->ct.count) != 0)
   busy = 1;
 }
 if (busy) {
  schedule();
  goto i_see_dead_people;
 }

When nf_ct_iterate_cleanup iterates the conntrack table, all nf_conn
structures can be found twice:
once for the original tuple and once for the conntracks reply tuple.

get_next_corpse() only calls the iterator when the entry is
in original direction -- the idea was to avoid unneeded invocations
of the iterator callback.

When support for clashing entries was added, the assumption that
all nf_conn objects are added twice, once in original, once for reply
tuple no longer holds -- NF_CLASH_BIT entries are only added in
the non-clashing reply direction.

Thus, if at least one NF_CLASH entry is in the list then
nf_conntrack_cleanup_net_list() always skips it completely.

During normal netns destruction, this causes a hang of several
seconds, until the gc worker removes the entry (NF_CLASH entries
always have a 1 second timeout).

But in the rmmod case, the gc worker has already been stopped, so
ct.count never becomes 0.

We can fix this in two ways:

1. Add a second test for CLASH_BIT and call iterator for those
   entries as well, or:
2. Skip the original tuple direction and use the reply tuple.

2) is simpler, so do that.

Fixes: 6a757c07e5 ("netfilter: conntrack: allow insertion of clashing entries")
Reported-by: Chen Yi <yiche@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-11 17:46:24 +02:00
Roi Dayan
1d10da0eb0 netfilter: flowtable: Remove WQ_MEM_RECLAIM from workqueue
This workqueue is in charge of handling offloaded flow tasks like
add/del/stats we should not use WQ_MEM_RECLAIM flag.
The flag can result in the following warning.

[  485.557189] ------------[ cut here ]------------
[  485.562976] workqueue: WQ_MEM_RECLAIM nf_flow_table_offload:flow_offload_worr
[  485.562985] WARNING: CPU: 7 PID: 3731 at kernel/workqueue.c:2610 check_flush0
[  485.590191] Kernel panic - not syncing: panic_on_warn set ...
[  485.597100] CPU: 7 PID: 3731 Comm: kworker/u112:8 Not tainted 5.7.0-rc1.21802
[  485.606629] Hardware name: Dell Inc. PowerEdge R730/072T6D, BIOS 2.4.3 01/177
[  485.615487] Workqueue: nf_flow_table_offload flow_offload_work_handler [nf_f]
[  485.624834] Call Trace:
[  485.628077]  dump_stack+0x50/0x70
[  485.632280]  panic+0xfb/0x2d7
[  485.636083]  ? check_flush_dependency+0x110/0x130
[  485.641830]  __warn.cold.12+0x20/0x2a
[  485.646405]  ? check_flush_dependency+0x110/0x130
[  485.652154]  ? check_flush_dependency+0x110/0x130
[  485.657900]  report_bug+0xb8/0x100
[  485.662187]  ? sched_clock_cpu+0xc/0xb0
[  485.666974]  do_error_trap+0x9f/0xc0
[  485.671464]  do_invalid_op+0x36/0x40
[  485.675950]  ? check_flush_dependency+0x110/0x130
[  485.681699]  invalid_op+0x28/0x30

Fixes: 7da182a998 ("netfilter: flowtable: Use work entry per offload command")
Reported-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-11 17:45:59 +02:00
David Howells
c410bf0193 rxrpc: Fix the excessive initial retransmission timeout
rxrpc currently uses a fixed 4s retransmission timeout until the RTT is
sufficiently sampled.  This can cause problems with some fileservers with
calls to the cache manager in the afs filesystem being dropped from the
fileserver because a packet goes missing and the retransmission timeout is
greater than the call expiry timeout.

Fix this by:

 (1) Copying the RTT/RTO calculation code from Linux's TCP implementation
     and altering it to fit rxrpc.

 (2) Altering the various users of the RTT to make use of the new SRTT
     value.

 (3) Replacing the use of rxrpc_resend_timeout to use the calculated RTO
     value instead (which is needed in jiffies), along with a backoff.

Notes:

 (1) rxrpc provides RTT samples by matching the serial numbers on outgoing
     DATA packets that have the RXRPC_REQUEST_ACK set and PING ACK packets
     against the reference serial number in incoming REQUESTED ACK and
     PING-RESPONSE ACK packets.

 (2) Each packet that is transmitted on an rxrpc connection gets a new
     per-connection serial number, even for retransmissions, so an ACK can
     be cross-referenced to a specific trigger packet.  This allows RTT
     information to be drawn from retransmitted DATA packets also.

 (3) rxrpc maintains the RTT/RTO state on the rxrpc_peer record rather than
     on an rxrpc_call because many RPC calls won't live long enough to
     generate more than one sample.

 (4) The calculated SRTT value is in units of 8ths of a microsecond rather
     than nanoseconds.

The (S)RTT and RTO values are displayed in /proc/net/rxrpc/peers.

Fixes: 17926a7932 ([AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both"")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-11 16:42:28 +01:00
Paul Blakey
2c8897953f netfilter: flowtable: Add pending bit for offload work
Gc step can queue offloaded flow del work or stats work.
Those work items can race each other and a flow could be freed
before the stats work is executed and querying it.
To avoid that, add a pending bit that if a work exists for a flow
don't queue another work for it.
This will also avoid adding multiple stats works in case stats work
didn't complete but gc step started again.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-11 16:26:33 +02:00
Florian Westphal
7d4343d501 xfrm: fix unused variable warning if CONFIG_NETFILTER=n
After recent change 'x' is only used when CONFIG_NETFILTER is set:

net/ipv4/xfrm4_output.c: In function '__xfrm4_output':
net/ipv4/xfrm4_output.c:19:21: warning: unused variable 'x' [-Wunused-variable]
   19 |  struct xfrm_state *x = skb_dst(skb)->xfrm;

Expand the CONFIG_NETFILTER scope to avoid this.

Fixes: 2ab6096db2 ("xfrm: remove output_finish indirection from xfrm_state_afinfo")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-11 15:12:27 +02:00
Marcel Holtmann
e625e50cee Bluetooth: Introduce debug feature when dynamic debug is disabled
In case dynamic debug is disabled, this feature allows a vendor platform
to provide debug statement printing.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-05-11 12:16:27 +02:00
Marcel Holtmann
a10c907ce0 Bluetooth: Add support for experimental features configuration
To enable platform specific experimental features, introduce this new set of
management commands and events.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-05-11 12:13:38 +02:00
Marcel Holtmann
568602457c Bluetooth: Replace BT_DBG with bt_dev_dbg for security manager support
The security manager operates on a specific controller and thus use
bt_dev_dbg to indetify the controller for each debug message.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-05-11 12:13:38 +02:00
Marcel Holtmann
d5cc6626b3 Bluetooth: Introduce HCI_MGMT_HDEV_OPTIONAL option
When setting HCI_MGMT_HDEV_OPTIONAL it is possible to target a specific
conntroller or a global interface.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-05-11 12:13:38 +02:00
Marcel Holtmann
181d695352 Bluetooth: Replace BT_DBG with bt_dev_dbg for management support
The majority of management interaction are based on a controller index
and have a hci_dev associated with it. So use bt_dev_dbg to have a clean
way of indentifying the controller the debug message belongs to.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-05-11 12:13:38 +02:00
Marcel Holtmann
5f4b91728b Bluetooth: Add MGMT_EV_PHY_CONFIGURATION_CHANGED to supported list
The event MGMT_EV_PHY_CONFIGURATION_CHANGED wasn't listed in the list of
supported events. So add it.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-05-11 12:13:38 +02:00
Konstantin Forostyan
69d67b461a Bluetooth: L2CAP: Fix errors during L2CAP_CREDIT_BASED_CONNECTION_REQ (0x17)
Fix 2 typos in L2CAP_CREDIT_BASED_CONNECTION_REQ (0x17) handling function, that
cause BlueZ answer with L2CAP_CR_LE_INVALID_PARAMS or L2CAP_CR_LE_INVALID_SCID
error on a correct ECRED connection request.

Enchanced Credit Based Mode support was recently introduced with the commit
15f02b9105 ("Bluetooth: L2CAP: Add initial code
for Enhanced Credit Based Mode").

Signed-off-by: Konstantin Forostyan <konstantin.forostyan@peiker-cee.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-11 12:13:38 +02:00
Tedd Ho-Jeong An
eaa7b7228f Bluetooth: Fix advertising handle is set to 0
This patch fix the advertising handle is set to 0 regardless of actual
instance value. The affected commands are LE Set Advertising Set Random
Address, LE Set Extended Advertising Data, and LE Set Extended Scan
Response Data commands.

Signed-off-by: Tedd Ho-Jeong An <tedd.an@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-11 12:13:38 +02:00
Vladimir Oltean
ac02a451a6 net: dsa: sja1105: implement cross-chip bridging operations
sja1105 uses dsa_8021q for DSA tagging, a format which is VLAN at heart
and which is compatible with cascading. A complete description of this
tagging format is in net/dsa/tag_8021q.c, but a quick summary is that
each external-facing port tags incoming frames with a unique pvid, and
this special VLAN is transmitted as tagged towards the inside of the
system, and as untagged towards the exterior. The tag encodes the switch
id and the source port index.

This means that cross-chip bridging for dsa_8021q only entails adding
the dsa_8021q pvids of one switch to the RX filter of the other
switches. Everything else falls naturally into place, as long as the
bottom-end of ports (the leaves in the tree) is comprised exclusively of
dsa_8021q-compatible (i.e. sja1105 switches). Otherwise, there would be
a chance that a front-panel switch transmits a packet tagged with a
dsa_8021q header, header which it wouldn't be able to remove, and which
would hence "leak" out.

The only use case I tested (due to lack of board availability) was when
the sja1105 switches are part of disjoint trees (however, this doesn't
change the fact that multiple sja1105 switches still need unique switch
identifiers in such a system). But in principle, even "true" single-tree
setups (with DSA links) should work just as fine, except for a small
change which I can't test: dsa_towards_port should be used instead of
dsa_upstream_port (I made the assumption that the routing port that any
sja1105 should use towards its neighbours is the CPU port. That might
not hold true in other setups).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-10 19:52:33 -07:00
Vladimir Oltean
3b7bc1f091 net: dsa: introduce a dsa_switch_find function
Somewhat similar to dsa_tree_find, dsa_switch_find returns a dsa_switch
structure pointer by searching for its tree index and switch index (the
parameters from dsa,member). To be used, for example, by drivers who
implement .crosschip_bridge_join and need a reference to the other
switch indicated to by the tree_index and sw_index arguments.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-10 19:52:33 -07:00
Vladimir Oltean
f66a6a69f9 net: dsa: permit cross-chip bridging between all trees in the system
One way of utilizing DSA is by cascading switches which do not all have
compatible taggers. Consider the following real-life topology:

      +---------------------------------------------------------------+
      | LS1028A                                                       |
      |               +------------------------------+                |
      |               |      DSA master for Felix    |                |
      |               |(internal ENETC port 2: eno2))|                |
      |  +------------+------------------------------+-------------+  |
      |  | Felix embedded L2 switch                                |  |
      |  |                                                         |  |
      |  | +--------------+   +--------------+   +--------------+  |  |
      |  | |DSA master for|   |DSA master for|   |DSA master for|  |  |
      |  | |  SJA1105 1   |   |  SJA1105 2   |   |  SJA1105 3   |  |  |
      |  | |(Felix port 1)|   |(Felix port 2)|   |(Felix port 3)|  |  |
      +--+-+--------------+---+--------------+---+--------------+--+--+

+-----------------------+ +-----------------------+ +-----------------------+
|   SJA1105 switch 1    | |   SJA1105 switch 2    | |   SJA1105 switch 3    |
+-----+-----+-----+-----+ +-----+-----+-----+-----+ +-----+-----+-----+-----+
|sw1p0|sw1p1|sw1p2|sw1p3| |sw2p0|sw2p1|sw2p2|sw2p3| |sw3p0|sw3p1|sw3p2|sw3p3|
+-----+-----+-----+-----+ +-----+-----+-----+-----+ +-----+-----+-----+-----+

The above can be described in the device tree as follows (obviously not
complete):

mscc_felix {
	dsa,member = <0 0>;
	ports {
		port@4 {
			ethernet = <&enetc_port2>;
		};
	};
};

sja1105_switch1 {
	dsa,member = <1 1>;
	ports {
		port@4 {
			ethernet = <&mscc_felix_port1>;
		};
	};
};

sja1105_switch2 {
	dsa,member = <2 2>;
	ports {
		port@4 {
			ethernet = <&mscc_felix_port2>;
		};
	};
};

sja1105_switch3 {
	dsa,member = <3 3>;
	ports {
		port@4 {
			ethernet = <&mscc_felix_port3>;
		};
	};
};

Basically we instantiate one DSA switch tree for every hardware switch
in the system, but we still give them globally unique switch IDs (will
come back to that later). Having 3 disjoint switch trees makes the
tagger drivers "just work", because net devices are registered for the
3 Felix DSA master ports, and they are also DSA slave ports to the ENETC
port. So packets received on the ENETC port are stripped of their
stacked DSA tags one by one.

Currently, hardware bridging between ports on the same sja1105 chip is
possible, but switching between sja1105 ports on different chips is
handled by the software bridge. This is fine, but we can do better.

In fact, the dsa_8021q tag used by sja1105 is compatible with cascading.
In other words, a sja1105 switch can correctly parse and route a packet
containing a dsa_8021q tag. So if we could enable hardware bridging on
the Felix DSA master ports, cross-chip bridging could be completely
offloaded.

Such as system would be used as follows:

ip link add dev br0 type bridge && ip link set dev br0 up
for port in sw0p0 sw0p1 sw0p2 sw0p3 \
	    sw1p0 sw1p1 sw1p2 sw1p3 \
	    sw2p0 sw2p1 sw2p2 sw2p3; do
	ip link set dev $port master br0
done

The above makes switching between ports on the same row be performed in
hardware, and between ports on different rows in software. Now assume
the Felix switch ports are called swp0, swp1, swp2. By running the
following extra commands:

ip link add dev br1 type bridge && ip link set dev br1 up
for port in swp0 swp1 swp2; do
	ip link set dev $port master br1
done

the CPU no longer sees packets which traverse sja1105 switch boundaries
and can be forwarded directly by Felix. The br1 bridge would not be used
for any sort of traffic termination.

For this to work, we need to give drivers an opportunity to listen for
bridging events on DSA trees other than their own, and pass that other
tree index as argument. I have made the assumption, for the moment, that
the other existing DSA notifiers don't need to be broadcast to other
trees. That assumption might turn out to be incorrect. But in the
meantime, introduce a dsa_broadcast function, similar in purpose to
dsa_port_notify, which is used only by the bridging notifiers.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-10 19:52:33 -07:00
Vladimir Oltean
9eb8eff0cf net: bridge: allow enslaving some DSA master network devices
Commit 8db0a2ee2c ("net: bridge: reject DSA-enabled master netdevices
as bridge members") added a special check in br_if.c in order to check
for a DSA master network device with a tagging protocol configured. This
was done because back then, such devices, once enslaved in a bridge
would become inoperative and would not pass DSA tagged traffic anymore
due to br_handle_frame returning RX_HANDLER_CONSUMED.

But right now we have valid use cases which do require bridging of DSA
masters. One such example is when the DSA master ports are DSA switch
ports themselves (in a disjoint tree setup). This should be completely
equivalent, functionally speaking, from having multiple DSA switches
hanging off of the ports of a switchdev driver. So we should allow the
enslaving of DSA tagged master network devices.

Instead of the regular br_handle_frame(), install a new function
br_handle_frame_dummy() on these DSA masters, which returns
RX_HANDLER_PASS in order to call into the DSA specific tagging protocol
handlers, and lift the restriction from br_add_if.

Suggested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-10 19:52:33 -07:00
NeilBrown
31e9a7f353 SUNRPC: fix use-after-free in rpc_free_client_work()
Parts of rpc_free_client() were recently moved to
a separate rpc_free_clent_work().  This introduced
a use-after-free as rpc_clnt_remove_pipedir() calls
rpc_net_ns(), and that uses clnt->cl_xprt which has already
been freed.
So move the call to xprt_put() after the call to
rpc_clnt_remove_pipedir().

Reported-by: syzbot+22b5ef302c7c40d94ea8@syzkaller.appspotmail.com
Fixes: 7c4310ff56 ("SUNRPC: defer slow parts of rpc_free_client() to a workqueue.")
Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-05-10 19:44:56 -04:00
Arnd Bergmann
2c407aca64 netfilter: conntrack: avoid gcc-10 zero-length-bounds warning
gcc-10 warns around a suspicious access to an empty struct member:

net/netfilter/nf_conntrack_core.c: In function '__nf_conntrack_alloc':
net/netfilter/nf_conntrack_core.c:1522:9: warning: array subscript 0 is outside the bounds of an interior zero-length array 'u8[0]' {aka 'unsigned char[0]'} [-Wzero-length-bounds]
 1522 |  memset(&ct->__nfct_init_offset[0], 0,
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from net/netfilter/nf_conntrack_core.c:37:
include/net/netfilter/nf_conntrack.h:90:5: note: while referencing '__nfct_init_offset'
   90 |  u8 __nfct_init_offset[0];
      |     ^~~~~~~~~~~~~~~~~~

The code is correct but a bit unusual. Rework it slightly in a way that
does not trigger the warning, using an empty struct instead of an empty
array. There are probably more elegant ways to do this, but this is the
smallest change.

Fixes: c41884ce05 ("netfilter: conntrack: avoid zeroing timer")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-10 23:20:24 +02:00
Andrew Lunn
9896a4574e net: phy: Send notifier when starting the cable test
Given that it takes time to run a cable test, send a notify message at
the start, as well as when it is completed.

v3:
EMSGSIZE when ethnl_bcastmsg_put() fails
Print an error message on failure, since this is a void function.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-10 12:28:41 -07:00
Andrew Lunn
1e2dc14509 net: ethtool: Add helpers for reporting test results
The PHY drivers can use these helpers for reporting the results. The
results get translated into netlink attributes which are added to the
pre-allocated skbuf.

v3:
Poison phydev->skb
Return -EMSGSIZE when ethnl_bcastmsg_put() fails
Return valid error code when nla_nest_start() fails
Use u8 for results
Actually put u32 length into message

v4:
s/ENOTSUPP/EOPNOTSUPP/g

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-10 12:28:41 -07:00
Andrew Lunn
1dd3f212af net: ethtool: Add infrastructure for reporting cable test results
Provide infrastructure for PHY drivers to report the cable test
results.  A netlink skb is associated to the phydev. Helpers will be
added which can add results to this skb. Once the test has finished
the results are sent to user space.

When netlink ethtool is not part of the kernel configuration stubs are
provided. It is also impossible to trigger a cable test, so the error
code returned by the alloc function is of no consequence.

v2:
Include the status complete in the netlink notification message

v4:
Replace -EINVAL with -EMSGSIZE

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-10 12:28:41 -07:00
Andrew Lunn
0df960f14e net: ethtool: Make helpers public
Make some helpers for building ethtool netlink messages available
outside the compilation unit, so they can be used for building
messages which are not simple get/set.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-10 12:28:41 -07:00
Andrew Lunn
11ca3c4261 net: ethtool: netlink: Add support for triggering a cable test
Add new ethtool netlink calls to trigger the starting of a PHY cable
test.

Add Kconfig'ury to ETHTOOL_NETLINK so that PHYLIB is not a module when
ETHTOOL_NETLINK is builtin, which would result in kernel linking errors.

v2:
Remove unwanted white space change
Remove ethnl_cable_test_act_ops and use doit handler
Rename cable_test_set_policy cable_test_act_policy
Remove ETHTOOL_MSG_CABLE_TEST_ACT_REPLY

v3:
Remove ETHTOOL_MSG_CABLE_TEST_ACT_REPLY from documentation
Remove unused cable_test_get_policy
Add Reviewed-by tags

v4:
Remove unwanted blank line

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-10 12:28:41 -07:00
Zefan Li
090e28b229 netprio_cgroup: Fix unlimited memory leak of v2 cgroups
If systemd is configured to use hybrid mode which enables the use of
both cgroup v1 and v2, systemd will create new cgroup on both the default
root (v2) and netprio_cgroup hierarchy (v1) for a new session and attach
task to the two cgroups. If the task does some network thing then the v2
cgroup can never be freed after the session exited.

One of our machines ran into OOM due to this memory leak.

In the scenario described above when sk_alloc() is called
cgroup_sk_alloc() thought it's in v2 mode, so it stores
the cgroup pointer in sk->sk_cgrp_data and increments
the cgroup refcnt, but then sock_update_netprioidx()
thought it's in v1 mode, so it stores netprioidx value
in sk->sk_cgrp_data, so the cgroup refcnt will never be freed.

Currently we do the mode switch when someone writes to the ifpriomap
cgroup control file. The easiest fix is to also do the switch when
a task is attached to a new cgroup.

Fixes: bd1060a1d6 ("sock, cgroup: add sock->sk_cgroup")
Reported-by: Yang Yingliang <yangyingliang@huawei.com>
Tested-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Zefan Li <lizefan@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-09 20:59:21 -07:00
Yonghong Song
138d0be35b net: bpf: Add netlink and ipv6_route bpf_iter targets
This patch added netlink and ipv6_route targets, using
the same seq_ops (except show() and minor changes for stop())
for /proc/net/{netlink,ipv6_route}.

The net namespace for these targets are the current net
namespace at file open stage, similar to
/proc/net/{netlink,ipv6_route} reference counting
the net namespace at seq_file open stage.

Since module is not supported for now, ipv6_route is
supported only if the IPV6 is built-in, i.e., not compiled
as a module. The restriction can be lifted once module
is properly supported for bpf_iter.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200509175910.2476329-1-yhs@fb.com
2020-05-09 17:05:26 -07:00
Christian Brauner
f2a8d52e0a
nsproxy: add struct nsset
Add a simple struct nsset. It holds all necessary pieces to switch to a new
set of namespaces without leaving a task in a half-switched state which we
will make use of in the next patch. This patch switches the existing setns
logic over without causing a change in setns() behavior. This brings
setns() closer to how unshare() works(). The prepare_ns() function is
responsible to prepare all necessary information. This has two reasons.
First it minimizes dependencies between individual namespaces, i.e. all
install handler can expect that all fields are properly initialized
independent in what order they are called in. Second, this makes the code
easier to maintain and easier to follow if it needs to be changed.

The prepare_ns() helper will only be switched over to use a flags argument
in the next patch. Here it will still use nstype as a simple integer
argument which was argued would be clearer. I'm not particularly
opinionated about this if it really helps or not. The struct nsset itself
already contains the flags field since its name already indicates that it
can contain information required by different namespaces. None of this
should have functional consequences.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Reviewed-by: Serge Hallyn <serge@hallyn.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
Cc: Serge Hallyn <serge@hallyn.com>
Cc: Jann Horn <jannh@google.com>
Cc: Michael Kerrisk <mtk.manpages@gmail.com>
Cc: Aleksa Sarai <cyphar@cyphar.com>
Link: https://lore.kernel.org/r/20200505140432.181565-2-christian.brauner@ubuntu.com
2020-05-09 13:57:12 +02:00
Saeed Mahameed
76cd622fe2 Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
This merge includes updates to bonding driver needed for the rdma stack,
to avoid conflicts with the RDMA branch.

Maor Gottlieb Says:

====================
Bonding: Add support to get xmit slave

The following series adds support to get the LAG master xmit slave by
introducing new .ndo - ndo_get_xmit_slave. Every LAG module can
implement it and it first implemented in the bond driver.
This is follow-up to the RFC discussion [1].

The main motivation for doing this is for drivers that offload part
of the LAG functionality. For example, Mellanox Connect-X hardware
implements RoCE LAG which selects the TX affinity when the resources
are created and port is remapped when it goes down.

The first part of this patchset introduces the new .ndo and add the
support to the bonding module.

The second part adds support to get the RoCE LAG xmit slave by building
skb of the RoCE packet based on the AH attributes and call to the new
.ndo.

The third part change the mlx5 driver driver to set the QP's affinity
port according to the slave which found by the .ndo.
====================

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-09 01:05:30 -07:00
Paolo Abeni
57644431a6 net: ipv4: really enforce backoff for redirects
In commit b406472b5a ("net: ipv4: avoid mixed n_redirects and
rate_tokens usage") I missed the fact that a 0 'rate_tokens' will
bypass the backoff algorithm.

Since rate_tokens is cleared after a redirect silence, and never
incremented on redirects, if the host keeps receiving packets
requiring redirect it will reply ignoring the backoff.

Additionally, the 'rate_last' field will be updated with the
cadence of the ingress packet requiring redirect. If that rate is
high enough, that will prevent the host from generating any
other kind of ICMP messages

The check for a zero 'rate_tokens' value was likely a shortcut
to avoid the more complex backoff algorithm after a redirect
silence period. Address the issue checking for 'n_redirects'
instead, which is incremented on successful redirect, and
does not interfere with other ICMP replies.

Fixes: b406472b5a ("net: ipv4: avoid mixed n_redirects and rate_tokens usage")
Reported-and-tested-by: Colin Walters <walters@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-08 22:52:57 -07:00
Eric Dumazet
d8882935fc ipv6: use DST_NOCOUNT in ip6_rt_pcpu_alloc()
We currently have to adjust ipv6 route gc_thresh/max_size depending
on number of cpus on a server, this makes very little sense.

If the kernels sets /proc/sys/net/ipv6/route/gc_thresh to 1024
and /proc/sys/net/ipv6/route/max_size to 4096, then we better
not track the percpu dst that our implementation uses.

Only routes not added (directly or indirectly) by the admin
should be tracked and limited.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: David Ahern <dsahern@kernel.org>
Cc: Maciej Żenczykowski <maze@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-08 22:34:17 -07:00
Yang Yingliang
3712c1c2ef ieee802154: 6lowpan: remove unnecessary comparison
The type of dispatch is u8 which is always '<=' 0xff, so the
dispatch <= 0xff is always true, we can remove this comparison.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-08 22:25:10 -07:00
Eric Dumazet
cf86a086a1 net/dst: use a smaller percpu_counter batch for dst entries accounting
percpu_counter_add() uses a default batch size which is quite big
on platforms with 256 cpus. (2*256 -> 512)

This means dst_entries_get_fast() can be off by +/- 2*(nr_cpus^2)
(131072 on servers with 256 cpus)

Reduce the batch size to something more reasonable, and
add logic to ip6_dst_gc() to call dst_entries_get_slow()
before calling the _very_ expensive fib6_run_gc() function.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-08 21:33:33 -07:00
Jakub Kicinski
14d8f7486a Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-05-09

The following pull-request contains BPF updates for your *net* tree.

We've added 4 non-merge commits during the last 9 day(s) which contain
a total of 4 files changed, 11 insertions(+), 6 deletions(-).

The main changes are:

1) Fix msg_pop_data() helper incorrectly setting an sge length in some
   cases as well as fixing bpf_tcp_ingress() wrongly accounting bytes
   in sg.size, from John Fastabend.

2) Fix to return an -EFAULT error when copy_to_user() of the value
   fails in map_lookup_and_delete_elem(), from Wei Yongjun.

3) Fix sk_psock refcnt leak in tcp_bpf_recvmsg(), from Xiyu Yang.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-08 18:58:39 -07:00
Kelly Littlepage
cc4de047b3 net: tcp: fix rx timestamp behavior for tcp_recvmsg
The stated intent of the original commit is to is to "return the timestamp
corresponding to the highest sequence number data returned." The current
implementation returns the timestamp for the last byte of the last fully
read skb, which is not necessarily the last byte in the recv buffer. This
patch converts behavior to the original definition, and to the behavior of
the previous draft versions of commit 98aaa913b4 ("tcp: Extend
SOF_TIMESTAMPING_RX_SOFTWARE to TCP recvmsg") which also match this
behavior.

Fixes: 98aaa913b4 ("tcp: Extend SOF_TIMESTAMPING_RX_SOFTWARE to TCP recvmsg")
Co-developed-by: Iris Liu <iris@onechronos.com>
Signed-off-by: Iris Liu <iris@onechronos.com>
Signed-off-by: Kelly Littlepage <kelly@onechronos.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-08 16:09:22 -07:00
Stanislav Fomichev
8086fbaf49 bpf: Allow any port in bpf_bind helper
We want to have a tighter control on what ports we bind to in
the BPF_CGROUP_INET{4,6}_CONNECT hooks even if it means
connect() becomes slightly more expensive. The expensive part
comes from the fact that we now need to call inet_csk_get_port()
that verifies that the port is not used and allocates an entry
in the hash table for it.

Since we can't rely on "snum || !bind_address_no_port" to prevent
us from calling POST_BIND hook anymore, let's add another bind flag
to indicate that the call site is BPF program.

v5:
* fix wrong AF_INET (should be AF_INET6) in the bpf program for v6

v3:
* More bpf_bind documentation refinements (Martin KaFai Lau)
* Add UDP tests as well (Martin KaFai Lau)
* Don't start the thread, just do socket+bind+listen (Martin KaFai Lau)

v2:
* Update documentation (Andrey Ignatov)
* Pass BIND_FORCE_ADDRESS_NO_PORT conditionally (Andrey Ignatov)

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200508174611.228805-5-sdf@google.com
2020-05-09 00:48:20 +02:00
Stanislav Fomichev
cb0721c7e2 net: Refactor arguments of inet{,6}_bind
The intent is to add an additional bind parameter in the next commit.
Instead of adding another argument, let's convert all existing
flag arguments into an extendable bit field.

No functional changes.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200508174611.228805-4-sdf@google.com
2020-05-09 00:48:20 +02:00
Eric Biggers
2aaba014b5 crypto: lib/sha1 - remove unnecessary includes of linux/cryptohash.h
<linux/cryptohash.h> sounds very generic and important, like it's the
header to include if you're doing cryptographic hashing in the kernel.
But actually it only includes the library implementation of the SHA-1
compression function (not even the full SHA-1).  This should basically
never be used anymore; SHA-1 is no longer considered secure, and there
are much better ways to do cryptographic hashing in the kernel.

Most files that include this header don't actually need it.  So in
preparation for removing it, remove all these unneeded includes of it.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-05-08 15:32:17 +10:00
Eric Biggers
6b0b0fa2bc crypto: lib/sha1 - rename "sha" to "sha1"
The library implementation of the SHA-1 compression function is
confusingly called just "sha_transform()".  Alongside it are some "SHA_"
constants and "sha_init()".  Presumably these are left over from a time
when SHA just meant SHA-1.  But now there are also SHA-2 and SHA-3, and
moreover SHA-1 is now considered insecure and thus shouldn't be used.

Therefore, rename these functions and constants to make it very clear
that they are for SHA-1.  Also add a comment to make it clear that these
shouldn't be used.

For the extra-misleadingly named "SHA_MESSAGE_BYTES", rename it to
SHA1_BLOCK_SIZE and define it to just '64' rather than '(512/8)' so that
it matches the same definition in <crypto/sha.h>.  This prepares for
merging <linux/cryptohash.h> into <crypto/sha.h>.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-05-08 15:32:17 +10:00
Eric Biggers
ac0ad93df7 mptcp: use SHA256_BLOCK_SIZE, not SHA_MESSAGE_BYTES
In preparation for naming the SHA-1 stuff in <linux/cryptohash.h>
properly and moving it to a more appropriate header, fix the HMAC-SHA256
code in mptcp_crypto_hmac_sha() to use SHA256_BLOCK_SIZE instead of
"SHA_MESSAGE_BYTES" which is actually the SHA-1 block size.
(Fortunately these are both 64 bytes, so this wasn't a "real" bug...)

Cc: Paolo Abeni <pabeni@redhat.com>
Cc: mptcp@lists.01.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-05-08 15:32:16 +10:00
Eric Biggers
75b93c6354 sctp: use crypto_shash_tfm_digest()
Instead of manually allocating a 'struct shash_desc' on the stack and
calling crypto_shash_digest(), switch to using the new helper function
crypto_shash_tfm_digest() which does this for us.

Cc: linux-sctp@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-05-08 15:32:15 +10:00
Eric Biggers
ec0bf6edc4 Bluetooth: use crypto_shash_tfm_digest()
Instead of manually allocating a 'struct shash_desc' on the stack and
calling crypto_shash_digest(), switch to using the new helper function
crypto_shash_tfm_digest() which does this for us.

Cc: linux-bluetooth@vger.kernel.org
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-05-08 15:32:15 +10:00
Cong Wang
dd912306ff net: fix a potential recursive NETDEV_FEAT_CHANGE
syzbot managed to trigger a recursive NETDEV_FEAT_CHANGE event
between bonding master and slave. I managed to find a reproducer
for this:

  ip li set bond0 up
  ifenslave bond0 eth0
  brctl addbr br0
  ethtool -K eth0 lro off
  brctl addif br0 bond0
  ip li set br0 up

When a NETDEV_FEAT_CHANGE event is triggered on a bonding slave,
it captures this and calls bond_compute_features() to fixup its
master's and other slaves' features. However, when syncing with
its lower devices by netdev_sync_lower_features() this event is
triggered again on slaves when the LRO feature fails to change,
so it goes back and forth recursively until the kernel stack is
exhausted.

Commit 17b85d29e8 intentionally lets __netdev_update_features()
return -1 for such a failure case, so we have to just rely on
the existing check inside netdev_sync_lower_features() and skip
NETDEV_FEAT_CHANGE event only for this specific failure case.

Fixes: fd867d51f8 ("net/core: generic support for disabling netdev features down stack")
Reported-by: syzbot+e73ceacfd8560cc8a3ca@syzkaller.appspotmail.com
Reported-by: syzbot+c2fb6f9ddcea95ba49b5@syzkaller.appspotmail.com
Cc: Jarod Wilson <jarod@redhat.com>
Cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Jann Horn <jannh@google.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 18:18:36 -07:00
Eric Dumazet
790709f249 net: relax SO_TXTIME CAP_NET_ADMIN check
Now sch_fq has horizon feature, we want to allow QUIC/UDP applications
to use EDT model so that pacing can be offloaded to the kernel (sch_fq)
or the NIC.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 18:17:32 -07:00
Paolo Abeni
7d14b0d2b9 mptcp: set correct vfs info for subflows
When a subflow is created via mptcp_subflow_create_socket(),
a new 'struct socket' is allocated, with a new i_ino value.

When inspecting TCP sockets via the procfs and or the diag
interface, the above ones are not related to the process owning
the MPTCP master socket, even if they are a logical part of it
('ss -p' shows an empty process field)

Additionally, subflows created by the path manager get
the uid/gid from the running workqueue.

Subflows are part of the owning MPTCP master socket, let's
adjust the vfs info to reflect this.

After this patch, 'ss' correctly displays subflows as belonging
to the msk socket creator.

Fixes: 2303f994b3 ("mptcp: Associate MPTCP context with TCP socket")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 18:16:05 -07:00
Eric Dumazet
f78ed2204d netpoll: accept NULL np argument in netpoll_send_skb()
netpoll_send_skb() callers seem to leak skb if
the np pointer is NULL. While this should not happen, we
can make the code more robust.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 18:11:07 -07:00
Eric Dumazet
1ddabdfaf7 netpoll: netpoll_send_skb() returns transmit status
Some callers want to know if the packet has been sent or
dropped, to inform upper stacks.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 18:11:07 -07:00
Eric Dumazet
fb1eee476b netpoll: move netpoll_send_skb() out of line
There is no need to inline this helper, as we intend to add more
code in this function.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 18:11:07 -07:00
Eric Dumazet
307f660d05 netpoll: remove dev argument from netpoll_send_skb_on_dev()
netpoll_send_skb_on_dev() can get the device pointer directly from np->dev

Rename it to __netpoll_send_skb()

Following patch will move netpoll_send_skb() out-of-line.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 18:11:07 -07:00
YueHaibing
ca7e3edc22 net/smc: remove set but not used variables 'del_llc, del_llc_resp'
Fixes gcc '-Wunused-but-set-variable' warning:

net/smc/smc_llc.c: In function 'smc_llc_cli_conf_link':
net/smc/smc_llc.c:753:31: warning:
 variable 'del_llc' set but not used [-Wunused-but-set-variable]
  struct smc_llc_msg_del_link *del_llc;
                               ^
net/smc/smc_llc.c: In function 'smc_llc_process_srv_delete_link':
net/smc/smc_llc.c:1311:33: warning:
 variable 'del_llc_resp' set but not used [-Wunused-but-set-variable]
    struct smc_llc_msg_del_link *del_llc_resp;
                                 ^

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 18:05:07 -07:00
zhang kai
636ef28d6e tcp: tcp_mark_head_lost is only valid for sack-tcp
so tcp_is_sack/reno checks are removed from tcp_mark_head_lost.

Signed-off-by: zhang kai <zhangkaiheb@126.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 17:58:23 -07:00
Jacob Keller
c75a33c84b net: remove newlines in NL_SET_ERR_MSG_MOD
The NL_SET_ERR_MSG_MOD macro is used to report a string describing an
error message to userspace via the netlink extended ACK structure. It
should not have a trailing newline.

Add a cocci script which catches cases where the newline marker is
present. Using this script, fix the handful of cases which accidentally
included a trailing new line.

I couldn't figure out a way to get a patch mode working, so this script
only implements context, report, and org.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Andy Whitcroft <apw@canonical.com>
Cc: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 17:56:14 -07:00
Taehee Yoo
f96e87178b hsr: remove WARN_ONCE() in hsr_fill_frame_info()
When VLAN frame is being sent, hsr calls WARN_ONCE() because hsr doesn't
support VLAN. But using WARN_ONCE() is overdoing.
Using netdev_warn_once() is enough.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 17:40:02 -07:00
Vladimir Oltean
e1eea81120 net: dsa: introduce a dsa_port_from_netdev public helper
As its implementation shows, this is synonimous with calling
dsa_slave_dev_check followed by dsa_slave_to_port, so it is quite simple
already and provides functionality which is already there.

However there is now a need for these functions outside dsa_priv.h, for
example in drivers that perform mirroring and redirection through
tc-flower offloads (they are given raw access to the flow_cls_offload
structure), where they need to call this function on act->dev.

But simply exporting dsa_slave_to_port would make it non-inline and
would result in an extra function call in the hotpath, as can be seen
for example in sja1105:

Before:

000006dc <sja1105_xmit>:
{
 6dc:	e92d4ff0 	push	{r4, r5, r6, r7, r8, r9, sl, fp, lr}
 6e0:	e1a04000 	mov	r4, r0
 6e4:	e591958c 	ldr	r9, [r1, #1420]	; 0x58c <- Inline dsa_slave_to_port
 6e8:	e1a05001 	mov	r5, r1
 6ec:	e24dd004 	sub	sp, sp, #4
	u16 tx_vid = dsa_8021q_tx_vid(dp->ds, dp->index);
 6f0:	e1c901d8 	ldrd	r0, [r9, #24]
 6f4:	ebfffffe 	bl	0 <dsa_8021q_tx_vid>
			6f4: R_ARM_CALL	dsa_8021q_tx_vid
	u8 pcp = netdev_txq_to_tc(netdev, queue_mapping);
 6f8:	e1d416b0 	ldrh	r1, [r4, #96]	; 0x60
	u16 tx_vid = dsa_8021q_tx_vid(dp->ds, dp->index);
 6fc:	e1a08000 	mov	r8, r0

After:

000006e4 <sja1105_xmit>:
{
 6e4:	e92d4ff0 	push	{r4, r5, r6, r7, r8, r9, sl, fp, lr}
 6e8:	e1a04000 	mov	r4, r0
 6ec:	e24dd004 	sub	sp, sp, #4
	struct dsa_port *dp = dsa_slave_to_port(netdev);
 6f0:	e1a00001 	mov	r0, r1
{
 6f4:	e1a05001 	mov	r5, r1
	struct dsa_port *dp = dsa_slave_to_port(netdev);
 6f8:	ebfffffe 	bl	0 <dsa_slave_to_port>
			6f8: R_ARM_CALL	dsa_slave_to_port
 6fc:	e1a09000 	mov	r9, r0
	u16 tx_vid = dsa_8021q_tx_vid(dp->ds, dp->index);
 700:	e1c001d8 	ldrd	r0, [r0, #24]
 704:	ebfffffe 	bl	0 <dsa_8021q_tx_vid>
			704: R_ARM_CALL	dsa_8021q_tx_vid

Because we want to avoid possible performance regressions, introduce
this new function which is designed to be public.

Suggested-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 17:31:57 -07:00
Maciej Żenczykowski
09454fd0a4 Revert "ipv6: add mtu lock check in __ip6_rt_update_pmtu"
This reverts commit 19bda36c42:

| ipv6: add mtu lock check in __ip6_rt_update_pmtu
|
| Prior to this patch, ipv6 didn't do mtu lock check in ip6_update_pmtu.
| It leaded to that mtu lock doesn't really work when receiving the pkt
| of ICMPV6_PKT_TOOBIG.
|
| This patch is to add mtu lock check in __ip6_rt_update_pmtu just as ipv4
| did in __ip_rt_update_pmtu.

The above reasoning is incorrect.  IPv6 *requires* icmp based pmtu to work.
There's already a comment to this effect elsewhere in the kernel:

  $ git grep -p -B1 -A3 'RTAX_MTU lock'
  net/ipv6/route.c=4813=

  static int rt6_mtu_change_route(struct fib6_info *f6i, void *p_arg)
  ...
    /* In IPv6 pmtu discovery is not optional,
       so that RTAX_MTU lock cannot disable it.
       We still use this lock to block changes
       caused by addrconf/ndisc.
    */

This reverts to the pre-4.9 behaviour.

Cc: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Xin Long <lucien.xin@gmail.com>
Cc: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Fixes: 19bda36c42 ("ipv6: add mtu lock check in __ip6_rt_update_pmtu")
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 17:29:12 -07:00
Manivannan Sadhasivam
e426710843 net: qrtr: Do not depend on ARCH_QCOM
IPC Router protocol is also used by external modems for exchanging the QMI
messages. Hence, it doesn't always depend on Qualcomm platforms. One such
instance is the QCA6390 WLAN device connected to x86 machine.

Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 13:21:12 -07:00
Manivannan Sadhasivam
6e728f3213 net: qrtr: Add MHI transport layer
MHI is the transport layer used for communicating to the external modems.
Hence, this commit adds MHI transport layer support to QRTR for
transferring the QMI messages over IPC Router.

Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 13:21:12 -07:00
David S. Miller
3793faad7b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts were all overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 22:10:13 -07:00
Linus Torvalds
a811c1fa0a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix reference count leaks in various parts of batman-adv, from Xiyu
    Yang.

 2) Update NAT checksum even when it is zero, from Guillaume Nault.

 3) sk_psock reference count leak in tls code, also from Xiyu Yang.

 4) Sanity check TCA_FQ_CODEL_DROP_BATCH_SIZE netlink attribute in
    fq_codel, from Eric Dumazet.

 5) Fix panic in choke_reset(), also from Eric Dumazet.

 6) Fix VLAN accel handling in bnxt_fix_features(), from Michael Chan.

 7) Disallow out of range quantum values in sch_sfq, from Eric Dumazet.

 8) Fix crash in x25_disconnect(), from Yue Haibing.

 9) Don't pass pointer to local variable back to the caller in
    nf_osf_hdr_ctx_init(), from Arnd Bergmann.

10) Wireguard should use the ECN decap helper functions, from Toke
    Høiland-Jørgensen.

11) Fix command entry leak in mlx5 driver, from Moshe Shemesh.

12) Fix uninitialized variable access in mptcp's
    subflow_syn_recv_sock(), from Paolo Abeni.

13) Fix unnecessary out-of-order ingress frame ordering in macsec, from
    Scott Dial.

14) IPv6 needs to use a global serial number for dst validation just
    like ipv4, from David Ahern.

15) Fix up PTP_1588_CLOCK deps, from Clay McClure.

16) Missing NLM_F_MULTI flag in gtp driver netlink messages, from
    Yoshiyuki Kurauchi.

17) Fix a regression in that dsa user port errors should not be fatal,
    from Florian Fainelli.

18) Fix iomap leak in enetc driver, from Dejin Zheng.

19) Fix use after free in lec_arp_clear_vccs(), from Cong Wang.

20) Initialize protocol value earlier in neigh code paths when
    generating events, from Roman Mashak.

21) netdev_update_features() must be called with RTNL mutex in macsec
    driver, from Antoine Tenart.

22) Validate untrusted GSO packets even more strictly, from Willem de
    Bruijn.

23) Wireguard decrypt worker needs a cond_resched(), from Jason
    Donenfeld.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (111 commits)
  net: flow_offload: skip hw stats check for FLOW_ACTION_HW_STATS_DONT_CARE
  MAINTAINERS: put DYNAMIC INTERRUPT MODERATION in proper order
  wireguard: send/receive: use explicit unlikely branch instead of implicit coalescing
  wireguard: selftests: initalize ipv6 members to NULL to squelch clang warning
  wireguard: send/receive: cond_resched() when processing worker ringbuffers
  wireguard: socket: remove errant restriction on looping to self
  wireguard: selftests: use normal kernel stack size on ppc64
  net: ethernet: ti: am65-cpsw-nuss: fix irqs type
  ionic: Use debugfs_create_bool() to export bool
  net: dsa: Do not leave DSA master with NULL netdev_ops
  net: dsa: remove duplicate assignment in dsa_slave_add_cls_matchall_mirred
  net: stricter validation of untrusted gso packets
  seg6: fix SRH processing to comply with RFC8754
  net: mscc: ocelot: ANA_AUTOAGE_AGE_PERIOD holds a value in seconds, not ms
  net: dsa: ocelot: the MAC table on Felix is twice as large
  net: dsa: sja1105: the PTP_CLK extts input reacts on both edges
  selftests: net: tcp_mmap: fix SO_RCVLOWAT setting
  net: hsr: fix incorrect type usage for protocol variable
  net: macsec: fix rtnl locking issue
  net: mvpp2: cls: Prevent buffer overflow in mvpp2_ethtool_cls_rule_del()
  ...
2020-05-06 20:53:22 -07:00
Pablo Neira Ayuso
16f8036086 net: flow_offload: skip hw stats check for FLOW_ACTION_HW_STATS_DONT_CARE
This patch adds FLOW_ACTION_HW_STATS_DONT_CARE which tells the driver
that the frontend does not need counters, this hw stats type request
never fails. The FLOW_ACTION_HW_STATS_DISABLED type explicitly requests
the driver to disable the stats, however, if the driver cannot disable
counters, it bails out.

TCA_ACT_HW_STATS_* maintains the 1:1 mapping with FLOW_ACTION_HW_STATS_*
except by disabled which is mapped to FLOW_ACTION_HW_STATS_DISABLED
(this is 0 in tc). Add tc_act_hw_stats() to perform the mapping between
TCA_ACT_HW_STATS_* and FLOW_ACTION_HW_STATS_*.

Fixes: 319a1d1947 ("flow_offload: check for basic action hw stats type")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 20:13:10 -07:00
Oleksij Rempel
bdbdac7649 ethtool: provide UAPI for PHY master/slave configuration.
This UAPI is needed for BroadR-Reach 100BASE-T1 devices. Due to lack of
auto-negotiation support, we needed to be able to configure the
MASTER-SLAVE role of the port manually or from an application in user
space.

The same UAPI can be used for 1000BASE-T or MultiGBASE-T devices to
force MASTER or SLAVE role. See IEEE 802.3-2018:
22.2.4.3.7 MASTER-SLAVE control register (Register 9)
22.2.4.3.8 MASTER-SLAVE status register (Register 10)
40.5.2 MASTER-SLAVE configuration resolution
45.2.1.185.1 MASTER-SLAVE config value (1.2100.14)
45.2.7.10 MultiGBASE-T AN control 1 register (Register 7.32)

The MASTER-SLAVE role affects the clock configuration:

-------------------------------------------------------------------------------
When the  PHY is configured as MASTER, the PMA Transmit function shall
source TX_TCLK from a local clock source. When configured as SLAVE, the
PMA Transmit function shall source TX_TCLK from the clock recovered from
data stream provided by MASTER.

iMX6Q                     KSZ9031                XXX
------\                /-----------\        /------------\
      |                |           |        |            |
 MAC  |<----RGMII----->| PHY Slave |<------>| PHY Master |
      |<--- 125 MHz ---+-<------/  |        | \          |
------/                \-----------/        \------------/
                                               ^
                                                \-TX_TCLK

-------------------------------------------------------------------------------

Since some clock or link related issues are only reproducible in a
specific MASTER-SLAVE-role, MAC and PHY configuration, it is beneficial
to provide generic (not 100BASE-T1 specific) interface to the user space
for configuration flexibility and trouble shooting.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 17:45:45 -07:00
Florian Fainelli
050569fc83 net: dsa: Do not leave DSA master with NULL netdev_ops
When ndo_get_phys_port_name() for the CPU port was added we introduced
an early check for when the DSA master network device in
dsa_master_ndo_setup() already implements ndo_get_phys_port_name(). When
we perform the teardown operation in dsa_master_ndo_teardown() we would
not be checking that cpu_dp->orig_ndo_ops was successfully allocated and
non-NULL initialized.

With network device drivers such as virtio_net, this leads to a NPD as
soon as the DSA switch hanging off of it gets torn down because we are
now assigning the virtio_net device's netdev_ops a NULL pointer.

Fixes: da7b9e9b00 ("net: dsa: Add ndo_get_phys_port_name() for CPU port")
Reported-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Allen Pais <allen.pais@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 17:31:54 -07:00
Vladimir Oltean
657221598f net: dsa: remove duplicate assignment in dsa_slave_add_cls_matchall_mirred
This was caused by a poor merge conflict resolution on my side. The
"act = &cls->rule->action.entries[0];" assignment was already present in
the code prior to the patch mentioned below.

Fixes: e13c207528 ("net: dsa: refactor matchall mirred action to separate function")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 17:30:35 -07:00
Eric Dumazet
916e6d1a5e tcp: defer xmit timer reset in tcp_xmit_retransmit_queue()
As hinted in prior change ("tcp: refine tcp_pacing_delay()
for very low pacing rates"), it is probably best arming
the xmit timer only when all the packets have been scheduled,
rather than when the head of rtx queue has been re-sent.

This does matter for flows having extremely low pacing rates,
since their tp->tcp_wstamp_ns could be far in the future.

Note that the regular xmit path has a stronger limit
in tcp_small_queue_check(), meaning it is less likely to
go beyond the pacing horizon.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 17:29:38 -07:00
Eric Dumazet
8dc242ad66 tcp: refine tcp_pacing_delay() for very low pacing rates
With the addition of horizon feature to sch_fq, we noticed some
suboptimal behavior of extremely low pacing rate TCP flows, especially
when TCP is not aware of a drop happening in lower stacks.

Back in commit 3f80e08f40 ("tcp: add tcp_reset_xmit_timer() helper"),
tcp_pacing_delay() was added to estimate an extra delay to add to standard
rto timers.

This patch removes the skb argument from this helper and
tcp_reset_xmit_timer() because it makes more sense to simply
consider the time at which next packet is allowed to be sent,
instead of the time of whatever packet has been sent.

This avoids arming RTO timer too soon and removes
spurious horizon drops.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 17:29:38 -07:00
Ahmed Abdelsalam
0cb7498f23 seg6: fix SRH processing to comply with RFC8754
The Segment Routing Header (SRH) which defines the SRv6 dataplane is defined
in RFC8754.

RFC8754 (section 4.1) defines the SR source node behavior which encapsulates
packets into an outer IPv6 header and SRH. The SR source node encodes the
full list of Segments that defines the packet path in the SRH. Then, the
first segment from list of Segments is copied into the Destination address
of the outer IPv6 header and the packet is sent to the first hop in its path
towards the destination.

If the Segment list has only one segment, the SR source node can omit the SRH
as he only segment is added in the destination address.

RFC8754 (section 4.1.1) defines the Reduced SRH, when a source does not
require the entire SID list to be preserved in the SRH. A reduced SRH does
not contain the first segment of the related SR Policy (the first segment is
the one already in the DA of the IPv6 header), and the Last Entry field is
set to n-2, where n is the number of elements in the SR Policy.

RFC8754 (section 4.3.1.1) defines the SRH processing and the logic to
validate the SRH (S09, S10, S11) which works for both reduced and
non-reduced behaviors.

This patch updates seg6_validate_srh() to validate the SRH as per RFC8754.

Signed-off-by: Ahmed Abdelsalam <ahabdels@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 17:21:35 -07:00
Fernando Gont
969c54646a ipv6: Implement draft-ietf-6man-rfc4941bis
Implement the upcoming rev of RFC4941 (IPv6 temporary addresses):
https://tools.ietf.org/html/draft-ietf-6man-rfc4941bis-09

* Reduces the default Valid Lifetime to 2 days
  The number of extra addresses employed when Valid Lifetime was
  7 days exacerbated the stress caused on network
  elements/devices. Additionally, the motivation for temporary
  addresses is indeed privacy and reduced exposure. With a
  default Valid Lifetime of 7 days, an address that becomes
  revealed by active communication is reachable and exposed for
  one whole week. The only use case for a Valid Lifetime of 7
  days could be some application that is expecting to have long
  lived connections. But if you want to have a long lived
  connections, you shouldn't be using a temporary address in the
  first place. Additionally, in the era of mobile devices, general
  applications should nevertheless be prepared and robust to
  address changes (e.g. nodes swap wifi <-> 4G, etc.)

* Employs different IIDs for different prefixes
  To avoid network activity correlation among addresses configured
  for different prefixes

* Uses a simpler algorithm for IID generation
  No need to store "history" anywhere

Signed-off-by: Fernando Gont <fgont@si6networks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 17:00:02 -07:00
Murali Karicheri
f5dda315b6 net: hsr: fix incorrect type usage for protocol variable
Fix following sparse checker warning:-

net/hsr/hsr_slave.c:38:18: warning: incorrect type in assignment (different base types)
net/hsr/hsr_slave.c:38:18:    expected unsigned short [unsigned] [usertype] protocol
net/hsr/hsr_slave.c:38:18:    got restricted __be16 [usertype] h_proto
net/hsr/hsr_slave.c:39:25: warning: restricted __be16 degrades to integer
net/hsr/hsr_slave.c:39:57: warning: restricted __be16 degrades to integer

Signed-off-by: Murali Karicheri <m-karicheri2@ti.com>
Acked-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 15:00:20 -07:00
Jason Yan
8741e18419 net: bridge: return false in br_mrp_enabled()
Fix the following coccicheck warning:

net/bridge/br_private.h:1334:8-9: WARNING: return of 0/1 in function
'br_mrp_enabled' with return type bool

Fixes: 6536993371 ("bridge: mrp: Integrate MRP into the bridge")
Signed-off-by: Jason Yan <yanaijie@huawei.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 13:57:30 -07:00
YueHaibing
31fb4bf545 sunrpc: Remove unused function ip_map_update
commit 49b28684fd ("nfsd: Remove deprecated nfsctl system call and related code.")
left behind this, remove it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-05-06 16:00:11 -04:00
Jason Gunthorpe
11a0ae4c4b RDMA: Allow ib_client's to fail when add() is called
When a client is added it isn't allowed to fail, but all the client's have
various failure paths within their add routines.

This creates the very fringe condition where the client was added, failed
during add and didn't set the client_data. The core code will then still
call other client_data centric ops like remove(), rename(), get_nl_info(),
and get_net_dev_by_params() with NULL client_data - which is confusing and
unexpected.

If the add() callback fails, then do not call any more client ops for the
device, even remove.

Remove all the now redundant checks for NULL client_data in ops callbacks.

Update all the add() callbacks to return error codes
appropriately. EOPNOTSUPP is used for cases where the ULP does not support
the ib_device - eg because it only works with IB.

Link: https://lore.kernel.org/r/20200421172440.387069-1-leon@kernel.org
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-05-06 11:57:33 -03:00
Florian Westphal
2ab6096db2 xfrm: remove output_finish indirection from xfrm_state_afinfo
There are only two implementaions, one for ipv4 and one for ipv6.

Both are almost identical, they clear skb->cb[], set the TRANSFORMED flag
in IP(6)CB and then call the common xfrm_output() function.

By placing the IPCB handling into the common function, we avoid the need
for the output_finish indirection as the output functions can simply
use xfrm_output().

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-06 09:40:08 +02:00
Florian Westphal
f3075f48dd xfrm: remove extract_output indirection from xfrm_state_afinfo
Move this to xfrm_output.c.  This avoids the state->extract_output
indirection.

This patch also removes the duplicated __xfrm6_extract_header helper
added in an earlier patch, we can now use the one from xfrm_inout.h .

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-06 09:40:08 +02:00
Florian Westphal
ede64dd2bf xfrm: place xfrm6_local_dontfrag in xfrm.h
so next patch can re-use it from net/xfrm/xfrm_output.c without
causing a linker error when IPV6 is a module.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-06 09:40:08 +02:00
Florian Westphal
3e50ddd8b8 xfrm: expose local_rxpmtu via ipv6_stubs
We cannot call this function from the core kernel unless we would force
CONFIG_IPV6=y.

Therefore expose this via ipv6_stubs so we can call it from net/xfrm
in the followup patch.

Since the call is expected to be unlikely, no extra code for the IPV6=y
case is added and we will always eat the indirection cost.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-06 09:40:08 +02:00
Florian Westphal
171916cbd5 xfrm: move xfrm4_extract_header to common helper
The function only initializes the XFRM CB in the skb.

After previous patch xfrm4_extract_header is only called from
net/xfrm/xfrm_{input,output}.c.

Because of IPV6=m linker errors the ipv6 equivalent
(xfrm6_extract_header) was already placed in xfrm_inout.h because
we can't call functions residing in a module from the core.

So do the same for the ipv4 helper and place it next to the ipv6 one.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-06 09:40:08 +02:00
Florian Westphal
a269fbfc4e xfrm: state: remove extract_input indirection from xfrm_state_afinfo
In order to keep CONFIG_IPV6=m working, xfrm6_extract_header needs to be
duplicated.  It will be removed again in a followup change when the
remaining caller is moved to net/xfrm as well.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-06 09:40:08 +02:00
Florian Westphal
6d64be3da2 xfrm: avoid extract_output indirection for ipv4
We can use a direct call for ipv4, so move the needed functions
to net/xfrm/xfrm_output.c and call them directly.

For ipv6 the indirection can be avoided as well but it will need
a bit more work -- to ease review it will be done in another patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-06 09:40:08 +02:00
John Fastabend
81aabbb9fb bpf, sockmap: bpf_tcp_ingress needs to subtract bytes from sg.size
In bpf_tcp_ingress we used apply_bytes to subtract bytes from sg.size
which is used to track total bytes in a message. But this is not
correct because apply_bytes is itself modified in the main loop doing
the mem_charge.

Then at the end of this we have sg.size incorrectly set and out of
sync with actual sk values. Then we can get a splat if we try to
cork the data later and again try to redirect the msg to ingress. To
fix instead of trying to track msg.size do the easy thing and include
it as part of the sk_msg_xfer logic so that when the msg is moved the
sg.size is always correct.

To reproduce the below users will need ingress + cork and hit an
error path that will then try to 'free' the skmsg.

[  173.699981] BUG: KASAN: null-ptr-deref in sk_msg_free_elem+0xdd/0x120
[  173.699987] Read of size 8 at addr 0000000000000008 by task test_sockmap/5317

[  173.700000] CPU: 2 PID: 5317 Comm: test_sockmap Tainted: G          I       5.7.0-rc1+ #43
[  173.700005] Hardware name: Dell Inc. Precision 5820 Tower/002KVM, BIOS 1.9.2 01/24/2019
[  173.700009] Call Trace:
[  173.700021]  dump_stack+0x8e/0xcb
[  173.700029]  ? sk_msg_free_elem+0xdd/0x120
[  173.700034]  ? sk_msg_free_elem+0xdd/0x120
[  173.700042]  __kasan_report+0x102/0x15f
[  173.700052]  ? sk_msg_free_elem+0xdd/0x120
[  173.700060]  kasan_report+0x32/0x50
[  173.700070]  sk_msg_free_elem+0xdd/0x120
[  173.700080]  __sk_msg_free+0x87/0x150
[  173.700094]  tcp_bpf_send_verdict+0x179/0x4f0
[  173.700109]  tcp_bpf_sendpage+0x3ce/0x5d0

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/158861290407.14306.5327773422227552482.stgit@john-Precision-5820-Tower
2020-05-06 00:22:22 +02:00
John Fastabend
3e104c2381 bpf, sockmap: msg_pop_data can incorrecty set an sge length
When sk_msg_pop() is called where the pop operation is working on
the end of a sge element and there is no additional trailing data
and there _is_ data in front of pop, like the following case,

   |____________a_____________|__pop__|

We have out of order operations where we incorrectly set the pop
variable so that instead of zero'ing pop we incorrectly leave it
untouched, effectively. This can cause later logic to shift the
buffers around believing it should pop extra space. The result is
we have 'popped' more data then we expected potentially breaking
program logic.

It took us a while to hit this case because typically we pop headers
which seem to rarely be at the end of a scatterlist elements but
we can't rely on this.

Fixes: 7246d8ed4d ("bpf: helper to pop data from messages")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/158861288359.14306.7654891716919968144.stgit@john-Precision-5820-Tower
2020-05-06 00:22:15 +02:00
Roman Mashak
38212bb31f neigh: send protocol value in neighbor create notification
When a new neighbor entry has been added, event is generated but it does not
include protocol, because its value is assigned after the event notification
routine has run, so move protocol assignment code earlier.

Fixes: df9b0e30d4 ("neighbor: Add protocol attribute")
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-05 13:38:59 -07:00
William Tu
f989d546a2 erspan: Add type I version 0 support.
The Type I ERSPAN frame format is based on the barebones
IP + GRE(4-byte) encapsulation on top of the raw mirrored frame.
Both type I and II use 0x88BE as protocol type. Unlike type II
and III, no sequence number or key is required.
To creat a type I erspan tunnel device:
  $ ip link add dev erspan11 type erspan \
            local 172.16.1.100 remote 172.16.1.200 \
            erspan_ver 0

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-05 13:23:29 -07:00
YueHaibing
fea805237d net/smc: remove unused inline function smc_curs_read
commit bac6de7b63 ("net/smc: eliminate cursor read and write calls")
left behind this.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-05 12:56:52 -07:00
Karsten Graul
0a99be434d net/smc: log important pnetid and state change events
Print to system log when SMC links are available or go down, link group
state changes or pnetids are applied to and removed from devices.
The log entries are triggered by either user configuration actions or
adapter activation/deactivation events and are not expected to happen
often. The entries help SMC users to keep track of the SMC link group
status and to detect when actions are needed (like to add replacements
for failed adapters).

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-05 12:56:52 -07:00
David S. Miller
fe121e078d sch_choke: Remove classid from choke_skb_cb.
Suggested by Cong Wang.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-05 12:08:02 -07:00
YueHaibing
592138a88d net: sched: choke: Remove unused inline function choke_set_classid
There's no callers in-tree anymore since commit 5952fde10c ("net:
sched: choke: remove dead filter classify code")

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-05 12:07:43 -07:00
Magnus Karlsson
07bf2d97d1 xsk: Remove unnecessary member in xdp_umem
Remove the unnecessary member of address in struct xdp_umem as it is
only used during the umem registration. No need to carry this around
as it is not used during run-time nor when unregistering the umem.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/bpf/1588599232-24897-3-git-send-email-magnus.karlsson@intel.com
2020-05-04 22:56:26 +02:00
Magnus Karlsson
e4e5aefc11 xsk: Change two variable names for increased clarity
Change two variables names so that it is clearer what they
represent. The first one is xsk_list that in fact only contains the
list of AF_XDP sockets with a Tx component. Change this to xsk_tx_list
for improved clarity. The second variable is size in the ring
structure. One might think that this is the size of the ring, but it
is in fact the size of the umem, copied into the ring structure to
improve performance. Rename this variable umem_size to avoid any
confusion.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/bpf/1588599232-24897-2-git-send-email-magnus.karlsson@intel.com
2020-05-04 22:56:26 +02:00
Cong Wang
1a33e10e4a net: partially revert dynamic lockdep key changes
This patch reverts the folowing commits:

commit 064ff66e2b
"bonding: add missing netdev_update_lockdep_key()"

commit 53d374979e
"net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()"

commit 1f26c0d3d2
"net: fix kernel-doc warning in <linux/netdevice.h>"

commit ab92d68fc2
"net: core: add generic lockdep keys"

but keeps the addr_list_lock_key because we still lock
addr_list_lock nestedly on stack devices, unlikely xmit_lock
this is safe because we don't take addr_list_lock on any fast
path.

Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 12:05:56 -07:00
Cong Wang
8d9f73c0ad atm: fix a memory leak of vcc->user_back
In lec_arp_clear_vccs() only entry->vcc is freed, but vcc
could be installed on entry->recv_vcc too in lec_vcc_added().

This fixes the following memory leak:

unreferenced object 0xffff8880d9266b90 (size 16):
  comm "atm2", pid 425, jiffies 4294907980 (age 23.488s)
  hex dump (first 16 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 6b 6b 6b a5  ............kkk.
  backtrace:
    [<(____ptrval____)>] kmem_cache_alloc_trace+0x10e/0x151
    [<(____ptrval____)>] lane_ioctl+0x4b3/0x569
    [<(____ptrval____)>] do_vcc_ioctl+0x1ea/0x236
    [<(____ptrval____)>] svc_ioctl+0x17d/0x198
    [<(____ptrval____)>] sock_do_ioctl+0x47/0x12f
    [<(____ptrval____)>] sock_ioctl+0x2f9/0x322
    [<(____ptrval____)>] vfs_ioctl+0x1e/0x2b
    [<(____ptrval____)>] ksys_ioctl+0x61/0x80
    [<(____ptrval____)>] __x64_sys_ioctl+0x16/0x19
    [<(____ptrval____)>] do_syscall_64+0x57/0x65
    [<(____ptrval____)>] entry_SYSCALL_64_after_hwframe+0x49/0xb3

Cc: Gengming Liu <l.dmxcsnsbh@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 11:59:38 -07:00
Cong Wang
93a2014afb atm: fix a UAF in lec_arp_clear_vccs()
Gengming reported a UAF in lec_arp_clear_vccs(),
where we add a vcc socket to an entry in a per-device
list but free the socket without removing it from the
list when vcc->dev is NULL.

We need to call lec_vcc_close() to search and remove
those entries contain the vcc being destroyed. This can
be done by calling vcc->push(vcc, NULL) unconditionally
in vcc_destroy_socket().

Another issue discovered by Gengming's reproducer is
the vcc->dev may point to the static device lecatm_dev,
for which we don't need to register/unregister device,
so we can just check for vcc->dev->ops->owner.

Reported-by: Gengming Liu <l.dmxcsnsbh@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 11:59:20 -07:00
Jakub Kicinski
043b3e2276 devlink: let kernel allocate region snapshot id
Currently users have to choose a free snapshot id before
calling DEVLINK_CMD_REGION_NEW. This is potentially racy
and inconvenient.

Make the DEVLINK_ATTR_REGION_SNAPSHOT_ID optional and try
to allocate id automatically. Send a message back to the
caller with the snapshot info.

Example use:
$ devlink region new netdevsim/netdevsim1/dummy
netdevsim/netdevsim1/dummy: snapshot 1

$ id=$(devlink -j region new netdevsim/netdevsim1/dummy | \
       jq '.[][][][]')
$ devlink region dump netdevsim/netdevsim1/dummy snapshot $id
[...]
$ devlink region del netdevsim/netdevsim1/dummy snapshot $id

v4:
 - inline the notification code
v3:
 - send the notification only once snapshot creation completed.
v2:
 - don't wrap the line containing extack;
 - add a few sentences to the docs.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 11:58:31 -07:00
Jakub Kicinski
dd86fec7e0 devlink: factor out building a snapshot notification
We'll need to send snapshot info back on the socket
which requested a snapshot to be created. Factor out
constructing a snapshot description from the broadcast
notification code.

v3: new patch

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 11:58:31 -07:00
Eric Dumazet
39d010504e net_sched: sch_fq: add horizon attribute
QUIC servers would like to use SO_TXTIME, without having CAP_NET_ADMIN,
to efficiently pace UDP packets.

As far as sch_fq is concerned, we need to add safety checks, so
that a buggy application does not fill the qdisc with packets
having delivery time far in the future.

This patch adds a configurable horizon (default: 10 seconds),
and a configurable policy when a packet is beyond the horizon
at enqueue() time:
- either drop the packet (default policy)
- or cap its delivery time to the horizon.

$ tc -s -d qd sh dev eth0
qdisc fq 8022: root refcnt 257 limit 10000p flow_limit 100p buckets 1024
 orphan_mask 1023 quantum 10Kb initial_quantum 51160b low_rate_threshold 550Kbit
 refill_delay 40.0ms timer_slack 10.000us horizon 10.000s
 Sent 1234215879 bytes 837099 pkt (dropped 21, overlimits 0 requeues 6)
 backlog 0b 0p requeues 6
  flows 1191 (inactive 1177 throttled 0)
  gc 0 highprio 0 throttled 692 latency 11.480us
  pkts_too_long 0 alloc_errors 0 horizon_drops 21 horizon_caps 0

v2: fixed an overflow on 32bit kernels in fq_init(), reported
    by kbuild test robot <lkp@intel.com>

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 11:56:17 -07:00
Cong Wang
a7df4870d7 net_sched: fix tcm_parent in tc filter dump
When we tell kernel to dump filters from root (ffff:ffff),
those filters on ingress (ffff:0000) are matched, but their
true parents must be dumped as they are. However, kernel
dumps just whatever we tell it, that is either ffff:ffff
or ffff:0000:

 $ nl-cls-list --dev=dummy0 --parent=root
 cls basic dev dummy0 id none parent root prio 49152 protocol ip match-all
 cls basic dev dummy0 id :1 parent root prio 49152 protocol ip match-all
 $ nl-cls-list --dev=dummy0 --parent=ffff:
 cls basic dev dummy0 id none parent ffff: prio 49152 protocol ip match-all
 cls basic dev dummy0 id :1 parent ffff: prio 49152 protocol ip match-all

This is confusing and misleading, more importantly this is
a regression since 4.15, so the old behavior must be restored.

And, when tc filters are installed on a tc class, the parent
should be the classid, rather than the qdisc handle. Commit
edf6711c98 ("net: sched: remove classid and q fields from tcf_proto")
removed the classid we save for filters, we can just restore
this classid in tcf_block.

Steps to reproduce this:
 ip li set dev dummy0 up
 tc qd add dev dummy0 ingress
 tc filter add dev dummy0 parent ffff: protocol arp basic action pass
 tc filter show dev dummy0 root

Before this patch:
 filter protocol arp pref 49152 basic
 filter protocol arp pref 49152 basic handle 0x1
	action order 1: gact action pass
	 random type none pass val 0
	 index 1 ref 1 bind 1

After this patch:
 filter parent ffff: protocol arp pref 49152 basic
 filter parent ffff: protocol arp pref 49152 basic handle 0x1
 	action order 1: gact action pass
 	 random type none pass val 0
	 index 1 ref 1 bind 1

Fixes: a10fa20101 ("net: sched: propagate q and parent from caller down to tcf_fill_node")
Fixes: edf6711c98 ("net: sched: remove classid and q fields from tcf_proto")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 11:53:33 -07:00
Jesper Dangaard Brouer
bf6dba76d2 net: sched: fallback to qdisc noqueue if default qdisc setup fail
Currently if the default qdisc setup/init fails, the device ends up with
qdisc "noop", which causes all TX packets to get dropped.

With the introduction of sysctl net/core/default_qdisc it is possible
to change the default qdisc to be more advanced, which opens for the
possibility that Qdisc_ops->init() can fail.

This patch detect these kind of failures, and choose to fallback to
qdisc "noqueue", which is so simple that its init call will not fail.
This allows the interface to continue functioning.

V2:
As this also captures memory failures, which are transient, the
device is not kept in IFF_NO_QUEUE state.  This allows the net_device
to retry to default qdisc assignment.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 11:50:51 -07:00
Karsten Graul
649758fff3 net/smc: save SMC-R peer link_uid
During SMC-R link establishment the peers exchange the link_uid that
is used for debugging purposes. Save the peer link_uid in smc_link so it
can be retrieved by the smc_diag netlink interface.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 10:54:39 -07:00
Karsten Graul
45fa8da0bf net/smc: create improved SMC-R link_uid
The link_uid of an SMC-R link is exchanged between SMC peers and its
value can be used for debugging purposes. Create a unique link_uid
during link initialization and use it in communication with SMC-R peers.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 10:54:39 -07:00
Karsten Graul
a52bcc919b net/smc: improve termination processing
Add helper smcr_lgr_link_deactivate_all() and eliminate duplicate code.
In smc_lgr_free(), clear the smc-r links before smc_lgr_free_bufs() is
called so buffers are already prepared for free. The usage of the soft
parameter in __smc_lgr_terminate() is no longer needed, smc_lgr_free()
can be called directly. smc_lgr_terminate_sched() and
smc_smcd_terminate() set lgr->freeing to indicate that the link group
will be freed soon to avoid unnecessary schedules of the free worker.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 10:54:39 -07:00
Karsten Graul
3e0c40afce net/smc: add termination reason and handle LLC protocol violation
Allow to set the reason code for the link group termination, and set
meaningful values before termination processing is triggered. This
reason code is sent to the peer in the final delete link message.
When the LLC request or response layer receives a message type that was
not handled, drop a warning and terminate the link group.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 10:54:39 -07:00
Karsten Graul
ad6c111b8a net/smc: asymmetric link tagging
New connections must not be assigned to asymmetric links. Add asymmetric
link tagging using new link variable link_is_asym. The new helpers
smcr_lgr_set_type() and smcr_lgr_set_type_asym() are called to set the
state of the link group, and tag all links accordingly.
smcr_lgr_conn_assign_link() respects the link tagging and will not
assign new connections to links tagged as asymmetric link.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 10:54:39 -07:00
Karsten Graul
56bc3b2094 net/smc: assign link to a new connection
For new connections, assign a link from the link group, using some
simple load balancing.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 10:54:39 -07:00
Karsten Graul
f3811fd7bc net/smc: send DELETE_LINK, ALL message and wait for send to complete
Add smc_llc_send_message_wait() which uses smc_wr_tx_send_wait() to send
an LLC message and waits for the message send to complete.
smc_llc_send_link_delete_all() calls the new function to send an
DELETE_LINK,ALL LLC message. The RFC states that the sender of this type
of message needs to wait for the completion event of the message
transmission and can terminate the link afterwards.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 10:54:39 -07:00
Karsten Graul
09c61d24f9 net/smc: wait for departure of an IB message
Introduce smc_wr_tx_send_wait() to send an IB message and wait for the
tx completion event of the message. This makes sure that the message is
no longer in-flight when the function returns.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 10:54:39 -07:00
Karsten Graul
b286a0651e net/smc: handle incoming CDC validation message
Call smc_cdc_msg_validate() when a CDC message with the failover
validation bit enabled was received. Validate that the sequence number
sent with the message is one we already have received. If not, messages
were lost and the connection is terminated using a new abort_work.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 10:54:39 -07:00
Karsten Graul
29bd73dba4 net/smc: send failover validation message
When a connection is switched to a new link then a link validation
message must be sent to the peer over the new link, containing the
sequence number of the last CDC message that was sent over the old link.
The peer will validate if this sequence number is the same or lower then
the number he received, and abort the connection if messages were lost.
Add smcr_cdc_msg_send_validation() to send the message validation
message and call it when a connection was switched in
smc_switch_cursor().

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 10:54:39 -07:00
Karsten Graul
c6f02ebeea net/smc: switch connections to alternate link
Add smc_switch_conns() to switch all connections from a link that is
going down. Find an other link to switch the connections to, and
switch each connection to the new link. smc_switch_cursor() updates the
cursors of a connection to the state of the last successfully sent CDC
message. When there is no link to switch to, terminate the link group.
Call smc_switch_conns() when a link is going down.
And with the possibility that links of connections can switch adapt CDC
and TX functions to detect and handle link switches.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 10:54:39 -07:00
Karsten Graul
f0ec4f1d32 net/smc: save state of last sent CDC message
When a link goes down and all connections of this link need to be
switched to an other link then the producer cursor and the sequence of
the last successfully sent CDC message must be known. Add the two fields
to the SMC connection and update it in the tx completion handler.
And to allow matching of sequences in error cases reset the seqno to the
old value in smc_cdc_msg_send() when the actual send failed.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 10:54:39 -07:00
Aya Levin
bea0c5c942 devlink: Fix reporter's recovery condition
Devlink health core conditions the reporter's recovery with the
expiration of the grace period. This is not relevant for the first
recovery. Explicitly demand that the grace period will only apply to
recoveries other than the first.

Fixes: c8e1da0bf9 ("devlink: Add health report functionality")
Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 10:40:39 -07:00
Tuong Lien
980d69276f tipc: fix partial topology connection closure
When an application connects to the TIPC topology server and subscribes
to some services, a new connection is created along with some objects -
'tipc_subscription' to store related data correspondingly...
However, there is one omission in the connection handling that when the
connection or application is orderly shutdown (e.g. via SIGQUIT, etc.),
the connection is not closed in kernel, the 'tipc_subscription' objects
are not freed too.
This results in:
- The maximum number of subscriptions (65535) will be reached soon, new
subscriptions will be rejected;
- TIPC module cannot be removed (unless the objects  are somehow forced
to release first);

The commit fixes the issue by closing the connection if the 'recvmsg()'
returns '0' i.e. when the peer is shutdown gracefully. It also includes
the other unexpected cases.

Acked-by: Jon Maloy <jmaloy@redhat.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 10:31:13 -07:00
Florian Fainelli
86f8b1c01a net: dsa: Do not make user port errors fatal
Prior to 1d27732f41 ("net: dsa: setup and teardown ports"), we would
not treat failures to set-up an user port as fatal, but after this
commit we would, which is a regression for some systems where interfaces
may be declared in the Device Tree, but the underlying hardware may not
be present (pluggable daughter cards for instance).

Fixes: 1d27732f41 ("net: dsa: setup and teardown ports")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 10:29:47 -07:00
Karsten Graul
4dadd151b2 net/smc: enqueue local LLC messages
As SMC server, when a second link was deleted, trigger the setup of an
asymmetric link. Do this by enqueueing a local ADD_LINK message which
is processed by the LLC layer as if it were received from peer. Do the
same when a new IB port became active and a new link could be created.
smc_llc_srv_add_link_local() enqueues a local ADD_LINK message.
And smc_llc_srv_delete_link_local() is used the same way to enqueue a
local DELETE_LINK message. This is used when an IB port is no longer
active.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 16:08:07 -07:00
Karsten Graul
08ae27ddfb net/smc: delete link processing as SMC server
Add smc_llc_process_srv_delete_link() to process a DELETE_LINK request
as SMC server. When the request is to delete ALL links then terminate
the whole link group. If not, find the link to delete by its link_id,
send the DELETE_LINK request LLC message and wait for the response.
No matter if a response was received, clear the deleted link and update
the link group state.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 16:07:56 -07:00
Karsten Graul
9c4168789c net/smc: delete link processing as SMC client
Add smc_llc_process_cli_delete_link() to process a DELETE_LINK request
as SMC client. When the request is to delete ALL links then terminate
the whole link group. If not, find the link to delete by its link_id,
send the DELETE_LINK response LLC message and then clear the deleted
link. Finally determine and update the link group state.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 16:07:39 -07:00
Karsten Graul
9ec6bf19ec net/smc: llc_del_link_work and use the LLC flow for delete link
Introduce a work that is scheduled when a new DELETE_LINK LLC request is
received. The work will call either the SMC client or SMC server
DELETE_LINK processing.
And use the LLC flow framework to process incoming DELETE_LINK LLC
messages, scheduling the llc_del_link_work for those events.
With these changes smc_lgr_forget() is only called by one function and
can be migrated into smc_lgr_cleanup_early().

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 16:07:23 -07:00
Karsten Graul
c9a5d24303 net/smc: delete an asymmetric link as SMC server
When a link group moved from asymmetric to symmetric state then the
dangling asymmetric link can be deleted. Add smc_llc_find_asym_link() to
find the respective link and add smc_llc_delete_asym_link() to delete
it.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 16:07:09 -07:00
Karsten Graul
1551c95b61 net/smc: final part of add link processing as SMC server
This patch finalizes the ADD_LINK processing of new links. Send the
CONFIRM_LINK request to the peer, receive the response and set link
state to ACTIVE.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 16:07:01 -07:00
Karsten Graul
57b499242c net/smc: rkey processing for a new link as SMC server
Part of SMC server new link establishment is the exchange of rkeys for
used buffers.
Loop over all used RMB buffers and send ADD_LINK_CONTINUE LLC messages
to the peer.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 16:06:51 -07:00
Karsten Graul
2d2209f201 net/smc: first part of add link processing as SMC server
First set of functions to process an ADD_LINK LLC request as an SMC
server. Find an alternate IB device, determine the new link group type
and get the index for the new link. Then initialize the link and send
the ADD_LINK LLC message to the peer. Save the contents of the response,
ready the link, map all used buffers and register the buffers with the
IB device. If any error occurs, stop the processing and clear the link.
And call smc_llc_srv_add_link() in af_smc.c to start second link
establishment after the initial link of a link group was created.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 16:06:40 -07:00
Karsten Graul
b1570a87f5 net/smc: final part of add link processing as SMC client
This patch finalizes the ADD_LINK processing of new links. Receive the
CONFIRM_LINK request from peer, complete the link initialization,
register all used buffers with the IB device and finally send the
CONFIRM_LINK response, which completes the ADD_LINK processing.
And activate smc_llc_cli_add_link() in af_smc.c.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 16:06:25 -07:00
Karsten Graul
87f88cda21 net/smc: rkey processing for a new link as SMC client
Part of the SMC client new link establishment process is the exchange of
rkeys for all used buffers.
Add new LLC message type ADD_LINK_CONTINUE which is used to exchange
rkeys of all current RMB buffers. Add functions to iterate over all
used RMB buffers of the link group, and implement the ADD_LINK_CONTINUE
processing.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 16:06:09 -07:00
Karsten Graul
336ba09f2e net/smc: first part of add link processing as SMC client
First set of functions to process an ADD_LINK LLC request as an SMC
client. Find an alternate IB device, determine the new link group type
and get the index for the new link. Then ready the link, map the buffers
and send an ADD_LINK LLC response. If any error occurs, send a reject
LLC message and terminate the processing.
Add smc_llc_alloc_alt_link() to find a free link index for a new link,
depending on the new link group type.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 16:05:15 -07:00
Eric Dumazet
2761121af8 net_sched: sch_skbprio: add message validation to skbprio_change()
Do not assume the attribute has the right size.

Fixes: aea5f654e6 ("net/sched: add skbprio scheduler")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 15:52:13 -07:00
Eric Dumazet
348e289b0f net_sched: sch_fq: perform a prefetch() earlier
The prefetch() done in fq_dequeue() can be done a bit earlier
after the refactoring of the code done in the prior patch.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 15:50:45 -07:00
Eric Dumazet
c288b0ca86 net_sched: sch_fq: do not call fq_peek() twice per packet
This refactors the code to not call fq_peek() from fq_dequeue_head()
since the caller can provide the skb.

Also rename fq_dequeue_head() to fq_dequeue_skb() because 'head' is
a bit vague, given the skb could come from t_root rb-tree.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 15:50:45 -07:00
Eric Dumazet
82a0aa53b5 net_sched: sch_fq: use bulk freeing in fq_gc()
fq_gc() already builds a small array of pointers, so using
kmem_cache_free_bulk() needs very little change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 15:50:31 -07:00
Eric Dumazet
7ba0537c2b net_sched: sch_fq: change fq_flow size/layout
sizeof(struct fq_flow) is 112 bytes on 64bit arches.

This means that half of them use two cache lines, but 50% use
three cache lines.

This patch adds cache line alignment, and makes sure that only
the first cache line is touched by fq_enqueue(), which is more
expensive that fq_dequeue() in general.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 15:50:31 -07:00
Eric Dumazet
dde0a648fc net_sched: sch_fq: avoid touching f->next from fq_gc()
A significant amount of cpu cycles is spent in fq_gc()

When fq_gc() does its lookup in the rb-tree, it needs the
following fields from struct fq_flow :

f->sk       (lookup key in the rb-tree)
f->fq_node  (anchor in the rb-tree)
f->next     (used to determine if the flow is detached)
f->age      (used to determine if the flow is candidate for gc)

This unfortunately spans two cache lines (assuming 64 bytes cache lines)

We can avoid using f->next, if we use the low order bit of f->{age|tail}

This low order bit is 0, if f->tail points to an sk_buff.
We set the low order bit to 1, if the union contains a jiffies value.

Combined with the following patch, this makes sure we only need
to bring into cpu caches one cache line per flow.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-03 15:50:31 -07:00
Dmitry Yakunin
ee1bd483cc inet_diag: bc: read cgroup id only for full sockets
Fix bug introduced by commit b1f3e43dbf ("inet_diag: add support for
cgroup filter").

Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Reported-by: syzbot+ee80f840d9bf6893223b@syzkaller.appspotmail.com
Reported-by: syzbot+13bef047dbfffa5cd1af@syzkaller.appspotmail.com
Fixes: b1f3e43dbf ("inet_diag: add support for cgroup filter")
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-02 16:42:51 -07:00
David S. Miller
40b94224c3 smc: Remove unused function.
net/smc/smc_llc.c:544:12: warning: ‘smc_llc_alloc_alt_link’ defined but not used [-Wunused-function]
 static int smc_llc_alloc_alt_link(struct smc_link_group *lgr,
            ^~~~~~~~~~~~~~~~~~~~~~

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-02 16:41:06 -07:00
Jason Gunthorpe
451c9fc77a Merge branch 'mellanox/mlx5-next' into rdma.git for-next
From the mlx5-next branch at
  git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux

Required for dependencies in following patches

* mellanox/mlx5-next:
  net/mlx5: Add support to get lag physical port
  net/mlx5: Change lag mutex lock to spin lock
  bonding: Implement ndo_get_xmit_slave
  bonding: Add array of all slaves
  bonding: Add function to get the xmit slave in active-backup mode
  bonding: Add helper function to get the xmit slave in rr mode
  bonding: Add helper function to get the xmit slave based on hash
  bonding/alb: Add helper functions to get the xmit slave
  bonding: Rename slave_arr to usable_slaves
  bonding: Export skip slave logic to function
  net/core: Introduce netdev_get_xmit_slave
2020-05-02 20:18:45 -03:00
Linus Torvalds
29a47f456d NFS client bugfixes for Linux 5.7
Highlights include:
 
 Stable fixes
 - fix handling of backchannel binding in BIND_CONN_TO_SESSION
 
 Bugfixes
 - Fix a credential use-after-free issue in pnfs_roc()
 - Fix potential posix_acl refcnt leak in nfs3_set_acl
 - defer slow parts of rpc_free_client() to a workqueue
 - Fix an Oopsable race in __nfs_list_for_each_server()
 - Fix trace point use-after-free race
 - Regression: the RDMA client no longer responds to server disconnect requests
 - Fix return values of xdr_stream_encode_item_{present, absent}
 - _pnfs_return_layout() must always wait for layoutreturn completion
 
 Cleanups
 - Remove unreachable error conditions
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl6tczsACgkQZwvnipYK
 APKHWg//QGx2Tolj5dh2jBHa47A5/SYnJxCZAA0/fWdwRtFkW3HyyGne1jU86do2
 SMAVpBpri1WJPt5d3DH66gu4l4UxG1h84s7QP4lGfSa85EmtLh+LoZQCZRqYoDOo
 JAMzWctELu1TUpaa1N5Dhg/qMtMy6ulRMWgzTLqB9a/pQa3onugTK6W7xiut2prj
 PBfFq7N9XXmPboSeGV9bR4L8XKSbTCLEt3U1F2zAGU7UUINvDfpjEXq7BHYCewKL
 ObPW6EWZksyna16H8i/xGWoKgE4JFVjMwQAP7UdDBi+FW9RI6UpTBoR6z9N748j0
 jEocDbI21wgnwmtrVTbzsYm6ttHl4D4egoNxn7m5zjxTU4Ba/RQG2aaHUGFOYpJj
 1FI1f6V1Y5v4mJajdsEH+pGW/4vK/4YMR+7YHJ/hYU/WiXjLf7onIIifdWt4SQdo
 lvZbGcx6IAHYUA4lI7hkcvrK4bbqAnPLFq28nlUWEID5q5D+nA1ZR9iN0FToviDy
 FYyhQzyfD1kt98SV1DjWUqvDDd6IB64iDZTXGmtWvj6c2nbezGiFffvtzUL5LFxY
 QfI8lkpmUyt1EiWlZWhtOh4zsiM5yMZkJB/3RJv3RMmswizSSAHdgCKWhdLpX0bl
 TG1L8yEmcTc5ANS37EhlpcBNbfYw7oIF/OXuReTSRoMQl5hxjfY=
 =w0zk
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable fixes:
   - fix handling of backchannel binding in BIND_CONN_TO_SESSION

  Bugfixes:
   - Fix a credential use-after-free issue in pnfs_roc()
   - Fix potential posix_acl refcnt leak in nfs3_set_acl
   - defer slow parts of rpc_free_client() to a workqueue
   - Fix an Oopsable race in __nfs_list_for_each_server()
   - Fix trace point use-after-free race
   - Regression: the RDMA client no longer responds to server disconnect
     requests
   - Fix return values of xdr_stream_encode_item_{present, absent}
   - _pnfs_return_layout() must always wait for layoutreturn completion

  Cleanups:
   - Remove unreachable error conditions"

* tag 'nfs-for-5.7-4' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS: Fix a race in __nfs_list_for_each_server()
  NFSv4.1: fix handling of backchannel binding in BIND_CONN_TO_SESSION
  SUNRPC: defer slow parts of rpc_free_client() to a workqueue.
  NFSv4: Remove unreachable error condition due to rpc_run_task()
  SUNRPC: Remove unreachable error condition
  xprtrdma: Fix use of xdr_stream_encode_item_{present, absent}
  xprtrdma: Fix trace point use-after-free race
  xprtrdma: Restore wake-up-all to rpcrdma_cm_event_handler()
  nfs: Fix potential posix_acl refcnt leak in nfs3_set_acl
  NFS/pnfs: Fix a credential use-after-free issue in pnfs_roc()
  NFS/pnfs: Ensure that _pnfs_return_layout() waits for layoutreturn completion
2020-05-02 11:24:01 -07:00
David S. Miller
115506fea4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-05-01 (v2)

The following pull-request contains BPF updates for your *net-next* tree.

We've added 61 non-merge commits during the last 6 day(s) which contain
a total of 153 files changed, 6739 insertions(+), 3367 deletions(-).

The main changes are:

1) pulled work.sysctl from vfs tree with sysctl bpf changes.

2) bpf_link observability, from Andrii.

3) BTF-defined map in map, from Andrii.

4) asan fixes for selftests, from Andrii.

5) Allow bpf_map_lookup_elem for SOCKMAP and SOCKHASH, from Jakub.

6) production cloudflare classifier as a selftes, from Lorenz.

7) bpf_ktime_get_*_ns() helper improvements, from Maciej.

8) unprivileged bpftool feature probe, from Quentin.

9) BPF_ENABLE_STATS command, from Song.

10) enable bpf_[gs]etsockopt() helpers for sock_ops progs, from Stanislav.

11) enable a bunch of common helpers for cg-device, sysctl, sockopt progs,
 from Stanislav.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 17:02:27 -07:00
Karsten Graul
b45e7f98ab net/smc: llc_add_link_work to handle ADD_LINK LLC requests
Introduce a work that is scheduled when a new ADD_LINK LLC request is
received. The work will call either the SMC client or SMC server
ADD_LINK processing.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:20:05 -07:00
Karsten Graul
8574cf4055 net/smc: allocate index for a new link
Add smc_llc_alloc_alt_link() to find a free link index for a new link,
depending on the new link group type. And update constants for the
maximum number of links to 3 (2 symmetric and 1 dangling asymmetric link).
These maximum numbers are the same as used by other implementations of the
SMC-R protocol.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:20:05 -07:00
Karsten Graul
6c868a3edc net/smc: introduce smc_pnet_find_alt_roce()
Introduce a new function in smc_pnet.c that searches for an alternate
IB device, using an existing link group and a primary IB device. The
alternate IB device needs to be active and must have the same PNETID
as the link group.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:20:05 -07:00
Karsten Graul
33d203302d net/smc: remove DELETE LINK processing from smc_core.c
Support for multiple links makes the former DELETE LINK processing
obsolete which sent one DELETE_LINK LLC message for each single link.
Remove this processing from smc_core.c.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:20:04 -07:00
Karsten Graul
87523930a1 net/smc: take link down instead of terminating the link group
Use the introduced link down processing in all places where the link
group is terminated and take down the affected link only.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:20:04 -07:00
Karsten Graul
541afa10c1 net/smc: add smcr_port_err() and smcr_link_down() processing
Call smcr_port_err() when an IB event reports an inactive IB device.
smcr_port_err() calls smcr_link_down() for all affected links.
smcr_link_down() either triggers the local DELETE_LINK processing, or
sends an DELETE_LINK LLC message to the SMC server to initiate the
processing.
The old handler function smc_port_terminate() is removed.
Add helper smcr_link_down_cond() to take a link down conditionally, and
smcr_link_down_cond_sched() to schedule the link_down processing to a
work.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:20:04 -07:00
Karsten Graul
1f90a05d9f net/smc: add smcr_port_add() and smcr_link_up() processing
Call smcr_port_add() when an IB event reports a new active IB device.
smcr_port_add() will start a work which either triggers the local
ADD_LINK processing, or send an ADD_LINK LLC message to the SMC server
to initiate the processing.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:20:04 -07:00
Karsten Graul
35dcf7ec02 net/smc: remember PNETID of IB device for later device matching
The PNETID is needed to find an alternate link for a link group.
Save the PNETID of the link that is used to create the link group for
later device matching.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:20:04 -07:00
Karsten Graul
d550066776 net/smc: mutex to protect the lgr against parallel reconfigurations
Introduce llc_conf_mutex in the link group which is used to protect the
buffers and lgr states against parallel link reconfiguration.
This ensures that new connections do not start to register buffers with
the links of a link group when link creation or termination is running.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:20:04 -07:00
Karsten Graul
fbed3b37c8 net/smc: extend smc_llc_send_add_link() and smc_llc_send_delete_link()
All LLC sends are done from worker context only, so remove the prep
functions which were used to build the message before it was sent, and
add the function content into the respective send function
smc_llc_send_add_link() and smc_llc_send_delete_link().
Extend smc_llc_send_add_link() to include the qp_mtu value in the LLC
message, which is needed to establish a link after the initial link was
created. Extend smc_llc_send_delete_link() to contain a link_id and a
reason code for the link deletion in the LLC message, which is needed
when a specific link should be deleted.
And add the list of existing DELETE_LINK reason codes.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:20:04 -07:00
Karsten Graul
fb33d27727 net/smc: map and register buffers for a new link
Introduce support to map and register all current buffers for a new
link. smcr_buf_map_lgr() will map used buffers for a new link and
smcr_buf_reg_lgr() can be called to register used buffers on the
IB device of the new link.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:20:04 -07:00
Karsten Graul
4a3641c160 net/smc: unmapping of buffers to support multiple links
With the support of multiple links that are created and cleared there
is a need to unmap one link from all current buffers. Add unmapping by
link and by rmb. And make smcr_link_clear() available to be called from
the LLC layer.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:20:04 -07:00
Karsten Graul
7562a13d5a net/smc: multiple link support for rmb buffer registration
The CONFIRM_RKEY LLC processing handles all links in one LLC message.
Move the call to this processing out of smcr_link_reg_rmb() which does
processing per link, into smcr_lgr_reg_rmbs() which is responsible for
link group level processing. Move smcr_link_reg_rmb() into module
smc_core.c.
>From af_smc.c now call smcr_lgr_reg_rmbs() to register new rmbs on all
available links.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:20:04 -07:00
Po Liu
d29bdd69ec net: schedule: add action gate offloading
Add the gate action to the flow action entry. Add the gate parameters to
the tc_setup_flow_action() queueing to the entries of flow_action_entry
array provide to the driver.

Signed-off-by: Po Liu <Po.Liu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:08:19 -07:00
Po Liu
a51c328df3 net: qos: introduce a gate control flow action
Introduce a ingress frame gate control flow action.
Tc gate action does the work like this:
Assume there is a gate allow specified ingress frames can be passed at
specific time slot, and be dropped at specific time slot. Tc filter
chooses the ingress frames, and tc gate action would specify what slot
does these frames can be passed to device and what time slot would be
dropped.
Tc gate action would provide an entry list to tell how much time gate
keep open and how much time gate keep state close. Gate action also
assign a start time to tell when the entry list start. Then driver would
repeat the gate entry list cyclically.
For the software simulation, gate action requires the user assign a time
clock type.

Below is the setting example in user space. Tc filter a stream source ip
address is 192.168.0.20 and gate action own two time slots. One is last
200ms gate open let frame pass another is last 100ms gate close let
frames dropped. When the ingress frames have reach total frames over
8000000 bytes, the excessive frames will be dropped in that 200000000ns
time slot.

> tc qdisc add dev eth0 ingress

> tc filter add dev eth0 parent ffff: protocol ip \
	   flower src_ip 192.168.0.20 \
	   action gate index 2 clockid CLOCK_TAI \
	   sched-entry open 200000000 -1 8000000 \
	   sched-entry close 100000000 -1 -1

> tc chain del dev eth0 ingress chain 0

"sched-entry" follow the name taprio style. Gate state is
"open"/"close". Follow with period nanosecond. Then next item is internal
priority value means which ingress queue should put. "-1" means
wildcard. The last value optional specifies the maximum number of
MSDU octets that are permitted to pass the gate during the specified
time interval.
Base-time is not set will be 0 as default, as result start time would
be ((N + 1) * cycletime) which is the minimal of future time.

Below example shows filtering a stream with destination mac address is
10:00:80:00:00:00 and ip type is ICMP, follow the action gate. The gate
action would run with one close time slot which means always keep close.
The time cycle is total 200000000ns. The base-time would calculate by:

 1357000000000 + (N + 1) * cycletime

When the total value is the future time, it will be the start time.
The cycletime here would be 200000000ns for this case.

> tc filter add dev eth0 parent ffff:  protocol ip \
	   flower skip_hw ip_proto icmp dst_mac 10:00:80:00:00:00 \
	   action gate index 12 base-time 1357000000000 \
	   sched-entry close 200000000 -1 -1 \
	   clockid CLOCK_TAI

Signed-off-by: Po Liu <Po.Liu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:08:19 -07:00
Arnd Bergmann
dc30b4059f drop_monitor: work around gcc-10 stringop-overflow warning
The current gcc-10 snapshot produces a false-positive warning:

net/core/drop_monitor.c: In function 'trace_drop_common.constprop':
cc1: error: writing 8 bytes into a region of size 0 [-Werror=stringop-overflow=]
In file included from net/core/drop_monitor.c:23:
include/uapi/linux/net_dropmon.h:36:8: note: at offset 0 to object 'entries' with size 4 declared here
   36 |  __u32 entries;
      |        ^~~~~~~

I reported this in the gcc bugzilla, but in case it does not get
fixed in the release, work around it by using a temporary variable.

Fixes: 9a8afc8d39 ("Network Drop Monitor: Adding drop monitor implementation & Netlink protocol")
Link: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94881
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 15:45:16 -07:00
Jakub Kicinski
610a9346c1 devlink: fix return value after hitting end in region read
Commit d5b90e99e1 ("devlink: report 0 after hitting end in region read")
fixed region dump, but region read still returns a spurious error:

$ devlink region read netdevsim/netdevsim1/dummy snapshot 0 addr 0 len 128
0000000000000000 a6 f4 c4 1c 21 35 95 a6 9d 34 c3 5b 87 5b 35 79
0000000000000010 f3 a0 d7 ee 4f 2f 82 7f c6 dd c4 f6 a5 c3 1b ae
0000000000000020 a4 fd c8 62 07 59 48 03 70 3b c7 09 86 88 7f 68
0000000000000030 6f 45 5d 6d 7d 0e 16 38 a9 d0 7a 4b 1e 1e 2e a6
0000000000000040 e6 1d ae 06 d6 18 00 85 ca 62 e8 7e 11 7e f6 0f
0000000000000050 79 7e f7 0f f3 94 68 bd e6 40 22 85 b6 be 6f b1
0000000000000060 af db ef 5e 34 f0 98 4b 62 9a e3 1b 8b 93 fc 17
devlink answers: Invalid argument
0000000000000070 61 e8 11 11 66 10 a5 f7 b1 ea 8d 40 60 53 ed 12

This is a minimal fix, I'll follow up with a restructuring
so we don't have two checks for the same condition.

Fixes: fdd41ec21e ("devlink: Return right error code in case of errors for region read")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 15:25:35 -07:00
Jesper Dangaard Brouer
41a46913be net: fix skb_panic to output real address
In skb_panic() the real pointer values are really needed to diagnose
issues, e.g. data and head are related (to calculate headroom). The
hashed versions of the addresses doesn't make much sense here. The
patch use the printk specifier %px to print the actual address.

The printk documentation on %px:
https://www.kernel.org/doc/html/latest/core-api/printk-formats.html#unmodified-addresses

Fixes: ad67b74d24 ("printk: hash addresses printed with %p")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 15:19:59 -07:00
Cambda Zhu
f0628c524f net: Replace the limit of TCP_LINGER2 with TCP_FIN_TIMEOUT_MAX
This patch changes the behavior of TCP_LINGER2 about its limit. The
sysctl_tcp_fin_timeout used to be the limit of TCP_LINGER2 but now it's
only the default value. A new macro named TCP_FIN_TIMEOUT_MAX is added
as the limit of TCP_LINGER2, which is 2 minutes.

Since TCP_LINGER2 used sysctl_tcp_fin_timeout as the default value
and the limit in the past, the system administrator cannot set the
default value for most of sockets and let some sockets have a greater
timeout. It might be a mistake that let the sysctl to be the limit of
the TCP_LINGER2. Maybe we can add a new sysctl to set the max of
TCP_LINGER2, but FIN-WAIT-2 timeout is usually no need to be too long
and 2 minutes are legal considering TCP specs.

Changes in v3:
- Remove the new socket option and change the TCP_LINGER2 behavior so
  that the timeout can be set to value between sysctl_tcp_fin_timeout
  and 2 minutes.

Changes in v2:
- Add int overflow check for the new socket option.

Changes in v1:
- Add a new socket option to set timeout greater than
  sysctl_tcp_fin_timeout.

Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 15:12:14 -07:00
David Ahern
8f34e53b60 ipv6: Use global sernum for dst validation with nexthop objects
Nik reported a bug with pcpu dst cache when nexthop objects are
used illustrated by the following:
    $ ip netns add foo
    $ ip -netns foo li set lo up
    $ ip -netns foo addr add 2001:db8:11::1/128 dev lo
    $ ip netns exec foo sysctl net.ipv6.conf.all.forwarding=1
    $ ip li add veth1 type veth peer name veth2
    $ ip li set veth1 up
    $ ip addr add 2001:db8:10::1/64 dev veth1
    $ ip li set dev veth2 netns foo
    $ ip -netns foo li set veth2 up
    $ ip -netns foo addr add 2001:db8:10::2/64 dev veth2
    $ ip -6 nexthop add id 100 via 2001:db8:10::2 dev veth1
    $ ip -6 route add 2001:db8:11::1/128 nhid 100

    Create a pcpu entry on cpu 0:
    $ taskset -a -c 0 ip -6 route get 2001:db8:11::1

    Re-add the route entry:
    $ ip -6 ro del 2001:db8:11::1
    $ ip -6 route add 2001:db8:11::1/128 nhid 100

    Route get on cpu 0 returns the stale pcpu:
    $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
    RTNETLINK answers: Network is unreachable

    While cpu 1 works:
    $ taskset -a -c 1 ip -6 route get 2001:db8:11::1
    2001:db8:11::1 from :: via 2001:db8:10::2 dev veth1 src 2001:db8:10::1 metric 1024 pref medium

Conversion of FIB entries to work with external nexthop objects
missed an important difference between IPv4 and IPv6 - how dst
entries are invalidated when the FIB changes. IPv4 has a per-network
namespace generation id (rt_genid) that is bumped on changes to the FIB.
Checking if a dst_entry is still valid means comparing rt_genid in the
rtable to the current value of rt_genid for the namespace.

IPv6 also has a per network namespace counter, fib6_sernum, but the
count is saved per fib6_node. With the per-node counter only dst_entries
based on fib entries under the node are invalidated when changes are
made to the routes - limiting the scope of invalidations. IPv6 uses a
reference in the rt6_info, 'from', to track the corresponding fib entry
used to create the dst_entry. When validating a dst_entry, the 'from'
is used to backtrack to the fib6_node and check the sernum of it to the
cookie passed to the dst_check operation.

With the inline format (nexthop definition inline with the fib6_info),
dst_entries cached in the fib6_nh have a 1:1 correlation between fib
entries, nexthop data and dst_entries. With external nexthops, IPv6
looks more like IPv4 which means multiple fib entries across disparate
fib6_nodes can all reference the same fib6_nh. That means validation
of dst_entries based on external nexthops needs to use the IPv4 format
- the per-network namespace counter.

Add sernum to rt6_info and set it when creating a pcpu dst entry. Update
rt6_get_cookie to return sernum if it is set and update dst_check for
IPv6 to look for sernum set and based the check on it if so. Finally,
rt6_get_pcpu_route needs to validate the cached entry before returning
a pcpu entry (similar to the rt_cache_valid calls in __mkroute_input and
__mkroute_output for IPv4).

This problem only affects routes using the new, external nexthops.

Thanks to the kbuild test robot for catching the IS_ENABLED needed
around rt_genid_ipv6 before I sent this out.

Fixes: 5b98324ebe ("ipv6: Allow routes to use nexthop objects")
Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 12:46:30 -07:00
Stanislav Fomichev
beecf11bc2 bpf: Bpf_{g,s}etsockopt for struct bpf_sock_addr
Currently, bpf_getsockopt and bpf_setsockopt helpers operate on the
'struct bpf_sock_ops' context in BPF_PROG_TYPE_SOCK_OPS program.
Let's generalize them and make them available for 'struct bpf_sock_addr'.
That way, in the future, we can allow those helpers in more places.

As an example, let's expose those 'struct bpf_sock_addr' based helpers to
BPF_CGROUP_INET{4,6}_CONNECT hooks. That way we can override CC before the
connection is made.

v3:
* Expose custom helpers for bpf_sock_addr context instead of doing
  generic bpf_sock argument (as suggested by Daniel). Even with
  try_socket_lock that doesn't sleep we have a problem where context sk
  is already locked and socket lock is non-nestable.

v2:
* s/BPF_PROG_TYPE_CGROUP_SOCKOPT/BPF_PROG_TYPE_SOCK_OPS/

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200430233152.199403-1-sdf@google.com
2020-05-01 12:44:28 -07:00
Mauro Carvalho Chehab
c4ea03fdfd docs: networking: convert x25.txt to ReST
Not much to be done here:
- add SPDX header;
- add a document title;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 12:24:40 -07:00
Mauro Carvalho Chehab
883780af72 docs: networking: convert x25-iface.txt to ReST
Not much to be done here:

- add SPDX header;
- adjust title markup;
- remove a tail whitespace;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 12:24:40 -07:00
Maor Gottlieb
cff9f12b18 net/core: Introduce netdev_get_xmit_slave
Add new ndo to get the xmit slave of master device. The reference
counters are not incremented so the caller must be careful with locks.
User can ask to get the xmit slave assume all the slaves can
transmit by set all_slaves arg to true.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01 12:15:37 -07:00
David S. Miller
b6f875a8d9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Do not update the UDP checksum when it's zero, from Guillaume Nault.

2) Fix return of local variable in nf_osf, from Arnd Bergmann.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 18:07:16 -07:00
Johannes Berg
d07dcf9aad netlink: add infrastructure to expose policies to userspace
Add, and use in generic netlink, helpers to dump out a netlink
policy to userspace, including all the range validation data,
nested policies etc.

This lets userspace discover what the kernel understands.

For families/commands other than generic netlink, the helpers
need to be used directly in an appropriate command, or we can
add some infrastructure (a new netlink family) that those can
register their policies with for introspection. I'm not that
familiar with non-generic netlink, so that's left out for now.

The data exposed to userspace also includes min and max length
for binary/string data, I've done that instead of letting the
userspace tools figure out whether min/max is intended based
on the type so that we can extend this later in the kernel, we
might want to just use the range data for example.

Because of this, I opted to not directly expose the NLA_*
values, even if some of them are already exposed via BPF, as
with min/max length we don't need to have different types here
for NLA_BINARY/NLA_MIN_LEN/NLA_EXACT_LEN, we just make them
all NL_ATTR_TYPE_BINARY with min/max length optionally set.

Similarly, we don't really need NLA_MSECS, and perhaps can
remove it in the future - but not if we encode it into the
userspace API now. It gets mapped to NL_ATTR_TYPE_U64 here.

Note that the exposing here corresponds to the strict policy
interpretation, and NLA_UNSPEC items are omitted entirely.
To get those, change them to NLA_MIN_LEN which behaves in
exactly the same way, but is exposed.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 17:51:42 -07:00
Johannes Berg
c7721c05a6 netlink: remove NLA_EXACT_LEN_WARN
Use a validation type instead, so we can later expose
the NLA_* values to userspace for policy descriptions.

Some transformations were done with this spatch:

    @@
    identifier p;
    expression X, L, A;
    @@
    struct nla_policy p[X] = {
    [A] =
    -{ .type = NLA_EXACT_LEN_WARN, .len = L },
    +NLA_POLICY_EXACT_LEN_WARN(L),
    ...
    };

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 17:51:42 -07:00
Johannes Berg
d15da2a2e8 nl80211: link recursive netlink nested policy
Now that we have limited recursive policy validation to avoid
stack overflows, change nl80211 to actually link the nested
policy (linking back to itself eventually), which allows some
code cleanups.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 17:51:41 -07:00
Johannes Berg
47a1494b82 netlink: remove type-unsafe validation_data pointer
In the netlink policy, we currently have a void *validation_data
that's pointing to different things:
 * a u32 value for bitfield32,
 * the netlink policy for nested/nested array
 * the string for NLA_REJECT

Remove the pointer and place appropriate type-safe items in the
union instead.

While at it, completely dissolve the pointer for the bitfield32
case and just put the value there directly.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 17:51:41 -07:00
Ido Schimmel
7979457b1d net: bridge: vlan: Add a schedule point during VLAN processing
User space can request to delete a range of VLANs from a bridge slave in
one netlink request. For each deleted VLAN the FDB needs to be traversed
in order to flush all the affected entries.

If a large range of VLANs is deleted and the number of FDB entries is
large or the FDB lock is contented, it is possible for the kernel to
loop through the deleted VLANs for a long time. In case preemption is
disabled, this can result in a soft lockup.

Fix this by adding a schedule point after each VLAN is deleted to yield
the CPU, if needed. This is safe because the VLANs are traversed in
process context.

Fixes: bdced7ef78 ("bridge: support for multiple vlans and vlan ranges in setlink and dellink requests")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Tested-by: Stefan Priebe - Profihost AG <s.priebe@profihost.ag>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 17:45:41 -07:00
Taehee Yoo
34a9c361dd hsr: remove hsr interface if all slaves are removed
When all hsr slave interfaces are removed, hsr interface doesn't work.
At that moment, it's fine to remove an unused hsr interface automatically
for saving resources.
That's a common behavior of virtual interfaces.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 13:25:18 -07:00
Eric Dumazet
a70437cc09 tcp: add hrtimer slack to sack compression
Add a sysctl to control hrtimer slack, default of 100 usec.

This gives the opportunity to reduce system overhead,
and help very short RTT flows.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 13:24:01 -07:00
Eric Dumazet
ccd0628fca tcp: tcp_sack_new_ofo_skb() should be more conservative
Currently, tcp_sack_new_ofo_skb() sends an ack if prior
acks were 'compressed', if room has to be made in tp->selective_acks[]

But there is no guarantee all four sack ranges can be included
in SACK option. As a matter of fact, when TCP timestamps option
is used, only three SACK ranges can be included.

Lets assume only two ranges can be included, and force the ack:

- When we touch more than 2 ranges in the reordering
  done if tcp_sack_extend() could be done.

- If we have at least 2 ranges when adding a new one.

This enforces that before a range is in third or fourth
position, at least one ACK packet included it in first/second
position.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 13:24:01 -07:00
Eric Dumazet
2b19585012 tcp: add tp->dup_ack_counter
In commit 86de5921a3 ("tcp: defer SACK compression after DupThresh")
I added a TCP_FASTRETRANS_THRESH bias to tp->compressed_ack in order
to enable sack compression only after 3 dupacks.

Since we plan to relax this rule for flows that involve
stacks not requiring this old rule, this patch adds
a distinct tp->dup_ack_counter.

This means the TCP_FASTRETRANS_THRESH value is now used
in a single location that a future patch can adjust:

	if (tp->dup_ack_counter < TCP_FASTRETRANS_THRESH) {
		tp->dup_ack_counter++;
		goto send_now;
	}

This patch also introduces tcp_sack_compress_send_ack()
helper to ease following patch comprehension.

This patch refines LINUX_MIB_TCPACKCOMPRESSED to not
count the acks that we had to send if the timer expires
or tcp_sack_compress_send_ack() is sending an ack.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 13:24:01 -07:00
Mauro Carvalho Chehab
4ac0b122ee docs: networking: convert tproxy.txt to ReST
- add SPDX header;
- adjust title markup;
- mark code blocks and literals as such;
- adjust identation, whitespaces and blank lines where needed;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:56:38 -07:00
Mauro Carvalho Chehab
9f72374cb5 docs: networking: convert rxrpc.txt to ReST
- add SPDX header;
- adjust title markup;
- use autonumbered list markups;
- mark code blocks and literals as such;
- mark tables as such;
- adjust identation, whitespaces and blank lines where needed;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:56:38 -07:00
Mauro Carvalho Chehab
66d495d0a5 docs: networking: convert radiotap-headers.txt to ReST
- add SPDX header;
- adjust title markup;
- mark code blocks and literals as such;
- adjust identation, whitespaces and blank lines where needed;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:56:37 -07:00
Mauro Carvalho Chehab
c1e4535f24 docs: networking: convert pktgen.txt to ReST
- add SPDX header;
- adjust title markup;
- use bold markups on a few places;
- mark code blocks and literals as such;
- adjust identation, whitespaces and blank lines where needed;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:56:37 -07:00
Mauro Carvalho Chehab
429ff87bca docs: networking: convert mac80211-injection.txt to ReST
- add SPDX header;
- adjust title markup;
- mark code blocks and literals as such;
- mark tables as such;
- adjust identation, whitespaces and blank lines;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:56:36 -07:00
Mauro Carvalho Chehab
40e79150c1 docs: networking: convert lapb-module.txt to ReST
- add SPDX header;
- adjust title markup;
- mark code blocks and literals as such;
- mark tables as such;
- adjust identation, whitespaces and blank lines;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:56:35 -07:00
Dmitry Yakunin
b1f3e43dbf inet_diag: add support for cgroup filter
This patch adds ability to filter sockets based on cgroup v2 ID.
Such filter is helpful in ss utility for filtering sockets by
cgroup pathname.

Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:54:02 -07:00
Dmitry Yakunin
6e3a401fc8 inet_diag: add cgroup id attribute
This patch adds cgroup v2 ID to common inet diag message attributes.
Cgroup v2 ID is kernfs ID (ino or ino+gen). This attribute allows filter
inet diag output by cgroup ID obtained by name_to_handle_at() syscall.
When net_cls or net_prio cgroup is activated this ID is equal to 1 (root
cgroup ID) for newly created sockets.

Some notes about this ID:

1) gets initialized in socket() syscall
2) incoming socket gets ID from listening socket
   (not during accept() syscall)
3) not changed when process get moved to another cgroup
4) can point to deleted cgroup (refcounting)

v2:
  - use CONFIG_SOCK_CGROUP_DATA instead if CONFIG_CGROUPS

v3:
  - fix attr size by using nla_total_size_64bit() (Eric Dumazet)
  - more detailed commit message (Konstantin Khlebnikov)

Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Acked-By: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:54:01 -07:00
Karsten Graul
41a211d862 net/smc: remove obsolete link state DELETING
The connection layer in af_smc.c is now using the new LLC flow
framework, which made the link state DELETING obsolete. Remove the state
and the respective helpers.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:44:34 -07:00
Karsten Graul
42d18acce9 net/smc: remove handling of CONFIRM_RKEY_CONTINUE
The new SMC-R multiple link support will support a maximum of 3 links,
and one CONFIRM_RKEY LLC message can transport 3 rkeys of an rmb buffer.
There is no need for the LLC message type CONFIRM_RKEY_CONTINUE which is
needed when more than 3 rkeys per rmb buffer needs to be exchanged.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:44:34 -07:00
Karsten Graul
218b24fe38 net/smc: adapt SMC remote DELETE_RKEY processing to use the LLC flow
Use the LLC flow framework for the processing of DELETE_RKEY messages
that were received from the peer.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:44:34 -07:00
Karsten Graul
3bc67e098c net/smc: adapt SMC remote CONFIRM_RKEY processing to use the LLC flow
Use the LLC flow framework for the processing of CONFIRM_RKEY messages
that were received from the peer.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:44:34 -07:00
Karsten Graul
ba21abd22f net/smc: new smc_rtoken_set functions for multiple link support
Introduce smc_rtoken_set() to set the rtoken for a new link to an
existing rmb whose rtoken is given, and smc_rtoken_set2() to set an
rtoken for a new link whose link_id is given.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:44:34 -07:00
Karsten Graul
56e8091c7a net/smc: move the TEST_LINK response processing into event handler
Get rid of the extra function and move the two-liner for the TEST_LINK
response processing into the event handler function.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:44:33 -07:00
Karsten Graul
6d74c3a8a3 net/smc: multiple link support and LLC flow for smc_llc_do_delete_rkey
Adapt smc_llc_do_delete_rkey() to use the LLC flow and support multiple
links when deleting the rkeys for rmb buffers at the peer.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:44:33 -07:00
Karsten Graul
3d88a21b0c net/smc: multiple link support and LLC flow for smc_llc_do_confirm_rkey
Adapt smc_llc_do_confirm_rkey() to use the LLC flow and support the
rkeys of multiple links when the CONFIRM_RKEY LLC message is build.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:44:33 -07:00
Karsten Graul
0fb0b02bd6 net/smc: adapt SMC client code to use the LLC flow
Change the code that processes the SMC client part of connection
establishment to use the LLC flow framework (CONFIRM_LINK request
messages).

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:44:33 -07:00
Karsten Graul
4667bb4aaa net/smc: adapt SMC server code to use the LLC flow
Change the code that processes the SMC server part of connection
establishment to use the LLC flow framework (CONFIRM_LINK response
messages).

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:44:33 -07:00
Karsten Graul
92334cfcb3 net/smc: add logic to evaluate CONFIRM_LINK messages to LLC layer
Introduce smc_llc_eval_conf_link() to evaluate the CONFIRM_LINK message
contents. This implements this logic at the LLC layer. The function will
be used by af_smc.c to process the received LLC layer messages.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:44:33 -07:00
Karsten Graul
81e6e5e70d net/smc: introduce link group type
Add a type field to the link group which reflects the current link group
redundancy state.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:44:33 -07:00
Karsten Graul
a6688d919b net/smc: enqueue all received LLC messages
Introduce smc_llc_enqueue() to enqueue LLC messages, and adapt
smc_llc_rx_handler() to enqueue all received LLC messages.
smc_llc_enqueue() also makes it possible to enqueue LLC messages from
local code.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:44:33 -07:00
Karsten Graul
555da9af82 net/smc: add event-based llc_flow framework
The new framework allows to start specific types of LLC control flows,
protects active flows and makes it possible to wait for flows to finish
before starting a new flow.
This mechanism is used for the LLC control layer to model flows like
'add link' or 'delete link' which need to send/receive several LLC
messages and are not allowed to get interrupted by the wrong type of
messages.
'Add link' or 'Delete link' messages arriving in the middle of a flow
are delayed and processed when the current flow finished.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:44:33 -07:00
Paolo Abeni
ac2b47fb92 mptcp: fix uninitialized value access
tcp_v{4,6}_syn_recv_sock() set 'own_req' only when returning
a not NULL 'child', let's check 'own_req' only if child is
available to avoid an - unharmful - UBSAN splat.

v1 -> v2:
 - reference the correct hash

Fixes: 4c8941de78 ("mptcp: avoid flipping mp_capable field in syn_recv_sock()")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:34:07 -07:00
Paolo Abeni
a77895dbc0 mptcp: initialize the data_fin field for mpc packets
When parsing MPC+data packets we set the dss field, so
we must also initialize the data_fin, or we can find stray
value there.

Fixes: 9a19371bf0 ("mptcp: fix data_fin handing in RX path")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:23:23 -07:00
Paolo Abeni
5a91e32b40 mptcp: fix 'use_ack' option access.
The mentioned RX option field is initialized only for DSS
packet, we must access it only if 'dss' is set too, or
the subflow will end-up in a bad status, leading to
RFC violations.

Fixes: d22f4988ff ("mptcp: process MP_CAPABLE data option")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:23:22 -07:00
Paolo Abeni
d6085fe19b mptcp: avoid a WARN on bad input.
Syzcaller has found a way to trigger the WARN_ON_ONCE condition
in check_fully_established().

The root cause is a legit fallback to TCP scenario, so replace
the WARN with a plain message on a more strict condition.

Fixes: f296234c98 ("mptcp: Add handling of incoming MP_JOIN requests")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:23:22 -07:00
Paolo Abeni
cfde141ea3 mptcp: move option parsing into mptcp_incoming_options()
The mptcp_options_received structure carries several per
packet flags (mp_capable, mp_join, etc.). Such fields must
be cleared on each packet, even on dropped ones or packet
not carrying any MPTCP options, but the current mptcp
code clears them only on TCP option reset.

On several races/corner cases we end-up with stray bits in
incoming options, leading to WARN_ON splats. e.g.:

[  171.164906] Bad mapping: ssn=32714 map_seq=1 map_data_len=32713
[  171.165006] WARNING: CPU: 1 PID: 5026 at net/mptcp/subflow.c:533 warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
[  171.167632] Modules linked in: ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel geneve ip6_udp_tunnel udp_tunnel macsec macvtap tap ipvlan macvlan 8021q garp mrp xfrm_interface veth netdevsim nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun binfmt_misc intel_rapl_msr intel_rapl_common rfkill kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ip_tables xfs libcrc32c crc32c_intel serio_raw virtio_console ata_generic virtio_blk virtio_net net_failover failover ata_piix libata
[  171.199464] CPU: 1 PID: 5026 Comm: repro Not tainted 5.7.0-rc1.mptcp_f227fdf5d388+ #95
[  171.200886] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
[  171.202546] RIP: 0010:warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
[  171.206537] Code: c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 1d 8b 55 3c 44 89 e6 48 c7 c7 20 51 13 95 e8 37 8b 22 fe <0f> 0b 48 83 c4 08 5b 5d 41 5c c3 89 4c 24 04 e8 db d6 94 fe 8b 4c
[  171.220473] RSP: 0018:ffffc90000150560 EFLAGS: 00010282
[  171.221639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  171.223108] RDX: 0000000000000000 RSI: 0000000000000008 RDI: fffff5200002a09e
[  171.224388] RBP: ffff8880aa6e3c00 R08: 0000000000000001 R09: fffffbfff2ec9955
[  171.225706] R10: ffffffff9764caa7 R11: fffffbfff2ec9954 R12: 0000000000007fca
[  171.227211] R13: ffff8881066f4a7f R14: ffff8880aa6e3c00 R15: 0000000000000020
[  171.228460] FS:  00007f8623719740(0000) GS:ffff88810be00000(0000) knlGS:0000000000000000
[  171.230065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  171.231303] CR2: 00007ffdab190a50 CR3: 00000001038ea006 CR4: 0000000000160ee0
[  171.232586] Call Trace:
[  171.233109]  <IRQ>
[  171.233531] get_mapping_status (linux-mptcp/net/mptcp/subflow.c:691)
[  171.234371] mptcp_subflow_data_available (linux-mptcp/net/mptcp/subflow.c:736 linux-mptcp/net/mptcp/subflow.c:832)
[  171.238181] subflow_state_change (linux-mptcp/net/mptcp/subflow.c:1085 (discriminator 1))
[  171.239066] tcp_fin (linux-mptcp/net/ipv4/tcp_input.c:4217)
[  171.240123] tcp_data_queue (linux-mptcp/./include/linux/compiler.h:199 linux-mptcp/net/ipv4/tcp_input.c:4822)
[  171.245083] tcp_rcv_established (linux-mptcp/./include/linux/skbuff.h:1785 linux-mptcp/./include/net/tcp.h:1774 linux-mptcp/./include/net/tcp.h:1847 linux-mptcp/net/ipv4/tcp_input.c:5238 linux-mptcp/net/ipv4/tcp_input.c:5730)
[  171.254089] tcp_v4_rcv (linux-mptcp/./include/linux/spinlock.h:393 linux-mptcp/net/ipv4/tcp_ipv4.c:2009)
[  171.258969] ip_protocol_deliver_rcu (linux-mptcp/net/ipv4/ip_input.c:204 (discriminator 1))
[  171.260214] ip_local_deliver_finish (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/ipv4/ip_input.c:232)
[  171.261389] ip_local_deliver (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:252)
[  171.265884] ip_rcv (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:539)
[  171.273666] process_backlog (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/core/dev.c:6135)
[  171.275328] net_rx_action (linux-mptcp/net/core/dev.c:6572 linux-mptcp/net/core/dev.c:6640)
[  171.280472] __do_softirq (linux-mptcp/./arch/x86/include/asm/jump_label.h:25 linux-mptcp/./include/linux/jump_label.h:200 linux-mptcp/./include/trace/events/irq.h:142 linux-mptcp/kernel/softirq.c:293)
[  171.281379] do_softirq_own_stack (linux-mptcp/arch/x86/entry/entry_64.S:1083)
[  171.282358]  </IRQ>

We could address the issue clearing explicitly the relevant fields
in several places - tcp_parse_option, tcp_fast_parse_options,
possibly others.

Instead we move the MPTCP option parsing into the already existing
mptcp ingress hook, so that we need to clear the fields in a single
place.

This allows us dropping an MPTCP hook from the TCP code and
removing the quite large mptcp_options_received from the tcp_sock
struct. On the flip side, the MPTCP sockets will traverse the
option space twice (in tcp_parse_option() and in
mptcp_incoming_options(). That looks acceptable: we already
do that for syn and 3rd ack packets, plain TCP socket will
benefit from it, and even MPTCP sockets will experience better
code locality, reducing the jumps between TCP and MPTCP code.

v1 -> v2:
 - rebased on current '-net' tree

Fixes: 648ef4b886 ("mptcp: Implement MPTCP receive path")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:23:22 -07:00
Paolo Abeni
263e1201a2 mptcp: consolidate synack processing.
Currently the MPTCP code uses 2 hooks to process syn-ack
packets, mptcp_rcv_synsent() and the sk_rx_dst_set()
callback.

We can drop the first, moving the relevant code into the
latter, reducing the hooking into the TCP code. This is
also needed by the next patch.

v1 -> v2:
 - use local tcp sock ptr instead of casting the sk variable
   several times - DaveM

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:23:22 -07:00
Yunjian Wang
99b2292ba2 net: caif: Fix use correct return type for ndo_start_xmit()
The method ndo_start_xmit() returns a value of type netdev_tx_t. Fix
the ndo function to use the correct type.

Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:13:07 -07:00
Jakub Sitnicki
64d85290d7 bpf: Allow bpf_map_lookup_elem for SOCKMAP and SOCKHASH
White-list map lookup for SOCKMAP/SOCKHASH from BPF. Lookup returns a
pointer to a full socket and acquires a reference if necessary.

To support it we need to extend the verifier to know that:

 (1) register storing the lookup result holds a pointer to socket, if
     lookup was done on SOCKMAP/SOCKHASH, and that

 (2) map lookup on SOCKMAP/SOCKHASH is a reference acquiring operation,
     which needs a corresponding reference release with bpf_sk_release.

On sock_map side, lookup handlers exposed via bpf_map_ops now bump
sk_refcnt if socket is reference counted. In turn, bpf_sk_select_reuseport,
the only in-kernel user of SOCKMAP/SOCKHASH ops->map_lookup_elem, was
updated to release the reference.

Sockets fetched from a map can be used in the same way as ones returned by
BPF socket lookup helpers, such as bpf_sk_lookup_tcp. In particular, they
can be used with bpf_sk_assign to direct packets toward a socket on TC
ingress path.

Suggested-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200429181154.479310-2-jakub@cloudflare.com
2020-04-29 23:30:59 +02:00
David S. Miller
323e395f19 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for nf-next:

1) Add IPS_HW_OFFLOAD status bit, from Bodong Wang.

2) Remove 128-bit limit on the set element data area, rise it
   to 64 bytes.

3) Report EOPNOTSUPP for unsupported NAT types and flags.

4) Set up nft_nat flags from the control plane path.

5) Add helper functions to set up the nf_nat_range2 structure.

6) Add netmap support for nft_nat.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 14:14:44 -07:00
Florian Westphal
42c556fef9 mptcp: replace mptcp_disconnect with a stub
Paolo points out that mptcp_disconnect is bogus:
"lock_sock(sk);
looks suspicious (lock should be already held by the caller)
And call to: tcp_disconnect(sk, flags); too, sk is not a tcp
socket".

->disconnect() gets called from e.g. inet_stream_connect when
one tries to disassociate a connected socket again (to re-connect
without closing the socket first).
MPTCP however uses mptcp_stream_connect, not inet_stream_connect,
for the mptcp-socket connect call.

inet_stream_connect only gets called indirectly, for the tcp socket,
so any ->disconnect() calls end up calling tcp_disconnect for that
tcp subflow sk.

This also explains why syzkaller has not yet reported a problem
here.  So for now replace this with a stub that doesn't do anything.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/14
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:39:49 -07:00
Karsten Graul
00a049cfde net/smc: move llc layer related init and clear into smc_llc.c
Introduce smc_llc_lgr_init() and smc_llc_lgr_clear() to implement all
llc layer specific initialization and cleanup in module smc_llc.c.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:26:33 -07:00
Karsten Graul
faca536008 net/smc: use mutex instead of rwlock_t to protect buffers
The locks for sndbufs and rmbs are never used from atomic context. Using
a mutex for these locks will allow to nest locks with other mutexes.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:26:33 -07:00
Karsten Graul
ef79d439cd net/smc: process llc responses in tasklet context
When llc responses are received then possible waiters for this response
are to be notified. This can be done in tasklet context, without to
use a work in the llc work queue. Move all code that handles llc
responses into smc_llc_rx_response().

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:26:33 -07:00
Karsten Graul
6c8968c421 net/smc: use worker to process incoming llc messages
Incoming llc messages are processed in irq tasklet context, and
a worker is used to send outgoing messages. The worker is needed
because getting a send buffer could result in a wait for a free buffer.

To make sure all incoming llc messages are processed in a serialized way
introduce an event queue and create a new queue entry for each message
which is queued to this event queue. A new worker processes the event
queue entries in order.
And remove the use of a separate worker to send outgoing llc messages
because the messages are processed in worker context already.
With this event queue the serialized llc_wq work queue is obsolete,
remove it.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:26:32 -07:00
Karsten Graul
2140ac26f8 net/smc: simplify link deactivation
Cancel the testlink worker during link clear processing and remove the
extra function smc_llc_link_inactive().

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:26:32 -07:00
Karsten Graul
1020e1ef53 net/smc: move testlink work to system work queue
The testlink work waits for a response to the testlink request and
blocks the single threaded llc_wq. This type of work does not have to be
serialized and can be moved to the system work queue.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:26:32 -07:00
Karsten Graul
d854fcbfae net/smc: add new link state and related helpers
Before a link can be reused it must have been cleared. Lowest current
link state is INACTIVE, which does not mean that the link is already
cleared.
Add a new state UNUSED that is set when the link is cleared and can be
reused.
Add helper smc_llc_usable_link() to find an active link in a link group,
and smc_link_usable() to determine if a link is usable.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:26:32 -07:00
Karsten Graul
e07d31dc16 net/smc: multi-link support for smc_rmb_rtoken_handling()
Extend smc_rmb_rtoken_handling() and smc_rtoken_delete() to support
multiple links.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:26:32 -07:00
Karsten Graul
b9247544c1 net/smc: convert static link ID instances to support multiple links
As a preparation for the support of multiple links remove the usage of
a static link id (SMC_SINGLE_LINK) and allow dynamic link ids.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:26:32 -07:00
Karsten Graul
387707fdf4 net/smc: convert static link ID to dynamic references
As a preparation for the support of multiple links remove the usage of
a static link id (SMC_SINGLE_LINK) and allow dynamic link ids.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:26:32 -07:00
Karsten Graul
026c381fb4 net/smc: introduce link_idx for link group array
The link_id is the index of the link in the array of the link group.
When a link in the array is reused for a new link, a different unique
link_id should be used, otherwise the index in the array could collide
with the previous link at this array position.
Use a new variable link_idx as array index, and make link_id an
increasing unique id value.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:26:32 -07:00
Karsten Graul
f3c1deddb2 net/smc: separate function for link initialization
Move the initialization of a new link into its own function, separate
from smc_lgr_create, to allow more than one link per link group.
Do an extra check if the IB device initialization was successful, and
reset the link state if any error occurs during smcr_link_init().
And rename two existing functions to use the prefix smcr_ to indicate
that they belong to the SMC-R code path.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:26:32 -07:00
Karsten Graul
fdff704dc6 net/smc: rework pnet table to support SMC-R failover
The pnet table stored pnet ids in the smc device structures. When a
device is going down its smc device structure is freed, and when the
device is brought online again it no longer has a pnet id set.
Rework the pnet table implementation to store the device name with their
assigned pnet id and apply the pnet id to devices when they are
registered.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:26:32 -07:00
Arnd Bergmann
c165d57b55 netfilter: nf_osf: avoid passing pointer to local var
gcc-10 points out that a code path exists where a pointer to a stack
variable may be passed back to the caller:

net/netfilter/nfnetlink_osf.c: In function 'nf_osf_hdr_ctx_init':
cc1: warning: function may return address of local variable [-Wreturn-local-addr]
net/netfilter/nfnetlink_osf.c:171:16: note: declared here
  171 |  struct tcphdr _tcph;
      |                ^~~~~

I am not sure whether this can happen in practice, but moving the
variable declaration into the callers avoids the problem.

Fixes: 31a9c29210 ("netfilter: nf_osf: add struct nf_osf_hdr_ctx")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-29 21:17:57 +02:00
YueHaibing
0477e032a9 ila: remove unused inline function ila_addr_is_ila
There's no callers in-tree anymore since commit 84287bb328 ("ila: add
checksum neutral map auto").

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:01:31 -07:00
YueHaibing
ad56623119 net: hsr: remove unused inline functions
There's no callers in-tree anymore.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 12:00:35 -07:00
Yunjian Wang
a54776f2c4 netpoll: Fix use correct return type for ndo_start_xmit()
The method ndo_start_xmit() returns a value of type netdev_tx_t. Fix
the ndo function to use the correct type.

Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 11:47:16 -07:00
Rajkumar Manoharan
60689de46c mac80211: fix memory overlap due to variable length param
As of now HE operation element in bss_conf includes variable length
optional field followed by other HE variable. Though the optional
field never be used, actually it is referring to next member of the
bss_conf structure which is not correct. Fix it by declaring needed
HE operation fields within bss_conf itself.

Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Link: https://lore.kernel.org/r/1587768108-25248-2-git-send-email-rmanohar@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-29 16:21:45 +02:00
Jouni Malinen
08fad438be mac80211: TX legacy rate control for Beacon frames
Use the Beacon frame specific legacy rate configuration, if specified
for AP or mesh, instead of the generic rate mask when selecting the TX
rate for Beacon frames.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200425155713.25687-4-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-29 16:16:47 +02:00
Richard Guy Briggs
a45d88530b netfilter: add audit table unregister actions
Audit the action of unregistering ebtables and x_tables.

See: https://github.com/linux-audit/audit-kernel/issues/44

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-04-28 18:11:36 -04:00
Richard Guy Briggs
c4dad0aab3 audit: tidy and extend netfilter_cfg x_tables
NETFILTER_CFG record generation was inconsistent for x_tables and
ebtables configuration changes.  The call was needlessly messy and there
were supporting records missing at times while they were produced when
not requested.  Simplify the logging call into a new audit_log_nfcfg
call.  Honour the audit_enabled setting while more consistently
recording information including supporting records by tidying up dummy
checks.

Add an op= field that indicates the operation being performed (register
or replace).

Here is the enhanced sample record:
  type=NETFILTER_CFG msg=audit(1580905834.919:82970): table=filter family=2 entries=83 op=replace

Generate audit NETFILTER_CFG records on ebtables table registration.
Previously this was being done for x_tables registration and replacement
operations and ebtables table replacement only.

See: https://github.com/linux-audit/audit-kernel/issues/25
See: https://github.com/linux-audit/audit-kernel/issues/35
See: https://github.com/linux-audit/audit-kernel/issues/43

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
2020-04-28 17:52:42 -04:00
Mauro Carvalho Chehab
19093313cb docs: networking: convert ipv6.txt to ReST
Not much to be done here:

- add SPDX header;
- add a document title;
- mark a literal as such, in order to avoid a warning;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28 14:40:18 -07:00
Mauro Carvalho Chehab
1cec2cacaa docs: networking: convert ip-sysctl.txt to ReST
- add SPDX header;
- adjust titles and chapters, adding proper markups;
- mark code blocks and literals as such;
- mark lists as such;
- mark tables as such;
- use footnote markup;
- adjust identation, whitespaces and blank lines;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28 14:40:18 -07:00
Mauro Carvalho Chehab
8c49893558 docs: networking: convert gen_stats.txt to ReST
- add SPDX header;
- mark code blocks and literals as such;
- mark tables as such;
- mark lists as such;
- adjust identation, whitespaces and blank lines;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28 14:39:46 -07:00
Mauro Carvalho Chehab
9dfe136126 docs: networking: convert dns_resolver.txt to ReST
- add SPDX header;
- adjust titles and chapters, adding proper markups;
- comment out text-only TOC from html/pdf output;

- mark code blocks and literals as such;

- adjust identation, whitespaces and blank lines;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28 14:39:46 -07:00
Mauro Carvalho Chehab
9a69fb9c21 docs: networking: convert decnet.txt to ReST
- add SPDX header;
- adjust titles and chapters, adding proper markups;
- mark lists as such;
- mark code blocks and literals as such;
- adjust identation, whitespaces and blank lines;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28 14:39:45 -07:00
Mauro Carvalho Chehab
20b943f075 docs: networking: convert ax25.txt to ReST
There isn't much to be done here. Just:

- add SPDX header;
- add a document title.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28 14:38:38 -07:00
Mauro Carvalho Chehab
ff2269f16a docs: networking: convert atm.txt to ReST
There isn't much to be done here. Just:

- add SPDX header;
- add a document title.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28 14:38:38 -07:00
YueHaibing
8999dc8949 net/x25: Fix null-ptr-deref in x25_disconnect
We should check null before do x25_neigh_put in x25_disconnect,
otherwise may cause null-ptr-deref like this:

 #include <sys/socket.h>
 #include <linux/x25.h>

 int main() {
    int sck_x25;
    sck_x25 = socket(AF_X25, SOCK_SEQPACKET, 0);
    close(sck_x25);
    return 0;
 }

BUG: kernel NULL pointer dereference, address: 00000000000000d8
CPU: 0 PID: 4817 Comm: t2 Not tainted 5.7.0-rc3+ #159
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.3-
RIP: 0010:x25_disconnect+0x91/0xe0
Call Trace:
 x25_release+0x18a/0x1b0
 __sock_release+0x3d/0xc0
 sock_close+0x13/0x20
 __fput+0x107/0x270
 ____fput+0x9/0x10
 task_work_run+0x6d/0xb0
 exit_to_usermode_loop+0x102/0x110
 do_syscall_64+0x23c/0x260
 entry_SYSCALL_64_after_hwframe+0x49/0xb3

Reported-by: syzbot+6db548b615e5aeefdce2@syzkaller.appspotmail.com
Fixes: 4becb7ee5b ("net/x25: Fix x25_neigh refcnt leak when x25 disconnect")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28 14:08:59 -07:00
Trond Myklebust
8e2912c7c6 NFSoRDMA Client Fixes for Linux 5.7
Bugfixes:
 - Restore wake-up-all to rpcrdma_cm_event_handler()
   - Otherwise the client won't respond to server disconnect requests
 - Fix tracepoint use-after-free race
 - Fix usage of xdr_stream_encode_item_{present, absent}
   - These functions return a size on success, and not 0
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl6jVawACgkQ18tUv7Cl
 QOvklhAAmg0MImljNfQ6FmMHQokxtLlOuw7P8HQsARrwxO+cJjSx1vriCI0EyE+T
 yB386Bd+tGUurDZf6k+TjctQlVNSfz7Pv2gJTj/rQEvZFM2AopOBjbovKnNOnOTj
 aPnNOkshZzxZk9JYh9RN3zmyZHtlldimnvLloUrO3nZ63iUGuF1nBkQ9TWuG/CSz
 XuE1bUJgsJkdG07+I1yLrn9MAOdgzuJB2TJ5hL2GjN0BIrJ3jzweXVYXrLFikZUD
 vRTyf1K/zcDOVKPn/Aw4NSPNeTBlsP0Ain9zY7C2cuEXasu1gbEXYmmq/qvrF6kt
 N9TIDqqDZgDRA4Z3nfCJZOXyNwZ3nyEx0+S+8lK58w6khXRYiD21bxrA1eSB3aIS
 LzqJIoto/2/cl8DjNbuyvB7znjlthxYPcOo6cCeJOQ7jPWi6IDDZ6t9mwnXhSU3C
 +GdUyulWJPfx9hj6HlDbrXy6ESlCIFYobHGKVqYu7ZvaIxVVSdlfAIR64PsD7edX
 m3NzNcH1PJNKmTFvyyQDuMgwxauVLaAAs2A0Vvnzt8Vq8uSxEiySVUXOjH4+DnNm
 NNfDXGq7yQj9NqAsqlCe+zkcBT7Wy7ds+y05tipWD92Lu2mvOE+3Lht5H5GnIyWl
 1Lq/eUmrBBlxYRK0ir/RRI3aBYvuO4AK/g81ZzGQZAiWAL319MI=
 =jre3
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-5.7-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

NFSoRDMA Client Fixes for Linux 5.7

Bugfixes:
- Restore wake-up-all to rpcrdma_cm_event_handler()
  - Otherwise the client won't respond to server disconnect requests
- Fix tracepoint use-after-free race
- Fix usage of xdr_stream_encode_item_{present, absent}
  - These functions return a size on success, and not 0

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-28 15:59:03 -04:00
NeilBrown
7c4310ff56 SUNRPC: defer slow parts of rpc_free_client() to a workqueue.
The rpciod workqueue is on the write-out path for freeing dirty memory,
so it is important that it never block waiting for memory to be
allocated - this can lead to a deadlock.

rpc_execute() - which is often called by an rpciod work item - calls
rcp_task_release_client() which can lead to rpc_free_client().

rpc_free_client() makes two calls which could potentially block wating
for memory allocation.

rpc_clnt_debugfs_unregister() calls into debugfs and will block while
any of the debugfs files are being accessed.  In particular it can block
while any of the 'open' methods are being called and all of these use
malloc for one thing or another.  So this can deadlock if the memory
allocation waits for NFS to complete some writes via rpciod.

rpc_clnt_remove_pipedir() can take the inode_lock() and while it isn't
obvious that memory allocations can happen while the lock it held, it is
safer to assume they might and to not let rpciod call
rpc_clnt_remove_pipedir().

So this patch moves these two calls (together with the final kfree() and
rpciod_down()) into a work-item to be run from the system work-queue.
rpciod can continue its important work, and the final stages of the free
can happen whenever they happen.

I have seen this deadlock on a 4.12 based kernel where debugfs used
synchronize_srcu() when removing objects.  synchronize_srcu() requires a
workqueue and there were no free workther threads and none could be
allocated.  While debugsfs no longer uses SRCU, I believe the deadlock
is still possible.

Signed-off-by: NeilBrown <neilb@suse.de>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-28 15:58:38 -04:00
Roopa Prabhu
4f80116d3d net: ipv4: add sysctl for nexthop api compatibility mode
Current route nexthop API maintains user space compatibility
with old route API by default. Dumps and netlink notifications
support both new and old API format. In systems which have
moved to the new API, this compatibility mode cancels some
of the performance benefits provided by the new nexthop API.

This patch adds new sysctl nexthop_compat_mode which is on
by default but provides the ability to turn off compatibility
mode allowing systems to run entirely with the new routing
API. Old route API behaviour and support is not modified by this
sysctl.

Uses a single sysctl to cover both ipv4 and ipv6 following
other sysctls. Covers dumps and delete notifications as
suggested by David Ahern.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28 12:50:37 -07:00
Roopa Prabhu
11dd74b338 net: ipv6: new arg skip_notify to ip6_rt_del
Used in subsequent work to skip route delete
notifications on nexthop deletes.

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28 12:50:37 -07:00
Daniel Borkmann
0b54142e4b Merge branch 'work.sysctl' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull in Christoph Hellwig's series that changes the sysctl's ->proc_handler
methods to take kernel pointers instead. It gets rid of the set_fs address
space overrides used by BPF. As per discussion, pull in the feature branch
into bpf-next as it relates to BPF sysctl progs.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200427071508.GV23230@ZenIV.linux.org.uk/T/
2020-04-28 21:23:38 +02:00
Alain Michaud
4364f2e91f Bluetooth: allow scatternet connections if supported.
This change allows scatternet connections to be created if the
controller reports support and the HCI_QUIRK_VALID_LE_STATES indicates
that the reported LE states can be trusted.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-04-28 11:49:12 +02:00
Sabrina Dubroca
26333c37fc xfrm: add IPv6 support for espintcp
This extends espintcp to support IPv6, building on the existing code
and the new UDPv6 encapsulation support. Most of the code is either
reused directly (stream parser, ULP) or very similar to the IPv4
variant (net/ipv6/esp6.c changes).

The separation of config options for IPv4 and IPv6 espintcp requires a
bit of Kconfig gymnastics to enable the core code.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-04-28 11:28:36 +02:00
Sabrina Dubroca
0146dca70b xfrm: add support for UDPv6 encapsulation of ESP
This patch adds support for encapsulation of ESP over UDPv6. The code
is very similar to the IPv4 encapsulation implementation, and allows
to easily add espintcp on IPv6 as a follow-up.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-04-28 11:28:36 +02:00
Pablo Neira Ayuso
3ff7ddb135 netfilter: nft_nat: add netmap support
This patch allows you to NAT the network address prefix onto another
network address prefix, a.k.a. netmapping.

Userspace must specify the NF_NAT_RANGE_NETMAP flag and the prefix
address through the NFTA_NAT_REG_ADDR_MIN and NFTA_NAT_REG_ADDR_MAX
netlink attributes.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-28 00:53:54 +02:00
Pablo Neira Ayuso
acd766e31b netfilter: nft_nat: add helper function to set up NAT address and protocol
This patch add nft_nat_setup_addr() and nft_nat_setup_proto() to set up
the NAT mangling.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-28 00:53:54 +02:00
Pablo Neira Ayuso
4566aa4400 netfilter: nft_nat: set flags from initialization path
This patch sets the NAT flags from the control plane path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-28 00:53:54 +02:00
Pablo Neira Ayuso
0d7c83463f netfilter: nft_nat: return EOPNOTSUPP if type or flags are not supported
Instead of EINVAL which should be used for malformed netlink messages.

Fixes: eb31628e37 ("netfilter: nf_tables: Add support for IPv6 NAT")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-28 00:53:45 +02:00
Xiyu Yang
18f02ad19e bpf: Fix sk_psock refcnt leak when receiving message
tcp_bpf_recvmsg() invokes sk_psock_get(), which returns a reference of
the specified sk_psock object to "psock" with increased refcnt.

When tcp_bpf_recvmsg() returns, local variable "psock" becomes invalid,
so the refcount should be decreased to keep refcount balanced.

The reference counting issue happens in several exception handling paths
of tcp_bpf_recvmsg(). When those error scenarios occur such as "flags"
includes MSG_ERRQUEUE, the function forgets to decrease the refcnt
increased by sk_psock_get(), causing a refcnt leak.

Fix this issue by calling sk_psock_put() or pulling up the error queue
read handling when those error scenarios occur.

Fixes: e7a5f1f1cd ("bpf/sockmap: Read psock ingress_msg before sk_receive_queue")
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/1587872115-42805-1-git-send-email-xiyuyang19@fudan.edu.cn
2020-04-27 23:04:05 +02:00
Pablo Neira Ayuso
fdb9c405e3 netfilter: nf_tables: allow up to 64 bytes in the set element data area
So far, the set elements could store up to 128-bits in the data area.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-27 22:20:09 +02:00
David S. Miller
a21643cf56 This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
 
  - fix spelling error, by Sven Eckelmann
 
  - drop unneeded types.h include, by Sven Eckelmann
 
  - change random number generation to prandom_u32_max(),
    by Sven Eckelmann
 
  - remove unused function batadv_arp_change_timeout(), by Yue Haibing
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl6m9HEWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoeKyEADJXMFP14aDblfXDyA/JwEJO7E8
 R1YRjbCz7t5gqJbGQGfqGZo0TgpSqvYXVJZEL8ARXR9hIpJEaUiVaMBOOEDTuJj8
 kBG/LupxUYpLPUXWobJRFwhAn+Q8z69gyNacF+/upuYaAJCJSLjOitcttjAeVdD+
 w3M6Eyw6RGpKIa17E0wG4cwRd2vvJ7SY/Sv4i5G7eiD7PmfsaVYXvyowjk0OZryP
 RhytuKx/+e/hv1fS66aW0AdiuLjH/zKDNTDEb+QXePrEyT9DutRu33l1EFgXPfmP
 VsTWxCYPNzj3CREfGdYh6ETmZeYMbkGM4Rt7yETXO/lUI5bhyPxzPBlZdYmP4XEN
 WZ2ggFvY01UagYmqV9GNr71IWkmCoI1vfphoEBktv41sIQW26g7ByrQRzU7zhNOp
 W9fNIzVt1Y6PiIcZRyVoZ3VwEn9db7T1uX669vGwaxFigL9/r2dcrj+pQfLN5kj6
 qigz/XMjawq5DloyyunSiu3qJ9iTsv3Dh0MeVBhBRdFeZEVht7whHAPqWhxxgpL/
 1FuEPqHSCVmc10IDR9VChVm1Xq9GaIVGdekW2hMNl6ry+T1f6b5akjWSa46r1mWO
 byIK3maK6s52Bwa9d4W2fGzKTbxBG9U2PYWP4IlTyr2hRDQbcL+RK6FJfZ068h/d
 I26onCKv/mURd6OUxw==
 =5bKt
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20200427' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - fix spelling error, by Sven Eckelmann

 - drop unneeded types.h include, by Sven Eckelmann

 - change random number generation to prandom_u32_max(),
   by Sven Eckelmann

 - remove unused function batadv_arp_change_timeout(), by Yue Haibing
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 13:04:18 -07:00
David S. Miller
37255e7a8f Here are some batman-adv bugfixes:
- fix random number generation in network coding, by George Spelvin
 
  - fix reference counter leaks, by Xiyu Yang (3 patches)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl6m8zEWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoc4dD/9rKLN853rKJe30QkOBJeinSJOO
 /aEgCbsOVxgLCfvZxdiwSDQRTPFxChXcNZri6cuHu4h7T9ZYcBEKQNRy4D//h9YH
 9wx2+WP7rTKtfScVqOnBf+03fIYzOS9qQnpoA/6id3YIqg5AAhCTB2n5CS1/ILI4
 6WrDfsJH4jgol4Ud+lVAFSokjElxVpQ+5AlcKVTBGli1f4aReu7g3xPSZVIBuikz
 axsO3UEfzLNU+k+Y3Y+/97HWdaEFelIkLoymZSSOxJ/BRX2Dz8M/eUyy/p5iVzb2
 gSF46estgvJodVYrwUEb3UtSSpHP0p562h2QxxjqgayObpJQDwKNeEWkmwHb5RcF
 a8l025LlA9+iW25r/RttXiQ0FoLymY1PCdcsY36MZUXAnFn0UyRv/Zt1CUosoImc
 C9+CA7jM9XlXeipXnHQvNryUcqwsJJI1hUF7WmIRuOAdeqcvNPTzLiCYWAGQmZ4U
 V9CdHQseFKgNnDTaoraW5AMJmuP8frXKcdJd4XWJvOgQO6rX53VTEYqmzqBjFjQZ
 dynAIY1w+REmbqYeROgpp02VVXpXw7LVgKRyQIaDJa0cjXfpgmWATyWuewgLgfyK
 9EGp2Adu4z7QoVRG7I9Uq2UwPuDVPbbKohNgQZsJt98itKkqv1afQ8U83ref6LQO
 5/Mbxke3iQ9ftYVqlg==
 =fNPb
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20200427' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are some batman-adv bugfixes:

 - fix random number generation in network coding, by George Spelvin

 - fix reference counter leaks, by Xiyu Yang (3 patches)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 13:02:55 -07:00
Eric Dumazet
df4953e4e9 sch_sfq: validate silly quantum values
syzbot managed to set up sfq so that q->scaled_quantum was zero,
triggering an infinite loop in sfq_dequeue()

More generally, we must only accept quantum between 1 and 2^18 - 7,
meaning scaled_quantum must be in [1, 0x7FFF] range.

Otherwise, we also could have a loop in sfq_dequeue()
if scaled_quantum happens to be 0x8000, since slot->allot
could indefinitely switch between 0 and 0x8000.

Fixes: eeaeb068f1 ("sch_sfq: allow big packets and be fair")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+0251e883fe39e7a0cb0a@syzkaller.appspotmail.com
Cc: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:49:57 -07:00
Horatiu Vultur
419dba8a49 net: bridge: Add checks for enabling the STP.
It is not possible to have the MRP and STP running at the same time on the
bridge, therefore add check when enabling the STP to check if MRP is already
enabled. In that case return error.

Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:40:25 -07:00
Horatiu Vultur
6536993371 bridge: mrp: Integrate MRP into the bridge
To integrate MRP into the bridge, the bridge needs to do the following:
- detect if the MRP frame was received on MRP ring port in that case it would be
  processed otherwise just forward it as usual.
- enable parsing of MRP
- before whenever the bridge was set up, it would set all the ports in
  forwarding state. Add an extra check to not set ports in forwarding state if
  the port is an MRP ring port. The reason of this change is that if the MRP
  instance initially sets the port in blocked state by setting the bridge up it
  would overwrite this setting.

Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:40:25 -07:00
Horatiu Vultur
4d02b8f075 bridge: mrp: Implement netlink interface to configure MRP
Implement netlink interface to configure MRP. The implementation
will do sanity checks over the attributes and then eventually call the MRP
interface.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:40:25 -07:00
Horatiu Vultur
9a9f26e8f7 bridge: mrp: Connect MRP API with the switchdev API
Implement the MRP API.

In case the HW can't generate MRP Test frames then the SW will try to generate
the frames. In case that also the SW will fail in generating the frames then a
error is return to the userspace. The userspace is responsible to generate all
the other MRP frames regardless if the test frames are generated by HW or SW.

The forwarding/termination of MRP frames is happening in the kernel and is done
by the MRP instance. The userspace application doesn't do the forwarding.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:40:25 -07:00
Horatiu Vultur
fadd409136 bridge: switchdev: mrp: Implement MRP API for switchdev
Implement the MRP api for switchdev.
These functions will just eventually call the switchdev functions:
switchdev_port_obj_add/del and switchdev_port_attr_set.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:40:25 -07:00
Horatiu Vultur
2f1a11ae11 bridge: mrp: Add MRP interface.
Define the MRP interface.
This interface is used by the netlink to update the MRP instances and by the MRP
to make the calls to switchdev to offload it to HW.

It defines an MRP instance 'struct br_mrp' which is a list of MRP instances.
Which will be part of the 'struct net_bridge'. Each instance has 2 ring ports,
a bridge and an ID.

In case the HW can't generate MRP Test frames then the SW will generate those.

br_mrp_add - adds a new MRP instance.

br_mrp_del - deletes an existing MRP instance. Each instance has an ID(ring_id).

br_mrp_set_port_state - changes the port state. The port can be in forwarding
  state, which means that the frames can pass through or in blocked state which
  means that the frames can't pass through except MRP frames. This will
  eventually call the switchdev API to notify the HW. This information is used
  also by the SW bridge to know how to forward frames in case the HW doesn't
  have this capability.

br_mrp_set_port_role - a port role can be primary or secondary. This
  information is required to be pushed to HW in case the HW can generate
  MRP_Test frames.  Because the MRP_Test frames contains a file with this
  information. Otherwise the HW will not be able to generate the frames
  correctly.

br_mrp_set_ring_state - a ring can be in state open or closed. State open means
  that the mrp port stopped receiving MRP_Test frames, while closed means that
  the mrp port received MRP_Test frames. Similar with br_mrp_port_role, this
  information is pushed in HW because the MRP_Test frames contain this
  information.

br_mrp_set_ring_role - a ring can have the following roles MRM or MRC. For the
  role MRM it is expected that the HW can terminate the MRP frames, notify the
  SW that it stopped receiving MRP_Test frames and trapp all the other MRP
  frames.  While for MRC mode it is expected that the HW can forward the MRP
  frames only between the MRP ports and copy MRP_Topology frames to CPU. In
  case the HW doesn't support a role it needs to return an error code different
  than -EOPNOTSUPP.

br_mrp_start_test - this starts/stops the generation of MRP_Test frames. To stop
  the generation of frames the interval needs to have a value of 0. In this case
  the userspace needs to know if the HW supports this or not. Not to have
  duplicate frames(generated by HW and SW). Because if the HW supports this then
  the SW will not generate anymore frames and will expect that the HW will
  notify when it stopped receiving MRP frames using the function
  br_mrp_port_open.

br_mrp_port_open - this function is used by drivers to notify the userspace via
  a netlink callback that one of the ports stopped receiving MRP_Test frames.
  This function is called only when the node has the role MRM. It is not
  supposed to be called from userspace.

br_mrp_port_switchdev_add - this corresponds to the function br_mrp_add,
  and will notify the HW that a MRP instance is added. The function gets
  as parameter the MRP instance.

br_mrp_port_switchdev_del - this corresponds to the function br_mrp_del,
  and will notify the HW that a MRP instance is removed. The function
  gets as parameter the ID of the MRP instance that is removed.

br_mrp_port_switchdev_set_state - this corresponds to the function
  br_mrp_set_port_state. It would notify the HW if it should block or not
  non-MRP frames.

br_mrp_port_switchdev_set_port - this corresponds to the function
  br_mrp_set_port_role. It would set the port role, primary or secondary.

br_mrp_switchdev_set_role - this corresponds to the function
  br_mrp_set_ring_role and would set one of the role MRM or MRC.

br_mrp_switchdev_set_ring_state - this corresponds to the function
  br_mrp_set_ring_state and would set the ring to be open or closed.

br_mrp_switchdev_send_ring_test - this corresponds to the function
  br_mrp_start_test. This will notify the HW to start or stop generating
  MRP_Test frames. Value 0 for the interval parameter means to stop generating
  the frames.

br_mrp_port_open - this function is used to notify the userspace that the port
  lost the continuity of MRP Test frames.

Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:40:25 -07:00
Horatiu Vultur
3e54442c93 net: bridge: Add port attribute IFLA_BRPORT_MRP_RING_OPEN
This patch adds a new port attribute, IFLA_BRPORT_MRP_RING_OPEN, which allows
to notify the userspace when the port lost the continuite of MRP frames.

This attribute is set by kernel whenever the SW or HW detects that the ring is
being open or closed.

Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:40:25 -07:00
Horatiu Vultur
4b8d7d4c59 bridge: mrp: Extend bridge interface
To integrate MRP into the bridge, first the bridge needs to be aware of ports
that are part of an MRP ring and which rings are on the bridge.
Therefore extend bridge interface with the following:
- add new flag(BR_MPP_AWARE) to the net bridge ports, this bit will be
  set when the port is added to an MRP instance. In this way it knows if
  the frame was received on MRP ring port
- add new flag(BR_MRP_LOST_CONT) to the net bridge ports, this bit will be set
  when the port lost the continuity of MRP Test frames.
- add a list of MRP instances

Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:40:25 -07:00
Horatiu Vultur
2cc974f83f bridge: mrp: Update Kconfig
Add the option BRIDGE_MRP to allow to build in or not MRP support.
The default value is N.

Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:40:25 -07:00
Eric Dumazet
8738c85c72 sch_choke: avoid potential panic in choke_reset()
If choke_init() could not allocate q->tab, we would crash later
in choke_reset().

BUG: KASAN: null-ptr-deref in memset include/linux/string.h:366 [inline]
BUG: KASAN: null-ptr-deref in choke_reset+0x208/0x340 net/sched/sch_choke.c:326
Write of size 8 at addr 0000000000000000 by task syz-executor822/7022

CPU: 1 PID: 7022 Comm: syz-executor822 Not tainted 5.7.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x188/0x20d lib/dump_stack.c:118
 __kasan_report.cold+0x5/0x4d mm/kasan/report.c:515
 kasan_report+0x33/0x50 mm/kasan/common.c:625
 check_memory_region_inline mm/kasan/generic.c:187 [inline]
 check_memory_region+0x141/0x190 mm/kasan/generic.c:193
 memset+0x20/0x40 mm/kasan/common.c:85
 memset include/linux/string.h:366 [inline]
 choke_reset+0x208/0x340 net/sched/sch_choke.c:326
 qdisc_reset+0x6b/0x520 net/sched/sch_generic.c:910
 dev_deactivate_queue.constprop.0+0x13c/0x240 net/sched/sch_generic.c:1138
 netdev_for_each_tx_queue include/linux/netdevice.h:2197 [inline]
 dev_deactivate_many+0xe2/0xba0 net/sched/sch_generic.c:1195
 dev_deactivate+0xf8/0x1c0 net/sched/sch_generic.c:1233
 qdisc_graft+0xd25/0x1120 net/sched/sch_api.c:1051
 tc_modify_qdisc+0xbab/0x1a00 net/sched/sch_api.c:1670
 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5454
 netlink_rcv_skb+0x15a/0x410 net/netlink/af_netlink.c:2469
 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
 netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329
 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 ____sys_sendmsg+0x6bf/0x7e0 net/socket.c:2362
 ___sys_sendmsg+0x100/0x170 net/socket.c:2416
 __sys_sendmsg+0xec/0x1b0 net/socket.c:2449
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295

Fixes: 77e62da6e6 ("sch_choke: drop all packets in queue during reset")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:35:27 -07:00
Eric Dumazet
14695212d4 fq_codel: fix TCA_FQ_CODEL_DROP_BATCH_SIZE sanity checks
My intent was to not let users set a zero drop_batch_size,
it seems I once again messed with min()/max().

Fixes: 9d18562a22 ("fq_codel: add batch ability to fq_codel_drop()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:34:04 -07:00
Xiyu Yang
62b4011fa7 net/tls: Fix sk_psock refcnt leak when in tls_data_ready()
tls_data_ready() invokes sk_psock_get(), which returns a reference of
the specified sk_psock object to "psock" with increased refcnt.

When tls_data_ready() returns, local variable "psock" becomes invalid,
so the refcount should be decreased to keep refcount balanced.

The reference counting issue happens in one exception handling path of
tls_data_ready(). When "psock->ingress_msg" is empty but "psock" is not
NULL, the function forgets to decrease the refcnt increased by
sk_psock_get(), causing a refcnt leak.

Fix this issue by calling sk_psock_put() on all paths when "psock" is
not NULL.

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:22:38 -07:00
Xiyu Yang
4becb7ee5b net/x25: Fix x25_neigh refcnt leak when x25 disconnect
x25_connect() invokes x25_get_neigh(), which returns a reference of the
specified x25_neigh object to "x25->neighbour" with increased refcnt.

When x25 connect success and returns, the reference still be hold by
"x25->neighbour", so the refcount should be decreased in
x25_disconnect() to keep refcount balanced.

The reference counting issue happens in x25_disconnect(), which forgets
to decrease the refcnt increased by x25_get_neigh() in x25_connect(),
causing a refcnt leak.

Fix this issue by calling x25_neigh_put() before x25_disconnect()
returns.

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:20:30 -07:00
Xiyu Yang
095f5614bf net/tls: Fix sk_psock refcnt leak in bpf_exec_tx_verdict()
bpf_exec_tx_verdict() invokes sk_psock_get(), which returns a reference
of the specified sk_psock object to "psock" with increased refcnt.

When bpf_exec_tx_verdict() returns, local variable "psock" becomes
invalid, so the refcount should be decreased to keep refcount balanced.

The reference counting issue happens in one exception handling path of
bpf_exec_tx_verdict(). When "policy" equals to NULL but "psock" is not
NULL, the function forgets to decrease the refcnt increased by
sk_psock_get(), causing a refcnt leak.

Fix this issue by calling sk_psock_put() on this error path before
bpf_exec_tx_verdict() returns.

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:17:16 -07:00
Colin Ian King
7d31180167 net: rtnetlink: remove redundant assignment to variable err
The variable err is being initializeed with a value that is never read
and it is being updated later with a new value. The initialization
is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:14:07 -07:00
Stefano Garzarella
a78d163978 vsock/virtio: fix multiple packet delivery to monitoring devices
In virtio_transport.c, if the virtqueue is full, the transmitting
packet is queued up and it will be sent in the next iteration.
This causes the same packet to be delivered multiple times to
monitoring devices.

We want to continue to deliver packets to monitoring devices before
it is put in the virtqueue, to avoid that replies can appear in the
packet capture before the transmitted packet.

This patch fixes the issue, adding a new flag (tap_delivered) in
struct virtio_vsock_pkt, to check if the packet is already delivered
to monitoring devices.

In vhost/vsock.c, we are splitting packets, so we must set
'tap_delivered' to false when we queue up the same virtio_vsock_pkt
to handle the remaining bytes.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 10:18:01 -07:00
Chuck Lever
0a8e7b7d08 SUNRPC: Revert 241b1f419f ("SUNRPC: Remove xdr_buf_trim()")
I've noticed that when krb5i or krb5p security is in use,
retransmitted requests are missing the server's duplicate reply
cache. The computed checksum on the retransmitted request does not
match the cached checksum, resulting in the server performing the
retransmitted request again instead of returning the cached reply.

The assumptions made when removing xdr_buf_trim() were not correct.
In the send paths, the upper layer has already set the segment
lengths correctly, and shorting the buffer's content is simply a
matter of reducing buf->len.

xdr_buf_trim() is the right answer in the receive/unwrap path on
both the client and the server. The buffer segment lengths have to
be shortened one-by-one.

On the server side in particular, head.iov_len needs to be updated
correctly to enable nfsd_cache_csum() to work correctly. The simple
buf->len computation doesn't do that, and that results in
checksumming stale data in the buffer.

The problem isn't noticed until there's significant instability of
the RPC transport. At that point, the reliability of retransmit
detection on the server becomes crucial.

Fixes: 241b1f419f ("SUNRPC: Remove xdr_buf_trim()")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-27 10:58:30 -04:00
Chuck Lever
a7e429a6fa SUNRPC: Fix GSS privacy computation of auth->au_ralign
When the au_ralign field was added to gss_unwrap_resp_priv, the
wrong calculation was used. Setting au_rslack == au_ralign is
probably correct for kerberos_v1 privacy, but kerberos_v2 privacy
adds additional GSS data after the clear text RPC message.
au_ralign needs to be smaller than au_rslack in that fairly common
case.

When xdr_buf_trim() is restored to gss_unwrap_kerberos_v2(), it does
exactly what I feared it would: it trims off part of the clear text
RPC message. However, that's because rpc_prepare_reply_pages() does
not set up the rq_rcv_buf's tail correctly because au_ralign is too
large.

Fixing the au_ralign computation also corrects the alignment of
rq_rcv_buf->pages so that the client does not have to shift reply
data payloads after they are received.

Fixes: 35e77d21ba ("SUNRPC: Add rpc_auth::au_ralign field")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-27 10:58:30 -04:00
Chuck Lever
31c9590ae4 SUNRPC: Add "@len" parameter to gss_unwrap()
Refactor: This is a pre-requisite to fixing the client-side ralign
computation in gss_unwrap_resp_priv().

The length value is passed in explicitly rather that as the value
of buf->len. This will subsequently allow gss_unwrap_kerberos_v1()
to compute a slack and align value, instead of computing it in
gss_unwrap_resp_priv().

Fixes: 35e77d21ba ("SUNRPC: Add rpc_auth::au_ralign field")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-27 10:58:30 -04:00
Nicolas Dichtel
e62905ae34 xfrm interface: don't take extra reference to netdev
I don't see any reason to do this. Maybe needed before
commit 56c5ee1a58 ("xfrm interface: fix memory leak on creation").

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-04-27 08:50:49 +02:00
Christoph Hellwig
32927393dc sysctl: pass kernel pointers to ->proc_handler
Instead of having all the sysctl handlers deal with user pointers, which
is rather hairy in terms of the BPF interaction, copy the input to and
from  userspace in common code.  This also means that the strings are
always NUL-terminated by the common code, making the API a little bit
safer.

As most handler just pass through the data to one of the common handlers
a lot of the changes are mechnical.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-04-27 02:07:40 -04:00
Guillaume Nault
ea64d8d6c6 netfilter: nat: never update the UDP checksum when it's 0
If the UDP header of a local VXLAN endpoint is NAT-ed, and the VXLAN
device has disabled UDP checksums and enabled Tx checksum offloading,
then the skb passed to udp_manip_pkt() has hdr->check == 0 (outer
checksum disabled) and skb->ip_summed == CHECKSUM_PARTIAL (inner packet
checksum offloaded).

Because of the ->ip_summed value, udp_manip_pkt() tries to update the
outer checksum with the new address and port, leading to an invalid
checksum sent on the wire, as the original null checksum obviously
didn't take the old address and port into account.

So, we can't take ->ip_summed into account in udp_manip_pkt(), as it
might not refer to the checksum we're acting on. Instead, we can base
the decision to update the UDP checksum entirely on the value of
hdr->check, because it's null if and only if checksum is disabled:

  * A fully computed checksum can't be 0, since a 0 checksum is
    represented by the CSUM_MANGLED_0 value instead.

  * A partial checksum can't be 0, since the pseudo-header always adds
    at least one non-zero value (the UDP protocol type 0x11) and adding
    more values to the sum can't make it wrap to 0 as the carry is then
    added to the wrapped number.

  * A disabled checksum uses the special value 0.

The problem seems to be there from day one, although it was probably
not visible before UDP tunnels were implemented.

Fixes: 5b1158e909 ("[NETFILTER]: Add NAT support for nf_conntrack")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-26 23:57:18 +02:00
Bodong Wang
74f99482ea netfilter: nf_conntrack: add IPS_HW_OFFLOAD status bit
This bit indicates that the conntrack entry is offloaded to hardware
flow table. nf_conntrack entry will be tagged with [HW_OFFLOAD] if
it's offload to hardware.

cat /proc/net/nf_conntrack
	ipv4 2 tcp 6 \
	src=1.1.1.17 dst=1.1.1.16 sport=56394 dport=5001 \
	src=1.1.1.16 dst=1.1.1.17 sport=5001 dport=56394 [HW_OFFLOAD] \
	mark=0 zone=0 use=3

Note that HW_OFFLOAD/OFFLOAD/ASSURED are mutually exclusive.

Changelog:

* V1->V2:
- Remove check of lastused from stats. It was meant for cases such
  as removing driver module while traffic still running. Better to
  handle such cases from garbage collector.

Signed-off-by: Bodong Wang <bodong@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-26 23:54:35 +02:00
Tobias Klauser
0a05861f80 xsk: Fix typo in xsk_umem_consume_tx and xsk_generic_xmit comments
s/backpreassure/backpressure/

Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/20200421232927.21082-1-tklauser@distanz.ch
2020-04-26 09:41:31 -07:00
Lorenzo Colitti
6f3f65d80d net: bpf: Allow TC programs to call BPF_FUNC_skb_change_head
This allows TC eBPF programs to modify and forward (redirect) packets
from interfaces without ethernet headers (for example cellular)
to interfaces with (for example ethernet/wifi).

The lack of this appears to simply be an oversight.

Tested:
  in active use in Android R on 4.14+ devices for ipv6
  cellular to wifi tethering offload.

Signed-off-by: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-04-26 09:00:50 -07:00
Stanislav Fomichev
6890896bd7 bpf: Fix missing bpf_base_func_proto in cgroup_base_func_proto for CGROUP_NET=n
linux-next build bot reported compile issue [1] with one of its
configs. It looks like when we have CONFIG_NET=n and
CONFIG_BPF{,_SYSCALL}=y, we are missing the bpf_base_func_proto
definition (from net/core/filter.c) in cgroup_base_func_proto.

I'm reshuffling the code a bit to make it work. The common helpers
are moved into kernel/bpf/helpers.c and the bpf_base_func_proto is
exported from there.
Also, bpf_get_raw_cpu_id goes into kernel/bpf/core.c akin to existing
bpf_user_rnd_u32.

[1] https://lore.kernel.org/linux-next/CAKH8qBsBvKHswiX1nx40LgO+BGeTmb1NX8tiTttt_0uu6T3dCA@mail.gmail.com/T/#mff8b0c083314c68c2e2ef0211cb11bc20dc13c72

Fixes: 0456ea170c ("bpf: Enable more helpers for BPF_PROG_TYPE_CGROUP_{DEVICE,SYSCTL,SOCKOPT}")
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200424235941.58382-1-sdf@google.com
2020-04-26 08:53:13 -07:00
Stanislav Fomichev
0456ea170c bpf: Enable more helpers for BPF_PROG_TYPE_CGROUP_{DEVICE,SYSCTL,SOCKOPT}
Currently the following prog types don't fall back to bpf_base_func_proto()
(instead they have cgroup_base_func_proto which has a limited set of
helpers from bpf_base_func_proto):
* BPF_PROG_TYPE_CGROUP_DEVICE
* BPF_PROG_TYPE_CGROUP_SYSCTL
* BPF_PROG_TYPE_CGROUP_SOCKOPT

I don't see any specific reason why we shouldn't use bpf_base_func_proto(),
every other type of program (except bpf-lirc and, understandably, tracing)
use it, so let's fall back to bpf_base_func_proto for those prog types
as well.

This basically boils down to adding access to the following helpers:
* BPF_FUNC_get_prandom_u32
* BPF_FUNC_get_smp_processor_id
* BPF_FUNC_get_numa_node_id
* BPF_FUNC_tail_call
* BPF_FUNC_ktime_get_ns
* BPF_FUNC_spin_lock (CAP_SYS_ADMIN)
* BPF_FUNC_spin_unlock (CAP_SYS_ADMIN)
* BPF_FUNC_jiffies64 (CAP_SYS_ADMIN)

I've also added bpf_perf_event_output() because it's really handy for
logging and debugging.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200420174610.77494-1-sdf@google.com
2020-04-26 08:40:01 -07:00
Eric Dumazet
52a90612fa net: remove obsolete comment
Commit b656722906 ("net: Increase the size of skb_frag_t")
removed the 16bit limitation of a frag on some 32bit arches.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25 20:49:32 -07:00
Tonghao Zhang
659d4587fe net: openvswitch: use div_u64() for 64-by-32 divisions
Compile the kernel for arm 32 platform, the build warning found.
To fix that, should use div_u64() for divisions.
| net/openvswitch/meter.c:396: undefined reference to `__udivdi3'

[add more commit msg, change reported tag, and use div_u64 instead
of do_div by Tonghao]

Fixes: e57358873b ("net: openvswitch: use u64 for meter bucket")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25 20:48:21 -07:00
Tonghao Zhang
4b36a0dff7 net: openvswitch: suitable access to the dp_meters
To fix the following sparse warning:
| net/openvswitch/meter.c:109:38: sparse: sparse: incorrect type
| in assignment (different address spaces) ...
| net/openvswitch/meter.c:720:45: sparse: sparse: incorrect type
| in argument 1 (different address spaces) ...

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25 20:48:21 -07:00
YueHaibing
df346f1aac dccp: remove unused inline function dccp_set_seqno
There's no callers in-tree since commit 792b48780e ("dccp: Implement
both feature-local and feature-remote Sequence Window feature")

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25 20:42:57 -07:00
Taehee Yoo
f30e472071 hsr: remove unnecessary code in hsr_dev_change_mtu()
In the hsr_dev_change_mtu(), the 'dev' and 'master->dev' pointer are
same. So, the 'master' variable and some code are unnecessary.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25 20:40:57 -07:00
Paolo Abeni
1200832c6e mptcp: fix race in msk status update
Currently subflow_finish_connect() changes unconditionally
any msk socket status other than TCP_ESTABLISHED.

If an unblocking connect() races with close(), we can end-up
triggering:

IPv4: Attempt to release TCP socket in state 1 00000000e32b8b7e

when the msk socket is disposed.

Be sure to enter the established status only from SYN_SENT.

Fixes: c3c123d16c ("net: mptcp: don't hang in mptcp_sendmsg() after TCP fallback")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25 20:38:54 -07:00
Florian Westphal
071c8ed6e8 tcp: mptcp: use mptcp receive buffer space to select rcv window
In MPTCP, the receive window is shared across all subflows, because it
refers to the mptcp-level sequence space.

MPTCP receivers already place incoming packets on the mptcp socket
receive queue and will charge it to the mptcp socket rcvbuf until
userspace consumes the data.

Update __tcp_select_window to use the occupancy of the parent/mptcp
socket instead of the subflow socket in case the tcp socket is part
of a logical mptcp connection.

This commit doesn't change choice of initial window for passive or active
connections.
While it would be possible to change those as well, this adds complexity
(especially when handling MP_JOIN requests).  Furthermore, the MPTCP RFC
specifically says that a MPTCP sender 'MUST NOT use the RCV.WND field
of a TCP segment at the connection level if it does not also carry a DSS
option with a Data ACK field.'

SYN/SYNACK packets do not carry a DSS option with a Data ACK field.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25 20:37:52 -07:00
David S. Miller
d483389678 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Simple overlapping changes to linux/vermagic.h

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25 20:18:53 -07:00
Linus Torvalds
ab51cac00e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix memory leak in netfilter flowtable, from Roi Dayan.

 2) Ref-count leaks in netrom and tipc, from Xiyu Yang.

 3) Fix warning when mptcp socket is never accepted before close, from
    Florian Westphal.

 4) Missed locking in ovs_ct_exit(), from Tonghao Zhang.

 5) Fix large delays during PTP synchornization in cxgb4, from Rahul
    Lakkireddy.

 6) team_mode_get() can hang, from Taehee Yoo.

 7) Need to use kvzalloc() when allocating fw tracer in mlx5 driver,
    from Niklas Schnelle.

 8) Fix handling of bpf XADD on BTF memory, from Jann Horn.

 9) Fix BPF_STX/BPF_B encoding in x86 bpf jit, from Luke Nelson.

10) Missing queue memory release in iwlwifi pcie code, from Johannes
    Berg.

11) Fix NULL deref in macvlan device event, from Taehee Yoo.

12) Initialize lan87xx phy correctly, from Yuiko Oshino.

13) Fix looping between VRF and XFRM lookups, from David Ahern.

14) etf packet scheduler assumes all sockets are full sockets, which is
    not necessarily true. From Eric Dumazet.

15) Fix mptcp data_fin handling in RX path, from Paolo Abeni.

16) fib_select_default() needs to handle nexthop objects, from David
    Ahern.

17) Use GFP_ATOMIC under spinlock in mac80211_hwsim, from Wei Yongjun.

18) vxlan and geneve use wrong nlattr array, from Sabrina Dubroca.

19) Correct rx/tx stats in bcmgenet driver, from Doug Berger.

20) BPF_LDX zero-extension is encoded improperly in x86_32 bpf jit, fix
    from Luke Nelson.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (100 commits)
  selftests/bpf: Fix a couple of broken test_btf cases
  tools/runqslower: Ensure own vmlinux.h is picked up first
  bpf: Make bpf_link_fops static
  bpftool: Respect the -d option in struct_ops cmd
  selftests/bpf: Add test for freplace program with expected_attach_type
  bpf: Propagate expected_attach_type when verifying freplace programs
  bpf: Fix leak in LINK_UPDATE and enforce empty old_prog_fd
  bpf, x86_32: Fix logic error in BPF_LDX zero-extension
  bpf, x86_32: Fix clobbering of dst for BPF_JSET
  bpf, x86_32: Fix incorrect encoding in BPF_LDX zero-extension
  bpf: Fix reStructuredText markup
  net: systemport: suppress warnings on failed Rx SKB allocations
  net: bcmgenet: suppress warnings on failed Rx SKB allocations
  macsec: avoid to set wrong mtu
  mac80211: sta_info: Add lockdep condition for RCU list usage
  mac80211: populate debugfs only after cfg80211 init
  net: bcmgenet: correct per TX/RX ring statistics
  net: meth: remove spurious copyright text
  net: phy: bcm84881: clear settings on link down
  chcr: Fix CPU hard lockup
  ...
2020-04-24 19:17:30 -07:00
Russell King
5c05c1dbb1 net: phylink, dsa: eliminate phylink_fixed_state_cb()
Move the callback into the phylink_config structure, rather than
providing a callback to set this up.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-24 16:45:37 -07:00
Jesper Dangaard Brouer
b70ba69ef1 net: sched: report ndo_setup_tc failures via extack
Help end-users of the 'tc' command to see if the drivers ndo_setup_tc
function call fails. Troubleshooting when this happens is non-trivial
(see full process here[1]), and results in net_device getting assigned
the 'qdisc noop', which will drop all TX packets on the interface.

[1]: https://github.com/xdp-project/xdp-project/blob/master/areas/arm64/board_nxp_ls1088/nxp-board04-troubleshoot-qdisc.org

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Tested-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-24 16:44:54 -07:00
David S. Miller
c651b461b5 Just three changes:
* fix a wrong GFP_KERNEL in hwsim
  * fix the debugfs mess after the mac80211 registration race fix
  * suppress false-positive RCU list lockdep warnings
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl6itGEACgkQB8qZga/f
 l8RZAw/9HA9jyy/ZigEhnDG4dfYV5k92zLCzSY6Mo2EmGB+gy6KCTNyDTIVsKunM
 FA0a3gSOuir/VNGL+rCzYw8epCfJaJUbIbnXz0y5bYWcZKzc9gx+nb4vDKK/sMLN
 d4qKJwH4upkfAOBHo9P3ZdQ2X9YNLLZ9gq17IwUJnLZh4rBUKBQsFbs+TkyCg8bZ
 PmWCm6ZAK28nKUQGfJgxFX3SMz/CuMn86SBX4SjG6iKOnWl0XBI2u61ODRKJZ+A5
 cFoZFAYWn3Dx8z04PFuxgDR93wFtekNfvm5+mlu0CDT3NuohkJ+SO4RhfERkvDU8
 6SCnxmqXjP3SaRalbnwgUZPOGB0L1bSXuJTVvz1kOq2gQMXWbqtKMuhQJTijiTHi
 VPGRId28eH5JyWlb5THj0zP7uIxf05VxaEYszGDgrpvzK/jycxMt3oyq2KYUadCy
 idCRBoYuegMumqN4mwgYuecgTJ6hbRGPuXFUlGUVVcZ0APB4xkP/rWu9WmHsdNhB
 Q1+AfsjebgSv5upR78AqTWNrYyvkayoHYeck2e+4cFW5g1Mm4sE8sPWZdoU8SR73
 LZaEtH6/Tl3sqxgNYcsET+VF0HXReB11T0fjxZDKHnENaSSt1MtJgfXhoQRe/Zyw
 FM3GRRxGFtt2Ic9QOchQF+sVNBKSnYuVvYJhVrIOGZLwif1wSRk=
 =SKNj
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2020-04-24' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just three changes:
 * fix a wrong GFP_KERNEL in hwsim
 * fix the debugfs mess after the mac80211 registration race fix
 * suppress false-positive RCU list lockdep warnings
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-24 13:17:01 -07:00
YueHaibing
e73f94d1b6 batman-adv: remove unused inline function batadv_arp_change_timeout
There's no callers in-tree.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-04-24 15:22:41 +02:00
Johannes Berg
b6b5c42e3b mac80211: fix two missing documentation entries
Add documentation for two struct entries that was missing.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200424123945.6b23a26ab5e7.I664440ab5f33442df8103253bf5b9fe84be8d58c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:40:20 +02:00
Johannes Berg
be689f68d0 cfg80211: reject channels/chandefs with KHz offset >= 1000
This should be covered by the next MHz, make sure that the
numbers are always normalized.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200424120103.12b91ecf75f9.I4bf499d58404283bbfacb517d614a816763bccf2@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:38:20 +02:00
Thomas Pedersen
3b23c184f7 mac80211: add freq_offset to RX status
RX status needs a KHz component, so add freq_offset. We
can reduce the bits for the frequency since 60 GHz isn't
supported.

Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200402011810.22947-5-thomas@adapt-ip.com
[fix commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:43 +02:00
Thomas Pedersen
b6011960f3 mac80211: handle channel frequency offset
cfg80211_chan_def and ieee80211_channel recently gained a
frequency offset component. Handle this where it makes
sense (potentially required by S1G channels).

For IBSS, TDLS, CSA, and ROC we return -EOPNOTSUPP if a
channel with frequency offset is passed, since they may or
may not work. Once someone tests and verifies these
commands work on thos types of channels, we can remove
that error.

join_ocb and join_mesh look harmless because they use a
simple ieee80211_vif_use_channel(), which is using an
already verified channel, so we let those through.

Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200402011810.22947-4-thomas@adapt-ip.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:43 +02:00
Thomas Pedersen
934f4c7dd3 cfg80211: express channels with a KHz component
Some bands (S1G) define channels centered on a non-integer
MHz. Give ieee80211_channel and cfg80211_chan_def a
freq_offset component where the final frequency can be
expressed as:

MHZ_TO_KHZ(chan->center_freq) + chan->freq_offset;

Also provide some helper functions to do the frequency
conversion and test for equality.

Retain the existing interface to frequency and channel
conversion helpers, and expose new ones which handle
frequencies in units of KHz.

Some internal functions (net/wireless/chan.c) pass around
a frequency value. Convert these to units of KHz.

mesh, ibss, wext, etc. are currently ignored.

Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200402011810.22947-3-thomas@adapt-ip.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:43 +02:00
Johannes Berg
dba25b04c6 mac80211: minstrel_ht_assign_best_tp_rates: remove redundant test
We know this pointer isn't NULL and in fact dereferenced it before,
remove the redundant test.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200326150855.adf551928846.Iae9015573d6c350cc1b12a311d6d13d086beec6c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:43 +02:00
Ilan Peer
302ff8b7a2 mac80211: Fail association when AP has no legacy rates
The MLME logic had a workaround that allowed to continue an
association with an AP even if the AP did not provide any basic
rates in its supported rates in the association response, assuming
that the first (non basic) legacy rate could be used as a basic rate.
However, this did not consider the case where the AP (which is
obviously buggy) did not provide any legacy rate.

Fix this by failing the association, as this can result in
an unexpected failure in the low level driver and FW, e.g., in
rate scale logic etc.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200326150855.d70a1450d83f.I6e6ce5efda351a8544c0e7bfeee260fe3360d401@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:43 +02:00
Mordechay Goodstein
0c197f16f7 mac80211: agg-tx: add an option to defer ADDBA transmit
Driver tells mac80211 to sends ADDBA with SSN (starting sequence number)
from the head of the queue, while the transmission of all the frames in the
queue may take a while, which causes the peer to time out. In order to
fix this scenario, add an option to defer ADDBA transmit until queue
is drained.

Signed-off-by: Mordechay Goodstein <mordechay.goodstein@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200326150855.0f27423fec75.If67daab123a27c1cbddef000d6a3f212aa6309ef@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:43 +02:00
Mordechay Goodstein
31d8bb4e07 mac80211: agg-tx: refactor sending addba
We move the actual arming the timer and sending ADDBA to a function
for the use in different places calling the same logic.

Signed-off-by: Mordechay Goodstein <mordechay.goodstein@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200326150855.58a337eb90a1.I75934e6464535fbf43969acc796bc886291e79a5@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:43 +02:00
Ilan Peer
4826e72110 mac80211: Skip entries with HE membership selector
When parsing supported rates IE.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200326150855.ed3e66f8c197.I93aad0e5ddb7ce79f05f8153922acb9aa5076d38@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:43 +02:00
Ilan Peer
2a392596d8 cfg80211: Parse HE membership selector
This extends the support for drivers that rebuilds IEs in the
FW (same as with HT/VHT).

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200326150855.20feaabfb484.I886252639604c8e3e84b8ef97962f1b0e4beec81@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:42 +02:00
Andrei Otcheretianski
a4055e74a2 mac80211: Don't destroy auth data in case of anti-clogging
SAE AP may reject authentication with WLAN_STATUS_ANTI_CLOG_REQUIRED.
As the user space will immediately continue the authentication flow,
there is no need to destroy the authentication data in this case.
This saves unneeded station removal and releasing the channel.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200326150855.7483996157a8.I8040a842874aaf6d209df3fc8a2acb97a0bf508b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:42 +02:00
Shaul Triebitz
d46b4ab870 mac80211: add twt_protected flag to the bss_conf structure
Add a flag to the BSS conf whether the BSS and STA support protected TWT.

Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200326150855.1dcb2d16fa74.I74d7c007dad2601d2e39f54612fe6554dd5ab386@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:42 +02:00
Johannes Berg
9166cc4976 mac80211: implement Operating Mode Notification extended NSS support
Somehow we missed this for a long time, but similar to the extended
NSS support in VHT capabilities, we need to have this in Operating
Mode notification.

Implement it by
 * parsing the 160/80+80 bit there and setting the bandwidth
   appropriately
 * having callers of ieee80211_get_vht_max_nss() pass in the current
   max NSS value as received in the operating mode notification in
   order to modify it appropriately depending on the extended NSS
   bits.

This updates all drivers that use it, i.e. only iwlwifi/mvm.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200326150855.098483728cfa.I4e8c25d3288441759c2793247197229f0696a37d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:42 +02:00
Jouni Malinen
873b1cf611 mac80211: Process multicast RX registration for Action frames
Convert a user space registration for processing multicast Action frames
(NL80211_CMD_REGISTER_FRAME with NL80211_ATTR_RECEIVE_MULTICAST) to a
new enum ieee80211_filter_flags bit FIF_MCAST_ACTION so that drivers can
update their RX filter parameters appropriately, if needed.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200421144815.19175-1-jouni@codeaurora.org
[rename variables to rx_mcast_action_reg indicating action frames only]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:41 +02:00
Johannes Berg
155d7c7338 nl80211: allow client-only BIGTK support
The current NL80211_EXT_FEATURE_BEACON_PROTECTION feature flag
requires both AP and client support, add a new one called
NL80211_EXT_FEATURE_BEACON_PROTECTION_CLIENT that enables only
support in client (and P2P-client) modes.

Link: https://lore.kernel.org/r/20200420140559.6ba704053a5a.Ifeb869fb0b48e52fe0cb9c15572b93ac8a924f8d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:41 +02:00
Johannes Berg
9dba48a6ec cfg80211: support multicast RX registration
For DPP, there's a need to receive multicast action frames,
but many drivers need a special filter configuration for this.

Support announcing from userspace in the management registration
that multicast RX is required, with an extended feature flag if
the driver handles this.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20200417124013.c46238801048.Ib041d437ce0bff28a0c6d5dc915f68f1d8591002@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:41 +02:00
Johannes Berg
6cd536fe62 cfg80211: change internal management frame registration API
Almost all drivers below cfg80211 get the API wrong (except for
cfg80211) and are unable to cope with multiple registrations for
the same frame type, which is valid due to the match filter.
This seems to indicate the API is wrong, and we should maintain
the full information in cfg80211 instead of the drivers.

Change the API to no longer inform the driver about individual
registrations and unregistrations, but rather every time about
the entire state of the entire wiphy and single wdev, whenever
it may have changed. This also simplifies the code in cfg80211
as it no longer has to track exactly what was unregistered and
can free things immediately.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Reviewed-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20200417124300.f47f3828afc8.I7f81ef59c2c5a340d7075fb3c6d0e08e8aeffe07@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:40 +02:00
Jouni Malinen
9eaf183af7 mac80211: Report beacon protection failures to user space
Report received Beacon frames that do not have a valid MME MIC when
beacon protection is enabled. This covers both the cases of no MME in
the received frame and invalid MIC in the MME.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200401142548.6990-2-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:40 +02:00
Jouni Malinen
4d797fce78 cfg80211: Unprotected Beacon frame RX indication
Extend cfg80211_rx_unprot_mlme_mgmt() to cover indication of unprotected
Beacon frames in addition to the previously used Deauthentication and
Disassociation frames. The Beacon frame case is quite similar, but has
couple of exceptions: this is used both with fully unprotected and also
incorrectly protected frames and there is a rate limit on the events to
avoid unnecessary flooding netlink events in case something goes wrong.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200401142548.6990-1-jouni@codeaurora.org
[add missing kernel-doc]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:40 +02:00
Johannes Berg
90e8f58dfc mac80211: fix drv_config_iface_filter() behaviour
There are two bugs with this, first, it shouldn't be called
on an interface that's down, and secondly, it should then be
called when the interface comes up.

Note that the currently only user (iwlwifi) doesn't seem to
care about either of these scenarios.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200417111830.401d82c7a0bf.I5dc7d718816460c2d8d89c7af6c215f9e2b3078f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 11:46:17 +02:00
Johannes Berg
1db364c886 mac80211: mlme: remove duplicate AID bookkeeping
Maintain the connection AID only in sdata->vif.bss_conf.aid, not
also in sdata->u.mgd.aid.

Keep setting that where we set ifmgd->aid before, which has the
side effect of exposing the AID to the driver before the station
entry (AP) is marked associated, in case it needs it then.

Requested-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Tested-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/20200417123802.085d4a322b0c.I2e7a2ceceea8c6880219f9e9ee4d4ac985fd295a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 11:46:11 +02:00
Madhuparna Bhowmik
8ca47eb9f9 mac80211: sta_info: Add lockdep condition for RCU list usage
The function sta_info_get_by_idx() uses RCU list primitive.
It is called with  local->sta_mtx held from mac80211/cfg.c.
Add lockdep expression to avoid any false positive RCU list warnings.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Link: https://lore.kernel.org/r/20200409082906.27427-1-madhuparnabhowmik10@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 11:31:20 +02:00
Johannes Berg
6cb5f3ea46 mac80211: populate debugfs only after cfg80211 init
When fixing the initialization race, we neglected to account for
the fact that debugfs is initialized in wiphy_register(), and
some debugfs things went missing (or rather were rerooted to the
global debugfs root).

Fix this by adding debugfs entries only after wiphy_register().
This requires some changes in the rate control code since it
currently adds debugfs at alloc time, which can no longer be
done after the reordering.

Reported-by: Jouni Malinen <j@w1.fi>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Reported-by: Hauke Mehrtens <hauke@hauke-m.de>
Reported-by: Felix Fietkau <nbd@nbd.name>
Cc: stable@vger.kernel.org
Fixes: 52e04b4ce5 ("mac80211: fix race in ieee80211_register_hw()")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Sumit Garg <sumit.garg@linaro.org>
Link: https://lore.kernel.org/r/20200423111344.0e00d3346f12.Iadc76a03a55093d94391fc672e996a458702875d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 11:30:13 +02:00
Tonghao Zhang
e57358873b net: openvswitch: use u64 for meter bucket
When setting the meter rate to 4+Gbps, there is an
overflow, the meters don't work as expected.

Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: Andy Zhou <azhou@ovn.org>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23 18:26:11 -07:00
Tonghao Zhang
c773500890 net: openvswitch: make EINVAL return value more obvious
Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: Andy Zhou <azhou@ovn.org>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23 18:26:11 -07:00
Tonghao Zhang
a8e387384f net: openvswitch: remove the unnecessary check
Before invoking the ovs_meter_cmd_reply_stats, "meter"
was checked, so don't check it agin in that function.

Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: Andy Zhou <azhou@ovn.org>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23 18:26:11 -07:00
Tonghao Zhang
eb58eebc7f net: openvswitch: set max limitation to meters
Don't allow user to create meter unlimitedly, which may cause
to consume a large amount of kernel memory. The max number
supported is decided by physical memory and 20K meters as default.

Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: Andy Zhou <azhou@ovn.org>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23 18:26:11 -07:00
Tonghao Zhang
c7c4c44c9a net: openvswitch: expand the meters supported number
In kernel datapath of Open vSwitch, there are only 1024
buckets of meter in one datapath. If installing more than
1024 (e.g. 8192) meters, it may lead to the performance drop.
But in some case, for example, Open vSwitch used as edge
gateway, there should be 20K at least, where meters used for
IP address bandwidth limitation.

[Open vSwitch userspace datapath has this issue too.]

For more scalable meter, this patch use meter array instead of
hash tables, and expand/shrink the array when necessary. So we
can install more meters than before in the datapath.
Introducing the struct *dp_meter_instance, it's easy to
expand meter though changing the *ti point in the struct
*dp_meter_table.

Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: Andy Zhou <azhou@ovn.org>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23 18:26:11 -07:00
Xu Wang
3c9143d968 net: sched : Remove unnecessary cast in kfree
Remove unnecassary casts in the argument to kfree.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23 15:50:50 -07:00
Xiyu Yang
f35d12971b net/x25: Fix x25_neigh refcnt leak when receiving frame
x25_lapb_receive_frame() invokes x25_get_neigh(), which returns a
reference of the specified x25_neigh object to "nb" with increased
refcnt.

When x25_lapb_receive_frame() returns, local variable "nb" becomes
invalid, so the refcount should be decreased to keep refcount balanced.

The reference counting issue happens in one path of
x25_lapb_receive_frame(). When pskb_may_pull() returns false, the
function forgets to decrease the refcnt increased by x25_get_neigh(),
causing a refcnt leak.

Fix this issue by calling x25_neigh_put() when pskb_may_pull() returns
false.

Fixes: cb101ed2c3 ("x25: Handle undersized/fragmented skbs")
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23 15:39:39 -07:00
Bo YU
b4e0f9a926 mptcp/pm_netlink.c : add check for nla_put_in/6_addr
Normal there should be checked for nla_put_in6_addr like other
usage in net.

Detected by CoverityScan, CID# 1461639

Fixes: 01cacb00b3 ("mptcp: add netlink-based PM")
Signed-off-by: Bo YU <tsu.yubo@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23 15:38:10 -07:00
Eric Dumazet
7e417a66b8 net: napi: use READ_ONCE()/WRITE_ONCE()
gro_flush_timeout and napi_defer_hard_irqs can be read
from napi_complete_done() while other cpus write the value,
whithout explicit synchronization.

Use READ_ONCE()/WRITE_ONCE() to annotate the races.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23 12:43:20 -07:00
Eric Dumazet
6f8b12d661 net: napi: add hard irqs deferral feature
Back in commit 3b47d30396 ("net: gro: add a per device gro flush timer")
we added the ability to arm one high resolution timer, that we used
to keep not-complete packets in GRO engine a bit longer, hoping that further
frames might be added to them.

Since then, we added the napi_complete_done() interface, and commit
364b605573 ("net: busy-poll: return busypolling status to drivers")
allowed drivers to avoid re-arming NIC interrupts if we made a promise
that their NAPI poll() handler would be called in the near future.

This infrastructure can be leveraged, thanks to a new device parameter,
which allows to arm the napi hrtimer, instead of re-arming the device
hard IRQ.

We have noticed that on some servers with 32 RX queues or more, the chit-chat
between the NIC and the host caused by IRQ delivery and re-arming could hurt
throughput by ~20% on 100Gbit NIC.

In contrast, hrtimers are using local (percpu) resources and might have lower
cost.

The new tunable, named napi_defer_hard_irqs, is placed in the same hierarchy
than gro_flush_timeout (/sys/class/net/ethX/)

By default, both gro_flush_timeout and napi_defer_hard_irqs are zero.

This patch does not change the prior behavior of gro_flush_timeout
if used alone : NIC hard irqs should be rearmed as before.

One concrete usage can be :

echo 20000 >/sys/class/net/eth1/gro_flush_timeout
echo 10 >/sys/class/net/eth1/napi_defer_hard_irqs

If at least one packet is retired, then we will reset napi counter
to 10 (napi_defer_hard_irqs), ensuring at least 10 periodic scans
of the queue.

On busy queues, this should avoid NIC hard IRQ, while before this patch IRQ
avoidance was only possible if napi->poll() was exhausting its budget
and not call napi_complete_done().

This feature also can be used to work around some non-optimal NIC irq
coalescing strategies.

Having the ability to insert XX usec delays between each napi->poll()
can increase cache efficiency, since we increase batch sizes.

It also keeps serving cpus not idle too long, reducing tail latencies.

Co-developed-by: Luigi Rizzo <lrizzo@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23 12:43:20 -07:00
Alexander Lobakin
e131a56348 net: dsa: add GRO support via gro_cells
gro_cells lib is used by different encapsulating netdevices, such as
geneve, macsec, vxlan etc. to speed up decapsulated traffic processing.
CPU tag is a sort of "encapsulation", and we can use the same mechs to
greatly improve overall DSA performance.
skbs are passed to the GRO layer after removing CPU tags, so we don't
need any new packet offload types as it was firstly proposed by me in
the first GRO-over-DSA variant [1].

The size of struct gro_cells is sizeof(void *), so hot struct
dsa_slave_priv becomes only 4/8 bytes bigger, and all critical fields
remain in one 32-byte cacheline.
The other positive side effect is that drivers for network devices
that can be shipped as CPU ports of DSA-driven switches can now use
napi_gro_frags() to pass skbs to kernel. Packets built that way are
completely non-linear and are likely being dropped without GRO.

This was tested on to-be-mainlined-soon Ethernet driver that uses
napi_gro_frags(), and the overall performance was on par with the
variant from [1], sometimes even better due to minimal overhead.
net.core.gro_normal_batch tuning may help to push it to the limit
on particular setups and platforms.

iperf3 IPoE VLAN NAT TCP forwarding (port1.218 -> port0) setup
on 1.2 GHz MIPS board:

5.7-rc2 baseline:

[ID]  Interval         Transfer     Bitrate        Retr
[ 5]  0.00-120.01 sec  9.00 GBytes  644 Mbits/sec  413  sender
[ 5]  0.00-120.00 sec  8.99 GBytes  644 Mbits/sec       receiver

Iface      RX packets  TX packets
eth0       7097731     7097702
port0      426050      6671829
port1      6671681     425862
port1.218  6671677     425851

With this patch:

[ID]  Interval         Transfer     Bitrate        Retr
[ 5]  0.00-120.01 sec  12.2 GBytes  870 Mbits/sec  122  sender
[ 5]  0.00-120.00 sec  12.2 GBytes  870 Mbits/sec       receiver

Iface      RX packets  TX packets
eth0       9474792     9474777
port0      455200      353288
port1      9019592     455035
port1.218  353144      455024

v2:
 - Add some performance examples in the commit message;
 - No functional changes.

[1] https://lore.kernel.org/netdev/20191230143028.27313-1-alobakin@dlink.ru/

Signed-off-by: Alexander Lobakin <bloodyreaper@yandex.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23 12:32:13 -07:00
Fernando Gont
b75326c201 ipv6: Honor all IPv6 PIO Valid Lifetime values
RFC4862 5.5.3 e) prevents received Router Advertisements from reducing
the Valid Lifetime of configured addresses to less than two hours, thus
preventing hosts from reacting to the information provided by a router
that has positive knowledge that a prefix has become invalid.

This patch makes hosts honor all Valid Lifetime values, as per
draft-gont-6man-slaac-renum-06, Section 4.2. This is meant to help
mitigate the problem discussed in draft-ietf-v6ops-slaac-renum.

Note: Attacks aiming at disabling an advertised prefix via a Valid
Lifetime of 0 are not really more harmful than other attacks
that can be performed via forged RA messages, such as those
aiming at completely disabling a next-hop router via an RA that
advertises a Router Lifetime of 0, or performing a Denial of
Service (DoS) attack by advertising illegitimate prefixes via
forged PIOs.  In scenarios where RA-based attacks are of concern,
proper mitigations such as RA-Guard [RFC6105] [RFC7113] should
be implemented.

Signed-off-by: Fernando Gont <fgont@si6networks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23 12:29:21 -07:00
Linus Torvalds
1ddd873948 Fixes:
- Address several use-after-free and memory leak bugs
 
 - Prevent a backchannel livelock
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJemdwYAAoJEDNqszNvZn+XU1oQAKOm9vypO6w252kXdhFSxAlB
 3tMxXALNDrFP3PXsKCa/sKKMRvkUkx+9pdnTuXDPvffd3ZgyB8DzJilryEtiqT4Y
 JsuoWHg2QyNeKUFGmtZ5AsefPaR8WL/aiYPTi1PUqnq4rNPjAgOGgLUv+LME2jFU
 Yx773d5CNHXDq6zv1Au0128URnQZDy/7URdfgX1FhLA8aQWjiG08fhBEGncXjV/X
 mo3RMCwE2uzNRruW7OJyCehb8d+IKBDZ0LEeZDW/ve4hNtL+Ke5eCEoemYtUN07e
 U3gRMB8Pt+55L+ZFP8KJYOtfRx2SkOTMcbASC2z/WECq5vumGmn4WovSSVJFGIUN
 5WVf8ADM2w3RmTFh11Jl5mZnziGRNY/4hAW7PrR4ZDhJxjdKA+iLLd7571kkCE63
 II6qxw/WV7Yz3T6v4BoOcDf1DOylnS1JXqmPGYia2aAhyFZgRVasOVIkB0meaaFe
 zSKzKsTrir1Ru8/xt5zIgyEQwqATp2rwzkoPuTeQZLOht0fsSIGBpD1ZWXUaMAji
 cfojhd4731cvoxMMGG27IMiHTG6rpKneaZ21Z/7/61P+cjHm/ITOLZzzRvhQMQU7
 wuskRf3KTs+3k4x6P9E0qQU1DcJkPSYGq+JDdh389Plald4MLTAZYjIK+J3X35oL
 QNnUeKzr1YhWWqgchthG
 =Zoup
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.7-rc-1' of git://git.linux-nfs.org/projects/cel/cel-2.6

Pull nfsd fixes from Chuck Lever:
 "The first set of 5.7-rc fixes for NFS server issues.

  These were all unresolved at the time the 5.7 window opened, and
  needed some additional time to ensure they were correctly addressed.
  They are ready now.

  At the moment I know of one more urgent issue regarding the NFS
  server. A fix has been tested and is under review. I expect to send
  one more pull request, containing this fix (which now consists of 3
  patches).

  Fixes:

   - Address several use-after-free and memory leak bugs

   - Prevent a backchannel livelock"

* tag 'nfsd-5.7-rc-1' of git://git.linux-nfs.org/projects/cel/cel-2.6:
  svcrdma: Fix leak of svc_rdma_recv_ctxt objects
  svcrdma: Fix trace point use-after-free race
  SUNRPC: Fix backchannel RPC soft lockups
  SUNRPC/cache: Fix unsafe traverse caused double-free in cache_purge
  nfsd: memory corruption in nfsd4_lock()
2020-04-23 09:33:43 -07:00
Nicolas Dichtel
c95c5f58b3 xfrm interface: fix oops when deleting a x-netns interface
Here is the steps to reproduce the problem:
ip netns add foo
ip netns add bar
ip -n foo link add xfrmi0 type xfrm dev lo if_id 42
ip -n foo link set xfrmi0 netns bar
ip netns del foo
ip netns del bar

Which results to:
[  186.686395] general protection fault, probably for non-canonical address 0x6b6b6b6b6b6b6bd3: 0000 [#1] SMP PTI
[  186.687665] CPU: 7 PID: 232 Comm: kworker/u16:2 Not tainted 5.6.0+ #1
[  186.688430] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[  186.689420] Workqueue: netns cleanup_net
[  186.689903] RIP: 0010:xfrmi_dev_uninit+0x1b/0x4b [xfrm_interface]
[  186.690657] Code: 44 f6 ff ff 31 c0 5b 5d 41 5c 41 5d 41 5e c3 48 8d 8f c0 08 00 00 8b 05 ce 14 00 00 48 8b 97 d0 08 00 00 48 8b 92 c0 0e 00 00 <48> 8b 14 c2 48 8b 02 48 85 c0 74 19 48 39 c1 75 0c 48 8b 87 c0 08
[  186.692838] RSP: 0018:ffffc900003b7d68 EFLAGS: 00010286
[  186.693435] RAX: 000000000000000d RBX: ffff8881b0f31000 RCX: ffff8881b0f318c0
[  186.694334] RDX: 6b6b6b6b6b6b6b6b RSI: 0000000000000246 RDI: ffff8881b0f31000
[  186.695190] RBP: ffffc900003b7df0 R08: ffff888236c07740 R09: 0000000000000040
[  186.696024] R10: ffffffff81fce1b8 R11: 0000000000000002 R12: ffffc900003b7d80
[  186.696859] R13: ffff8881edcc6a40 R14: ffff8881a1b6e780 R15: ffffffff81ed47c8
[  186.697738] FS:  0000000000000000(0000) GS:ffff888237dc0000(0000) knlGS:0000000000000000
[  186.698705] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  186.699408] CR2: 00007f2129e93148 CR3: 0000000001e0a000 CR4: 00000000000006e0
[  186.700221] Call Trace:
[  186.700508]  rollback_registered_many+0x32b/0x3fd
[  186.701058]  ? __rtnl_unlock+0x20/0x3d
[  186.701494]  ? arch_local_irq_save+0x11/0x17
[  186.702012]  unregister_netdevice_many+0x12/0x55
[  186.702594]  default_device_exit_batch+0x12b/0x150
[  186.703160]  ? prepare_to_wait_exclusive+0x60/0x60
[  186.703719]  cleanup_net+0x17d/0x234
[  186.704138]  process_one_work+0x196/0x2e8
[  186.704652]  worker_thread+0x1a4/0x249
[  186.705087]  ? cancel_delayed_work+0x92/0x92
[  186.705620]  kthread+0x105/0x10f
[  186.706000]  ? __kthread_bind_mask+0x57/0x57
[  186.706501]  ret_from_fork+0x35/0x40
[  186.706978] Modules linked in: xfrm_interface nfsv3 nfs_acl auth_rpcgss nfsv4 nfs lockd grace fscache sunrpc button parport_pc parport serio_raw evdev pcspkr loop ext4 crc16 mbcache jbd2 crc32c_generic 8139too ide_cd_mod cdrom ide_gd_mod ata_generic ata_piix libata scsi_mod piix psmouse i2c_piix4 ide_core 8139cp i2c_core mii floppy
[  186.710423] ---[ end trace 463bba18105537e5 ]---

The problem is that x-netns xfrm interface are not removed when the link
netns is removed. This causes later this oops when thoses interfaces are
removed.

Let's add a handler to remove all interfaces related to a netns when this
netns is removed.

Fixes: f203b76d78 ("xfrm: Add virtual xfrm interfaces")
Reported-by: Christophe Gouault <christophe.gouault@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-04-23 08:09:39 +02:00
Xin Long
976eba8ab5 ip_vti: receive ipip packet by calling ip_tunnel_rcv
In Commit dd9ee34440 ("vti4: Fix a ipip packet processing bug in
'IPCOMP' virtual tunnel"), it tries to receive IPIP packets in vti
by calling xfrm_input(). This case happens when a small packet or
frag sent by peer is too small to get compressed.

However, xfrm_input() will still get to the IPCOMP path where skb
sec_path is set, but never dropped while it should have been done
in vti_ipcomp4_protocol.cb_handler(vti_rcv_cb), as it's not an
ipcomp4 packet. This will cause that the packet can never pass
xfrm4_policy_check() in the upper protocol rcv functions.

So this patch is to call ip_tunnel_rcv() to process IPIP packets
instead.

Fixes: dd9ee34440 ("vti4: Fix a ipip packet processing bug in 'IPCOMP' virtual tunnel")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-04-23 08:02:31 +02:00
David Ahern
7c74b0bec9 ipv4: Update fib_select_default to handle nexthop objects
A user reported [0] hitting the WARN_ON in fib_info_nh:

    [ 8633.839816] ------------[ cut here ]------------
    [ 8633.839819] WARNING: CPU: 0 PID: 1719 at include/net/nexthop.h:251 fib_select_path+0x303/0x381
    ...
    [ 8633.839846] RIP: 0010:fib_select_path+0x303/0x381
    ...
    [ 8633.839848] RSP: 0018:ffffb04d407f7d00 EFLAGS: 00010286
    [ 8633.839850] RAX: 0000000000000000 RBX: ffff9460b9897ee8 RCX: 00000000000000fe
    [ 8633.839851] RDX: 0000000000000000 RSI: 00000000ffffffff RDI: 0000000000000000
    [ 8633.839852] RBP: ffff946076049850 R08: 0000000059263a83 R09: ffff9460840e4000
    [ 8633.839853] R10: 0000000000000014 R11: 0000000000000000 R12: ffffb04d407f7dc0
    [ 8633.839854] R13: ffffffffa4ce3240 R14: 0000000000000000 R15: ffff9460b7681f60
    [ 8633.839857] FS:  00007fcac2e02700(0000) GS:ffff9460bdc00000(0000) knlGS:0000000000000000
    [ 8633.839858] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 8633.839859] CR2: 00007f27beb77e28 CR3: 0000000077734000 CR4: 00000000000006f0
    [ 8633.839867] Call Trace:
    [ 8633.839871]  ip_route_output_key_hash_rcu+0x421/0x890
    [ 8633.839873]  ip_route_output_key_hash+0x5e/0x80
    [ 8633.839876]  ip_route_output_flow+0x1a/0x50
    [ 8633.839878]  __ip4_datagram_connect+0x154/0x310
    [ 8633.839880]  ip4_datagram_connect+0x28/0x40
    [ 8633.839882]  __sys_connect+0xd6/0x100
    ...

The WARN_ON is triggered in fib_select_default which is invoked when
there are multiple default routes. Update the function to use
fib_info_nhc and convert the nexthop checks to use fib_nh_common.

Add test case that covers the affected code path.

[0] https://github.com/FRRouting/frr/issues/6089

Fixes: 493ced1ac4 ("ipv4: Allow routes to use nexthop objects")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 19:57:39 -07:00
Salvatore Bonaccorso
c0259664c6 netlabel: Kconfig: Update reference for NetLabel Tools project
The NetLabel Tools project has moved from http://netlabel.sf.net to a
GitHub project. Update to directly refer to the new home for the tools.

Signed-off-by: Salvatore Bonaccorso <carnil@debian.org>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 19:55:01 -07:00
Paolo Abeni
9a19371bf0 mptcp: fix data_fin handing in RX path
The data fin flag is set only via a DSS option, but
mptcp_incoming_options() copies it unconditionally from the
provided RX options.

Since we do not clear all the mptcp sock RX options in a
socket free/alloc cycle, we can end-up with a stray data_fin
value while parsing e.g. MPC packets.

That would lead to mapping data corruption and will trigger
a few WARN_ON() in the RX path.

Instead of adding a costly memset(), fetch the data_fin flag
only for DSS packets - when we always explicitly initialize
such bit at option parsing time.

Fixes: 648ef4b886 ("mptcp: Implement MPTCP receive path")
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 19:46:01 -07:00
Jason Yan
8518307dc2 net: caif: use true,false for bool variables
Fix the following coccicheck warning:

net/caif/caif_dev.c:410:2-13: WARNING: Assignment of 0/1 to bool
variable
net/caif/caif_dev.c:445:2-13: WARNING: Assignment of 0/1 to bool
variable
net/caif/caif_dev.c:145:1-12: WARNING: Assignment of 0/1 to bool
variable
net/caif/caif_dev.c:223:1-12: WARNING: Assignment of 0/1 to bool
variable

Signed-off-by: Jason Yan <yanaijie@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 19:37:48 -07:00
Jere Leppänen
12dfd78e3a sctp: Fix SHUTDOWN CTSN Ack in the peer restart case
When starting shutdown in sctp_sf_do_dupcook_a(), get the value for
SHUTDOWN Cumulative TSN Ack from the new association, which is
reconstructed from the cookie, instead of the old association, which
the peer doesn't have anymore.

Otherwise the SHUTDOWN is either ignored or replied to with an ABORT
by the peer because CTSN Ack doesn't match the peer's Initial TSN.

Fixes: bdf6fa52f0 ("sctp: handle association restarts when the socket is closed.")
Signed-off-by: Jere Leppänen <jere.leppanen@nokia.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 19:27:40 -07:00
Jere Leppänen
145cb2f717 sctp: Fix bundling of SHUTDOWN with COOKIE-ACK
When we start shutdown in sctp_sf_do_dupcook_a(), we want to bundle
the SHUTDOWN with the COOKIE-ACK to ensure that the peer receives them
at the same time and in the correct order. This bundling was broken by
commit 4ff40b8626 ("sctp: set chunk transport correctly when it's a
new asoc"), which assigns a transport for the COOKIE-ACK, but not for
the SHUTDOWN.

Fix this by passing a reference to the COOKIE-ACK chunk as an argument
to sctp_sf_do_9_2_start_shutdown() and onward to
sctp_make_shutdown(). This way the SHUTDOWN chunk is assigned the same
transport as the COOKIE-ACK chunk, which allows them to be bundled.

In sctp_sf_do_9_2_start_shutdown(), the void *arg parameter was
previously unused. Now that we're taking it into use, it must be a
valid pointer to a chunk, or NULL. There is only one call site where
it's not, in sctp_sf_autoclose_timer_expire(). Fix that too.

Fixes: 4ff40b8626 ("sctp: set chunk transport correctly when it's a new asoc")
Signed-off-by: Jere Leppänen <jere.leppanen@nokia.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 19:27:40 -07:00
Vladimir Oltean
72579e14a1 net: dsa: don't fail to probe if we couldn't set the MTU
There is no reason to fail the probing of the switch if the MTU couldn't
be configured correctly (either the switch port itself, or the host
port) for whatever reason. MTU-sized traffic probably won't work, sure,
but we can still probably limp on and support some form of communication
anyway, which the users would probably appreciate more.

Fixes: bfcb813203 ("net: dsa: configure the MTU for switch ports")
Reported-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 19:22:59 -07:00
Eric Dumazet
a1211bf9a7 sched: etf: do not assume all sockets are full blown
skb->sk does not always point to a full blown socket,
we need to use sk_fullsock() before accessing fields which
only make sense on full socket.

BUG: KASAN: use-after-free in report_sock_error+0x286/0x300 net/sched/sch_etf.c:141
Read of size 1 at addr ffff88805eb9b245 by task syz-executor.5/9630

CPU: 1 PID: 9630 Comm: syz-executor.5 Not tainted 5.7.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x188/0x20d lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0xd3/0x315 mm/kasan/report.c:382
 __kasan_report.cold+0x35/0x4d mm/kasan/report.c:511
 kasan_report+0x33/0x50 mm/kasan/common.c:625
 report_sock_error+0x286/0x300 net/sched/sch_etf.c:141
 etf_enqueue_timesortedlist+0x389/0x740 net/sched/sch_etf.c:170
 __dev_xmit_skb net/core/dev.c:3710 [inline]
 __dev_queue_xmit+0x154a/0x30a0 net/core/dev.c:4021
 neigh_hh_output include/net/neighbour.h:499 [inline]
 neigh_output include/net/neighbour.h:508 [inline]
 ip6_finish_output2+0xfb5/0x25b0 net/ipv6/ip6_output.c:117
 __ip6_finish_output+0x442/0xab0 net/ipv6/ip6_output.c:143
 ip6_finish_output+0x34/0x1f0 net/ipv6/ip6_output.c:153
 NF_HOOK_COND include/linux/netfilter.h:296 [inline]
 ip6_output+0x239/0x810 net/ipv6/ip6_output.c:176
 dst_output include/net/dst.h:435 [inline]
 NF_HOOK include/linux/netfilter.h:307 [inline]
 NF_HOOK include/linux/netfilter.h:301 [inline]
 ip6_xmit+0xe1a/0x2090 net/ipv6/ip6_output.c:280
 tcp_v6_send_synack+0x4e7/0x960 net/ipv6/tcp_ipv6.c:521
 tcp_rtx_synack+0x10d/0x1a0 net/ipv4/tcp_output.c:3916
 inet_rtx_syn_ack net/ipv4/inet_connection_sock.c:669 [inline]
 reqsk_timer_handler+0x4c2/0xb40 net/ipv4/inet_connection_sock.c:763
 call_timer_fn+0x1ac/0x780 kernel/time/timer.c:1405
 expire_timers kernel/time/timer.c:1450 [inline]
 __run_timers kernel/time/timer.c:1774 [inline]
 __run_timers kernel/time/timer.c:1741 [inline]
 run_timer_softirq+0x623/0x1600 kernel/time/timer.c:1787
 __do_softirq+0x26c/0x9f7 kernel/softirq.c:292
 invoke_softirq kernel/softirq.c:373 [inline]
 irq_exit+0x192/0x1d0 kernel/softirq.c:413
 exiting_irq arch/x86/include/asm/apic.h:546 [inline]
 smp_apic_timer_interrupt+0x19e/0x600 arch/x86/kernel/apic/apic.c:1140
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:829
 </IRQ>
RIP: 0010:des_encrypt+0x157/0x9c0 lib/crypto/des.c:792
Code: 85 22 06 00 00 41 31 dc 41 8b 4d 04 44 89 e2 41 83 e4 3f 4a 8d 3c a5 60 72 72 88 81 e2 3f 3f 3f 3f 48 89 f8 48 c1 e8 03 31 d9 <0f> b6 34 28 48 89 f8 c1 c9 04 83 e0 07 83 c0 03 40 38 f0 7c 09 40
RSP: 0018:ffffc90003b5f6c0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: 1ffffffff10e4e55 RBX: 00000000d2f846d0 RCX: 00000000d2f846d0
RDX: 0000000012380612 RSI: ffffffff839863ca RDI: ffffffff887272a8
RBP: dffffc0000000000 R08: ffff888091d0a380 R09: 0000000000800081
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000012
R13: ffff8880a8ae8078 R14: 00000000c545c93e R15: 0000000000000006
 cipher_crypt_one crypto/cipher.c:75 [inline]
 crypto_cipher_encrypt_one+0x124/0x210 crypto/cipher.c:82
 crypto_cbcmac_digest_update+0x1b5/0x250 crypto/ccm.c:830
 crypto_shash_update+0xc4/0x120 crypto/shash.c:119
 shash_ahash_update+0xa3/0x110 crypto/shash.c:246
 crypto_ahash_update include/crypto/hash.h:547 [inline]
 hash_sendmsg+0x518/0xad0 crypto/algif_hash.c:102
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 ____sys_sendmsg+0x308/0x7e0 net/socket.c:2362
 ___sys_sendmsg+0x100/0x170 net/socket.c:2416
 __sys_sendmmsg+0x195/0x480 net/socket.c:2506
 __do_sys_sendmmsg net/socket.c:2535 [inline]
 __se_sys_sendmmsg net/socket.c:2532 [inline]
 __x64_sys_sendmmsg+0x99/0x100 net/socket.c:2532
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
 entry_SYSCALL_64_after_hwframe+0x49/0xb3
RIP: 0033:0x45c829
Code: 0d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 db b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f6d9528ec78 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00000000004fc080 RCX: 000000000045c829
RDX: 0000000000000001 RSI: 0000000020002640 RDI: 0000000000000004
RBP: 000000000078bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000000008d7 R14: 00000000004cb7aa R15: 00007f6d9528f6d4

Fixes: 4b15c70753 ("net/sched: Make etf report drops on error_queue")
Fixes: 25db26a913 ("net/sched: Introduce the ETF Qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 19:20:28 -07:00
Manivannan Sadhasivam
dfddb54043 net: qrtr: Add tracepoint support
Add tracepoint support for QRTR with NS as the first candidate. Later on
this can be extended to core QRTR and transport drivers.

The trace_printk() used in NS has been replaced by tracepoints.

Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 12:55:54 -07:00
YueHaibing
540bde5c2c ila: remove unused macro 'ILA_HASH_TABLE_SIZE'
net/ipv6/ila/ila_xlat.c:604:0: warning: macro "ILA_HASH_TABLE_SIZE" is not used [-Wunused-macros]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 12:49:21 -07:00
wenxu
beb97d3a31 net/sched: act_ct: update nf_conn_acct for act_ct SW offload in flowtable
When the act_ct SW offload in flowtable, The counter of the conntrack
entry will never update. So update the nf_conn_acct conuter in act_ct
flowtable software offload.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 12:48:17 -07:00
David Ahern
0c922a4850 xfrm: Always set XFRM_TRANSFORMED in xfrm{4,6}_output_finish
IPSKB_XFRM_TRANSFORMED and IP6SKB_XFRM_TRANSFORMED are skb flags set by
xfrm code to tell other skb handlers that the packet has been passed
through the xfrm output functions. Simplify the code and just always
set them rather than conditionally based on netfilter enabled thus
making the flag available for other users.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 12:32:11 -07:00
Xiyu Yang
efe57fd58e SUNRPC: Remove unreachable error condition
rpc_clnt_test_and_add_xprt() invokes rpc_call_null_helper(), which
return the value of rpc_run_task() to "task". Since rpc_run_task() is
impossible to return an ERR pointer, there is no need to add the
IS_ERR() condition on "task" here. So we need to remove it.

Fixes: 7f55489058 ("SUNRPC: Allow addition of new transports to a struct rpc_clnt")
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-21 19:11:59 -04:00
Jiri Slaby
526f3d96b8 cgroup, netclassid: remove double cond_resched
Commit 018d26fcd1 ("cgroup, netclassid: periodically release file_lock
on classid") added a second cond_resched to write_classid indirectly by
update_classid_task. Remove the one in write_classid.

Signed-off-by: Jiri Slaby <jslaby@suse.cz>
Cc: Dmitry Yakunin <zeil@yandex-team.ru>
Cc: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-21 15:44:30 -07:00
Leon Romanovsky
1c79031f8a drivers: Remove inclusion of vermagic header
Get rid of linux/vermagic.h includes, so that MODULE_ARCH_VERMAGIC from
the arch header arch/x86/include/asm/module.h won't be redefined.

  In file included from ./include/linux/module.h:30,
                   from drivers/net/ethernet/3com/3c515.c:56:
  ./arch/x86/include/asm/module.h:73: warning: "MODULE_ARCH_VERMAGIC"
redefined
     73 | # define MODULE_ARCH_VERMAGIC MODULE_PROC_FAMILY
        |
  In file included from drivers/net/ethernet/3com/3c515.c:25:
  ./include/linux/vermagic.h:28: note: this is the location of the
previous definition
     28 | #define MODULE_ARCH_VERMAGIC ""
        |

Fixes: 6bba2e89a8 ("net/3com: Delete driver and module versions from 3com drivers")
Co-developed-by: Borislav Petkov <bp@suse.de>
Signed-off-by: Borislav Petkov <bp@suse.de>
Acked-by: Shannon Nelson <snelson@pensando.io> # ionic
Acked-by: Sebastian Reichel <sre@kernel.org> # power
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-21 13:27:37 -07:00
David S. Miller
76fc6a9a9a Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) flow_block_cb memleak in nf_flow_table_offload_del_cb(), from Roi Dayan.

2) Fix error path handling in nf_nat_inet_register_fn(), from Hillf Danton.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-21 11:50:31 -07:00
Xiyu Yang
6f91a3f7af batman-adv: Fix refcnt leak in batadv_v_ogm_process
batadv_v_ogm_process() invokes batadv_hardif_neigh_get(), which returns
a reference of the neighbor object to "hardif_neigh" with increased
refcount.

When batadv_v_ogm_process() returns, "hardif_neigh" becomes invalid, so
the refcount should be decreased to keep refcount balanced.

The reference counting issue happens in one exception handling paths of
batadv_v_ogm_process(). When batadv_v_ogm_orig_get() fails to get the
orig node and returns NULL, the refcnt increased by
batadv_hardif_neigh_get() is not decreased, causing a refcnt leak.

Fix this issue by jumping to "out" label when batadv_v_ogm_orig_get()
fails to get the orig node.

Fixes: 9323158ef9 ("batman-adv: OGMv2 - implement originators logic")
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-04-21 10:08:05 +02:00
Xiyu Yang
6107c5da0f batman-adv: Fix refcnt leak in batadv_store_throughput_override
batadv_show_throughput_override() invokes batadv_hardif_get_by_netdev(),
which gets a batadv_hard_iface object from net_dev with increased refcnt
and its reference is assigned to a local pointer 'hard_iface'.

When batadv_store_throughput_override() returns, "hard_iface" becomes
invalid, so the refcount should be decreased to keep refcount balanced.

The issue happens in one error path of
batadv_store_throughput_override(). When batadv_parse_throughput()
returns NULL, the refcnt increased by batadv_hardif_get_by_netdev() is
not decreased, causing a refcnt leak.

Fix this issue by jumping to "out" label when batadv_parse_throughput()
returns NULL.

Fixes: 0b5ecc6811 ("batman-adv: add throughput override attribute to hard_ifaces")
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-04-21 10:08:05 +02:00
Xiyu Yang
f872de8185 batman-adv: Fix refcnt leak in batadv_show_throughput_override
batadv_show_throughput_override() invokes batadv_hardif_get_by_netdev(),
which gets a batadv_hard_iface object from net_dev with increased refcnt
and its reference is assigned to a local pointer 'hard_iface'.

When batadv_show_throughput_override() returns, "hard_iface" becomes
invalid, so the refcount should be decreased to keep refcount balanced.

The issue happens in the normal path of
batadv_show_throughput_override(), which forgets to decrease the refcnt
increased by batadv_hardif_get_by_netdev() before the function returns,
causing a refcnt leak.

Fix this issue by calling batadv_hardif_put() before the
batadv_show_throughput_override() returns in the normal path.

Fixes: 0b5ecc6811 ("batman-adv: add throughput override attribute to hard_ifaces")
Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-04-21 10:08:05 +02:00
George Spelvin
fd0c42c4de batman-adv: fix batadv_nc_random_weight_tq
and change to pseudorandom numbers, as this is a traffic dithering
operation that doesn't need crypto-grade.

The previous code operated in 4 steps:

1. Generate a random byte 0 <= rand_tq <= 255
2. Multiply it by BATADV_TQ_MAX_VALUE - tq
3. Divide by 255 (= BATADV_TQ_MAX_VALUE)
4. Return BATADV_TQ_MAX_VALUE - rand_tq

This would apperar to scale (BATADV_TQ_MAX_VALUE - tq) by a random
value between 0/255 and 255/255.

But!  The intermediate value between steps 3 and 4 is stored in a u8
variable.  So it's truncated, and most of the time, is less than 255, after
which the division produces 0.  Specifically, if tq is odd, the product is
always even, and can never be 255.  If tq is even, there's exactly one
random byte value that will produce a product byte of 255.

Thus, the return value is 255 (511/512 of the time) or 254 (1/512
of the time).

If we assume that the truncation is a bug, and the code is meant to scale
the input, a simpler way of looking at it is that it's returning a random
value between tq and BATADV_TQ_MAX_VALUE, inclusive.

Well, we have an optimized function for doing just that.

Fixes: 3c12de9a5c ("batman-adv: network coding - code and transmit packets if possible")
Signed-off-by: George Spelvin <lkml@sdf.org>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-04-21 10:08:04 +02:00
Sven Eckelmann
26893e7e92 batman-adv: Utilize prandom_u32_max for random [0, max) values
The kernel provides a function to create random values from 0 - (max-1)
since commit f337db64af ("random32: add prandom_u32_max and convert open
coded users"). Simply use this function to replace code sections which use
prandom_u32 and a handcrafted method to map it to the correct range.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
2020-04-21 10:07:31 +02:00
Sven Eckelmann
9204a4f876 batman-adv: trace: Drop unneeded types.h include
The commit 04ae87a520 ("ftrace: Rework event_create_dir()") restructured
various macros in the ftrace framework. These changes also had the nice
side effect that the linux/types.h include is no longer necessary to define
some of the types used by these macros.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-04-21 10:07:31 +02:00
Sven Eckelmann
c08dd06b3d batman-adv: Fix spelling error in term buffer
checkpatch warns about a typo in the word bufFer which was introduced in
commit 2191c1bcbc64 ("batman-adv: kernel doc for types.h").

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-04-21 10:07:31 +02:00
Simon Wunderlich
2a7e978625 batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-04-21 10:07:30 +02:00
Xin Long
a204aef9fd xfrm: call xfrm_output_gso when inner_protocol is set in xfrm_output
An use-after-free crash can be triggered when sending big packets over
vxlan over esp with esp offload enabled:

  [] BUG: KASAN: use-after-free in ipv6_gso_pull_exthdrs.part.8+0x32c/0x4e0
  [] Call Trace:
  []  dump_stack+0x75/0xa0
  []  kasan_report+0x37/0x50
  []  ipv6_gso_pull_exthdrs.part.8+0x32c/0x4e0
  []  ipv6_gso_segment+0x2c8/0x13c0
  []  skb_mac_gso_segment+0x1cb/0x420
  []  skb_udp_tunnel_segment+0x6b5/0x1c90
  []  inet_gso_segment+0x440/0x1380
  []  skb_mac_gso_segment+0x1cb/0x420
  []  esp4_gso_segment+0xae8/0x1709 [esp4_offload]
  []  inet_gso_segment+0x440/0x1380
  []  skb_mac_gso_segment+0x1cb/0x420
  []  __skb_gso_segment+0x2d7/0x5f0
  []  validate_xmit_skb+0x527/0xb10
  []  __dev_queue_xmit+0x10f8/0x2320 <---
  []  ip_finish_output2+0xa2e/0x1b50
  []  ip_output+0x1a8/0x2f0
  []  xfrm_output_resume+0x110e/0x15f0
  []  __xfrm4_output+0xe1/0x1b0
  []  xfrm4_output+0xa0/0x200
  []  iptunnel_xmit+0x5a7/0x920
  []  vxlan_xmit_one+0x1658/0x37a0 [vxlan]
  []  vxlan_xmit+0x5e4/0x3ec8 [vxlan]
  []  dev_hard_start_xmit+0x125/0x540
  []  __dev_queue_xmit+0x17bd/0x2320  <---
  []  ip6_finish_output2+0xb20/0x1b80
  []  ip6_output+0x1b3/0x390
  []  ip6_xmit+0xb82/0x17e0
  []  inet6_csk_xmit+0x225/0x3d0
  []  __tcp_transmit_skb+0x1763/0x3520
  []  tcp_write_xmit+0xd64/0x5fe0
  []  __tcp_push_pending_frames+0x8c/0x320
  []  tcp_sendmsg_locked+0x2245/0x3500
  []  tcp_sendmsg+0x27/0x40

As on the tx path of vxlan over esp, skb->inner_network_header would be
set on vxlan_xmit() and xfrm4_tunnel_encap_add(), and the later one can
overwrite the former one. It causes skb_udp_tunnel_segment() to use a
wrong skb->inner_network_header, then the issue occurs.

This patch is to fix it by calling xfrm_output_gso() instead when the
inner_protocol is set, in which gso_segment of inner_protocol will be
done first.

While at it, also improve some code around.

Fixes: 7862b4058b ("esp: Add gso handlers for esp4 and esp6")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-04-21 07:30:06 +02:00
Xin Long
6f297068a0 esp4: support ipv6 nexthdrs process for beet gso segment
For beet mode, when it's ipv6 inner address with nexthdrs set,
the packet format might be:

    ----------------------------------------------------
    | outer  |     | dest |     |      |  ESP    | ESP |
    | IP hdr | ESP | opts.| TCP | Data | Trailer | ICV |
    ----------------------------------------------------

Before doing gso segment in xfrm4_beet_gso_segment(), the same
thing is needed as it does in xfrm6_beet_gso_segment() in last
patch 'esp6: support ipv6 nexthdrs process for beet gso segment'.

v1->v2:
  - remove skb_transport_offset(), as it will always return 0
    in xfrm6_beet_gso_segment(), thank Sabrina's check.

Fixes: 384a46ea7b ("esp4: add gso_segment for esp4 beet mode")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-04-21 07:18:14 +02:00
Xin Long
25a44ae93d esp6: support ipv6 nexthdrs process for beet gso segment
For beet mode, when it's ipv6 inner address with nexthdrs set,
the packet format might be:

    ----------------------------------------------------
    | outer  |     | dest |     |      |  ESP    | ESP |
    | IP6 hdr| ESP | opts.| TCP | Data | Trailer | ICV |
    ----------------------------------------------------

Before doing gso segment in xfrm6_beet_gso_segment(), it should
skip all nexthdrs and get the real transport proto, and set
transport_header properly.

This patch is to fix it by simply calling ipv6_skip_exthdr()
in xfrm6_beet_gso_segment().

v1->v2:
  - remove skb_transport_offset(), as it will always return 0
    in xfrm6_beet_gso_segment(), thank Sabrina's check.

Fixes: 7f9e40eb18 ("esp6: add gso_segment for esp6 beet mode")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-04-21 07:18:14 +02:00
Colin Ian King
b6246f4d8d net: ipv4: remove redundant assignment to variable rc
The variable rc is being assigned with a value that is never read
and it is being updated later with a new value. The initialization is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-20 16:26:58 -07:00
Paolo Abeni
fca5c82c08 mptcp: drop req socket remote_key* fields
We don't need them, as we can use the current ingress opt
data instead. Setting them in syn_recv_sock() may causes
inconsistent mptcp socket status, as per previous commit.

Fixes: cc7972ea19 ("mptcp: parse and emit MP_CAPABLE option according to v1 spec")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-20 12:59:33 -07:00
Paolo Abeni
4c8941de78 mptcp: avoid flipping mp_capable field in syn_recv_sock()
If multiple CPUs races on the same req_sock in syn_recv_sock(),
flipping such field can cause inconsistent child socket status.

When racing, the CPU losing the req ownership may still change
the mptcp request socket mp_capable flag while the CPU owning
the request is cloning the socket, leaving the child socket with
'is_mptcp' set but no 'mp_capable' flag.

Such socket will stay with 'conn' field cleared, heading to oops
in later mptcp callback.

Address the issue tracking the fallback status in a local variable.

Fixes: 58b0991962 ("mptcp: create msk early")
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-20 12:59:32 -07:00
Florian Westphal
5e20087d1b mptcp: handle mptcp listener destruction via rcu
Following splat can occur during self test:

 BUG: KASAN: use-after-free in subflow_data_ready+0x156/0x160
 Read of size 8 at addr ffff888100c35c28 by task mptcp_connect/4808

  subflow_data_ready+0x156/0x160
  tcp_child_process+0x6a3/0xb30
  tcp_v4_rcv+0x2231/0x3730
  ip_protocol_deliver_rcu+0x5c/0x860
  ip_local_deliver_finish+0x220/0x360
  ip_local_deliver+0x1c8/0x4e0
  ip_rcv_finish+0x1da/0x2f0
  ip_rcv+0xd0/0x3c0
  __netif_receive_skb_one_core+0xf5/0x160
  __netif_receive_skb+0x27/0x1c0
  process_backlog+0x21e/0x780
  net_rx_action+0x35f/0xe90
  do_softirq+0x4c/0x50
  [..]

This occurs when accessing subflow_ctx->conn.

Problem is that tcp_child_process() calls listen sockets'
sk_data_ready() notification, but it doesn't hold the listener
lock.  Another cpu calling close() on the listener will then cause
transition of refcount to 0.

Fixes: 58b0991962 ("mptcp: create msk early")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-20 12:59:32 -07:00
Andrew Lunn
77e9b2ab45 net: ethtool: self_test: Mark interface in testing operative status
When an interface is executing a self test, put the interface into
operative status testing.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-20 12:43:24 -07:00
Andrew Lunn
db30a57779 net: Add testing sysfs attribute
Similar to speed, duplex and dorment, report the testing status
in sysfs.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-20 12:43:24 -07:00
Andrew Lunn
eec517cdb4 net: Add IF_OPER_TESTING
RFC 2863 defines the operational state testing. Add support for this
state, both as a IF_LINK_MODE_ and __LINK_STATE_.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-20 12:43:24 -07:00
John Haxby
82c9ae4408 ipv6: fix restrict IPV6_ADDRFORM operation
Commit b6f6118901 ("ipv6: restrict IPV6_ADDRFORM operation") fixed a
problem found by syzbot an unfortunate logic error meant that it
also broke IPV6_ADDRFORM.

Rearrange the checks so that the earlier test is just one of the series
of checks made before moving the socket from IPv6 to IPv4.

Fixes: b6f6118901 ("ipv6: restrict IPV6_ADDRFORM operation")
Signed-off-by: John Haxby <john.haxby@oracle.com>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-20 11:06:39 -07:00
Tonghao Zhang
27de77cec9 net: openvswitch: ovs_ct_exit to be done under ovs_lock
syzbot wrote:
| =============================
| WARNING: suspicious RCU usage
| 5.7.0-rc1+ #45 Not tainted
| -----------------------------
| net/openvswitch/conntrack.c:1898 RCU-list traversed in non-reader section!!
|
| other info that might help us debug this:
| rcu_scheduler_active = 2, debug_locks = 1
| ...
|
| stack backtrace:
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-0-ga698c8995f-prebuilt.qemu.org 04/01/2014
| Workqueue: netns cleanup_net
| Call Trace:
| ...
| ovs_ct_exit
| ovs_exit_net
| ops_exit_list.isra.7
| cleanup_net
| process_one_work
| worker_thread

To avoid that warning, invoke the ovs_ct_exit under ovs_lock and add
lockdep_ovsl_is_held as optional lockdep expression.

Link: https://lore.kernel.org/lkml/000000000000e642a905a0cbee6e@google.com
Fixes: 11efd5cb04 ("openvswitch: Support conntrack zone limit")
Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: Yi-Hung Wei <yihung.wei@gmail.com>
Reported-by: syzbot+7ef50afd3a211f879112@syzkaller.appspotmail.com
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-20 10:53:54 -07:00
Chuck Lever
48a124e383 xprtrdma: Fix use of xdr_stream_encode_item_{present, absent}
These new helpers do not return 0 on success, they return the
encoded size. Thus they are not a drop-in replacement for the
old helpers.

Fixes: 5c266df527 ("SUNRPC: Add encoders for list item discriminators")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-04-20 10:45:01 -04:00
Chuck Lever
bdb2ce8281 xprtrdma: Fix trace point use-after-free race
It's not safe to use resources pointed to by the @send_wr of
ib_post_send() _after_ that function returns. Those resources are
typically freed by the Send completion handler, which can run before
ib_post_send() returns.

Thus the trace points currently around ib_post_send() in the
client's RPC/RDMA transport are a hazard, even when they are
disabled. Rearrange them so that they touch the Work Request only
_before_ ib_post_send() is invoked.

Fixes: ab03eff58e ("xprtrdma: Add trace points in RPC Call transmit paths")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-04-20 10:44:01 -04:00
Chuck Lever
58bd6656f8 xprtrdma: Restore wake-up-all to rpcrdma_cm_event_handler()
Commit e28ce90083 ("xprtrdma: kmalloc rpcrdma_ep separate from
rpcrdma_xprt") erroneously removed a xprt_force_disconnect()
call from the "transport disconnect" path. The result was that the
client no longer responded to server-side disconnect requests.

Restore that call.

Fixes: e28ce90083 ("xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-04-20 10:41:00 -04:00
Sabrina Dubroca
9f0cadc32d xfrm: espintcp: save and call old ->sk_destruct
When ESP encapsulation is enabled on a TCP socket, I'm replacing the
existing ->sk_destruct callback with espintcp_destruct. We still need to
call the old callback to perform the other cleanups when the socket is
destroyed. Save the old callback, and call it from espintcp_destruct.

Fixes: e27cca96cd ("xfrm: add espintcp (RFC 8229)")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-04-20 07:34:16 +02:00
Xin Long
db87668ad1 xfrm: remove the xfrm_state_put call becofe going to out_reset
This xfrm_state_put call in esp4/6_gro_receive() will cause
double put for state, as in out_reset path secpath_reset()
will put all states set in skb sec_path.

So fix it by simply remove the xfrm_state_put call.

Fixes: 6ed69184ed ("xfrm: Reset secpath in xfrm failure")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-04-20 07:17:32 +02:00
Hillf Danton
b4faef1739 netfilter: nat: fix error handling upon registering inet hook
A case of warning was reported by syzbot.

------------[ cut here ]------------
WARNING: CPU: 0 PID: 19934 at net/netfilter/nf_nat_core.c:1106
nf_nat_unregister_fn+0x532/0x5c0 net/netfilter/nf_nat_core.c:1106
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 19934 Comm: syz-executor.5 Not tainted 5.6.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x188/0x20d lib/dump_stack.c:118
 panic+0x2e3/0x75c kernel/panic.c:221
 __warn.cold+0x2f/0x35 kernel/panic.c:582
 report_bug+0x27b/0x2f0 lib/bug.c:195
 fixup_bug arch/x86/kernel/traps.c:175 [inline]
 fixup_bug arch/x86/kernel/traps.c:170 [inline]
 do_error_trap+0x12b/0x220 arch/x86/kernel/traps.c:267
 do_invalid_op+0x32/0x40 arch/x86/kernel/traps.c:286
 invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
RIP: 0010:nf_nat_unregister_fn+0x532/0x5c0 net/netfilter/nf_nat_core.c:1106
Code: ff df 48 c1 ea 03 80 3c 02 00 75 75 48 8b 44 24 10 4c 89 ef 48 c7 00 00 00 00 00 e8 e8 f8 53 fb e9 4d fe ff ff e8 ee 9c 16 fb <0f> 0b e9 41 fe ff ff e8 e2 45 54 fb e9 b5 fd ff ff 48 8b 7c 24 20
RSP: 0018:ffffc90005487208 EFLAGS: 00010246
RAX: 0000000000040000 RBX: 0000000000000004 RCX: ffffc9001444a000
RDX: 0000000000040000 RSI: ffffffff865c94a2 RDI: 0000000000000005
RBP: ffff88808b5cf000 R08: ffff8880a2620140 R09: fffffbfff14bcd79
R10: ffffc90005487208 R11: fffffbfff14bcd78 R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
 nf_nat_ipv6_unregister_fn net/netfilter/nf_nat_proto.c:1017 [inline]
 nf_nat_inet_register_fn net/netfilter/nf_nat_proto.c:1038 [inline]
 nf_nat_inet_register_fn+0xfc/0x140 net/netfilter/nf_nat_proto.c:1023
 nf_tables_register_hook net/netfilter/nf_tables_api.c:224 [inline]
 nf_tables_addchain.constprop.0+0x82e/0x13c0 net/netfilter/nf_tables_api.c:1981
 nf_tables_newchain+0xf68/0x16a0 net/netfilter/nf_tables_api.c:2235
 nfnetlink_rcv_batch+0x83a/0x1610 net/netfilter/nfnetlink.c:433
 nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:543 [inline]
 nfnetlink_rcv+0x3af/0x420 net/netfilter/nfnetlink.c:561
 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
 netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329
 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 ____sys_sendmsg+0x6bf/0x7e0 net/socket.c:2362
 ___sys_sendmsg+0x100/0x170 net/socket.c:2416
 __sys_sendmsg+0xec/0x1b0 net/socket.c:2449
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:295
 entry_SYSCALL_64_after_hwframe+0x49/0xb3

and to quiesce it, unregister NFPROTO_IPV6 hook instead of NFPROTO_INET
in case of failing to register NFPROTO_IPV4 hook.

Reported-by: syzbot <syzbot+33e06702fd6cffc24c40@syzkaller.appspotmail.com>
Fixes: d164385ec5 ("netfilter: nat: add inet family nat support")
Cc: Florian Westphal <fw@strlen.de>
Cc: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-19 14:59:31 +02:00
Florian Westphal
9f5ca6a598 mptcp: fix 'Attempt to release TCP socket in state' warnings
We need to set sk_state to CLOSED, else we will get following:

IPv4: Attempt to release TCP socket in state 3 00000000b95f109e
IPv4: Attempt to release TCP socket in state 10 00000000b95f109e

First one is from inet_sock_destruct(), second one from
mptcp_sk_clone failure handling.  Setting sk_state to CLOSED isn't
enough, we also need to orphan sk so it has DEAD flag set.
Otherwise, a very similar warning is printed from inet_sock_destruct().

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-18 15:43:20 -07:00
Florian Westphal
df1036da90 mptcp: fix splat when incoming connection is never accepted before exit/close
Following snippet (replicated from syzkaller reproducer) generates
warning: "IPv4: Attempt to release TCP socket in state 1".

int main(void) {
 struct sockaddr_in sin1 = { .sin_family = 2, .sin_port = 0x4e20,
                             .sin_addr.s_addr = 0x010000e0, };
 struct sockaddr_in sin2 = { .sin_family = 2,
	                     .sin_addr.s_addr = 0x0100007f, };
 struct sockaddr_in sin3 = { .sin_family = 2, .sin_port = 0x4e20,
	                     .sin_addr.s_addr = 0x0100007f, };
 int r0 = socket(0x2, 0x1, 0x106);
 int r1 = socket(0x2, 0x1, 0x106);

 bind(r1, (void *)&sin1, sizeof(sin1));
 connect(r1, (void *)&sin2, sizeof(sin2));
 listen(r1, 3);
 return connect(r0, (void *)&sin3, 0x4d);
}

Reason is that the newly generated mptcp socket is closed via the ulp
release of the tcp listener socket when its accept backlog gets purged.

To fix this, delay setting the ESTABLISHED state until after userspace
calls accept and via mptcp specific destructor.

Fixes: 58b0991962 ("mptcp: create msk early")
Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/9
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-18 15:43:20 -07:00
Alexander Aring
62e697767f ipv6: rpl: fix full address compression
This patch makes it impossible that cmpri or cmpre values are set to the
value 16 which is not possible, because these are 4 bit values. We
currently run in an overflow when assigning the value 16 to it.

According to the standard a value of 16 can be interpreted as a full
elided address which isn't possible to set as compression value. A reason
why this cannot be set is that the current ipv6 header destination address
should never show up inside the segments of the rpl header. In this case we
run in a overflow and the address will have no compression at all. Means
cmpri or compre is set to 0.

As we handle cmpri and cmpre sometimes as unsigned char or 4 bit value
inside the rpl header the current behaviour ends in an invalid header
format. This patch simple use the best compression method if we ever run
into the case that the destination address is showed up inside the rpl
segments. We avoid the overflow handling and the rpl header is still valid,
even when we have the destination address inside the rpl segments.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-18 15:04:27 -07:00
Xiyu Yang
de05842076 tipc: Fix potential tipc_node refcnt leak in tipc_rcv
tipc_rcv() invokes tipc_node_find() twice, which returns a reference of
the specified tipc_node object to "n" with increased refcnt.

When tipc_rcv() returns or a new object is assigned to "n", the original
local reference of "n" becomes invalid, so the refcount should be
decreased to keep refcount balanced.

The issue happens in some paths of tipc_rcv(), which forget to decrease
the refcnt increased by tipc_node_find() and will cause a refcnt leak.

Fix this issue by calling tipc_node_put() before the original object
pointed by "n" becomes invalid.

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-18 13:24:20 -07:00
Xiyu Yang
441870ee42 tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv
tipc_crypto_rcv() invokes tipc_aead_get(), which returns a reference of
the tipc_aead object to "aead" with increased refcnt.

When tipc_crypto_rcv() returns, the original local reference of "aead"
becomes invalid, so the refcount should be decreased to keep refcount
balanced.

The issue happens in one error path of tipc_crypto_rcv(). When TIPC
message decryption status is EINPROGRESS or EBUSY, the function forgets
to decrease the refcnt increased by tipc_aead_get() and causes a refcnt
leak.

Fix this issue by calling tipc_aead_put() on the error path when TIPC
message decryption status is EINPROGRESS or EBUSY.

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-18 13:17:04 -07:00
Xiyu Yang
d03f228470 net: netrom: Fix potential nr_neigh refcnt leak in nr_add_node
nr_add_node() invokes nr_neigh_get_dev(), which returns a local
reference of the nr_neigh object to "nr_neigh" with increased refcnt.

When nr_add_node() returns, "nr_neigh" becomes invalid, so the refcount
should be decreased to keep refcount balanced.

The issue happens in one normal path of nr_add_node(), which forgets to
decrease the refcnt increased by nr_neigh_get_dev() and causes a refcnt
leak. It should decrease the refcnt before the function returns like
other normal paths do.

Fix this issue by calling nr_neigh_put() before the nr_add_node()
returns.

Signed-off-by: Xiyu Yang <xiyuyang19@fudan.edu.cn>
Signed-off-by: Xin Tan <tanxin.ctf@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-18 13:09:46 -07:00
David S. Miller
513a24ffb3 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2020-04-17

Here's the first bluetooth-next pull request for the 5.8 kernel:

 - Added debugfs option to control MITM flag usage during pairing
 - Added new BT_MODE socket option
 - Added support for Qualcom QCA6390 device
 - Added support for Realtek RTL8761B device
 - Added support for mSBC audio codec over USB endpoints
 - Added framework for Microsoft HCI vendor extensions
 - Added new Read Security Information management command
 - Fixes/cleanup to link layer privacy related code
 - Various other smaller cleanups & fixes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-17 10:48:46 -07:00
Chuck Lever
23cf1ee1f1 svcrdma: Fix leak of svc_rdma_recv_ctxt objects
Utilize the xpo_release_rqst transport method to ensure that each
rqstp's svc_rdma_recv_ctxt object is released even when the server
cannot return a Reply for that rqstp.

Without this fix, each RPC whose Reply cannot be sent leaks one
svc_rdma_recv_ctxt. This is a 2.5KB structure, a 4KB DMA-mapped
Receive buffer, and any pages that might be part of the Reply
message.

The leak is infrequent unless the network fabric is unreliable or
Kerberos is in use, as GSS sequence window overruns, which result
in connection loss, are more common on fast transports.

Fixes: 3a88092ee3 ("svcrdma: Preserve Receive buffer until svc_rdma_sendto")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-17 12:40:38 -04:00
Chuck Lever
e28b4fc652 svcrdma: Fix trace point use-after-free race
I hit this while testing nfsd-5.7 with kernel memory debugging
enabled on my server:

Mar 30 13:21:45 klimt kernel: BUG: unable to handle page fault for address: ffff8887e6c279a8
Mar 30 13:21:45 klimt kernel: #PF: supervisor read access in kernel mode
Mar 30 13:21:45 klimt kernel: #PF: error_code(0x0000) - not-present page
Mar 30 13:21:45 klimt kernel: PGD 3601067 P4D 3601067 PUD 87c519067 PMD 87c3e2067 PTE 800ffff8193d8060
Mar 30 13:21:45 klimt kernel: Oops: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
Mar 30 13:21:45 klimt kernel: CPU: 2 PID: 1933 Comm: nfsd Not tainted 5.6.0-rc6-00040-g881e87a3c6f9 #1591
Mar 30 13:21:45 klimt kernel: Hardware name: Supermicro Super Server/X10SRL-F, BIOS 1.0c 09/09/2015
Mar 30 13:21:45 klimt kernel: RIP: 0010:svc_rdma_post_chunk_ctxt+0xab/0x284 [rpcrdma]
Mar 30 13:21:45 klimt kernel: Code: c1 83 34 02 00 00 29 d0 85 c0 7e 72 48 8b bb a0 02 00 00 48 8d 54 24 08 4c 89 e6 48 8b 07 48 8b 40 20 e8 5a 5c 2b e1 41 89 c6 <8b> 45 20 89 44 24 04 8b 05 02 e9 01 00 85 c0 7e 33 e9 5e 01 00 00
Mar 30 13:21:45 klimt kernel: RSP: 0018:ffffc90000dfbdd8 EFLAGS: 00010286
Mar 30 13:21:45 klimt kernel: RAX: 0000000000000000 RBX: ffff8887db8db400 RCX: 0000000000000030
Mar 30 13:21:45 klimt kernel: RDX: 0000000000000040 RSI: 0000000000000000 RDI: 0000000000000246
Mar 30 13:21:45 klimt kernel: RBP: ffff8887e6c27988 R08: 0000000000000000 R09: 0000000000000004
Mar 30 13:21:45 klimt kernel: R10: ffffc90000dfbdd8 R11: 00c068ef00000000 R12: ffff8887eb4e4a80
Mar 30 13:21:45 klimt kernel: R13: ffff8887db8db634 R14: 0000000000000000 R15: ffff8887fc931000
Mar 30 13:21:45 klimt kernel: FS:  0000000000000000(0000) GS:ffff88885bd00000(0000) knlGS:0000000000000000
Mar 30 13:21:45 klimt kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Mar 30 13:21:45 klimt kernel: CR2: ffff8887e6c279a8 CR3: 000000081b72e002 CR4: 00000000001606e0
Mar 30 13:21:45 klimt kernel: Call Trace:
Mar 30 13:21:45 klimt kernel: ? svc_rdma_vec_to_sg+0x7f/0x7f [rpcrdma]
Mar 30 13:21:45 klimt kernel: svc_rdma_send_write_chunk+0x59/0xce [rpcrdma]
Mar 30 13:21:45 klimt kernel: svc_rdma_sendto+0xf9/0x3ae [rpcrdma]
Mar 30 13:21:45 klimt kernel: ? nfsd_destroy+0x51/0x51 [nfsd]
Mar 30 13:21:45 klimt kernel: svc_send+0x105/0x1e3 [sunrpc]
Mar 30 13:21:45 klimt kernel: nfsd+0xf2/0x149 [nfsd]
Mar 30 13:21:45 klimt kernel: kthread+0xf6/0xfb
Mar 30 13:21:45 klimt kernel: ? kthread_queue_delayed_work+0x74/0x74
Mar 30 13:21:45 klimt kernel: ret_from_fork+0x3a/0x50
Mar 30 13:21:45 klimt kernel: Modules linked in: ocfs2_dlmfs ocfs2_stack_o2cb ocfs2_dlm ocfs2_nodemanager ocfs2_stackglue ib_umad ib_ipoib mlx4_ib sb_edac x86_pkg_temp_thermal iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel glue_helper crypto_simd cryptd pcspkr rpcrdma i2c_i801 rdma_ucm lpc_ich mfd_core ib_iser rdma_cm iw_cm ib_cm mei_me raid0 libiscsi mei sg scsi_transport_iscsi ioatdma wmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter nfsd nfs_acl lockd auth_rpcgss grace sunrpc ip_tables xfs libcrc32c mlx4_en sd_mod sr_mod cdrom mlx4_core crc32c_intel igb nvme i2c_algo_bit ahci i2c_core libahci nvme_core dca libata t10_pi qedr dm_mirror dm_region_hash dm_log dm_mod dax qede qed crc8 ib_uverbs ib_core
Mar 30 13:21:45 klimt kernel: CR2: ffff8887e6c279a8
Mar 30 13:21:45 klimt kernel: ---[ end trace 87971d2ad3429424 ]---

It's absolutely not safe to use resources pointed to by the @send_wr
argument of ib_post_send() _after_ that function returns. Those
resources are typically freed by the Send completion handler, which
can run before ib_post_send() returns.

Thus the trace points currently around ib_post_send() in the
server's RPC/RDMA transport are a hazard, even when they are
disabled. Rearrange them so that they touch the Work Request only
_before_ ib_post_send() is invoked.

Fixes: bd2abef333 ("svcrdma: Trace key RDMA API events")
Fixes: 4201c74647 ("svcrdma: Introduce svc_rdma_send_ctxt")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-17 12:40:38 -04:00
Chuck Lever
6221f1d9b6 SUNRPC: Fix backchannel RPC soft lockups
Currently, after the forward channel connection goes away,
backchannel operations are causing soft lockups on the server
because call_transmit_status's SOFTCONN logic ignores ENOTCONN.
Such backchannel Calls are aggressively retried until the client
reconnects.

Backchannel Calls should use RPC_TASK_NOCONNECT rather than
RPC_TASK_SOFTCONN. If there is no forward connection, the server is
not capable of establishing a connection back to the client, thus
that backchannel request should fail before the server attempts to
send it. Commit 58255a4e3c ("NFSD: NFSv4 callback client should
use RPC_TASK_SOFTCONN") was merged several years before
RPC_TASK_NOCONNECT was available.

Because setup_callback_client() explicitly sets NOPING, the NFSv4.0
callback connection depends on the first callback RPC to initiate
a connection to the client. Thus NFSv4.0 needs to continue to use
RPC_TASK_SOFTCONN.

Suggested-by: Trond Myklebust <trondmy@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: <stable@vger.kernel.org> # v4.20+
2020-04-17 12:40:31 -04:00
Linus Torvalds
c8372665b4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Disable RISCV BPF JIT builds when !MMU, from Björn Töpel.

 2) nf_tables leaves dangling pointer after free, fix from Eric Dumazet.

 3) Out of boundary write in __xsk_rcv_memcpy(), fix from Li RongQing.

 4) Adjust icmp6 message source address selection when routes have a
    preferred source address set, from Tim Stallard.

 5) Be sure to validate HSR protocol version when creating new links,
    from Taehee Yoo.

 6) CAP_NET_ADMIN should be sufficient to manage l2tp tunnels even in
    non-initial namespaces, from Michael Weiß.

 7) Missing release firmware call in mlx5, from Eran Ben Elisha.

 8) Fix variable type in macsec_changelink(), caught by KASAN. Fix from
    Taehee Yoo.

 9) Fix pause frame negotiation in marvell phy driver, from Clemens
    Gruber.

10) Record RX queue early enough in tun packet paths such that XDP
    programs will see the correct RX queue index, from Gilberto Bertin.

11) Fix double unlock in mptcp, from Florian Westphal.

12) Fix offset overflow in ARM bpf JIT, from Luke Nelson.

13) marvell10g needs to soft reset PHY when coming out of low power
    mode, from Russell King.

14) Fix MTU setting regression in stmmac for some chip types, from
    Florian Fainelli.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (101 commits)
  amd-xgbe: Use __napi_schedule() in BH context
  mISDN: make dmril and dmrim static
  net: stmmac: dwmac-sunxi: Provide TX and RX fifo sizes
  net: dsa: mt7530: fix tagged frames pass-through in VLAN-unaware mode
  tipc: fix incorrect increasing of link window
  Documentation: Fix tcp_challenge_ack_limit default value
  net: tulip: make early_486_chipsets static
  dt-bindings: net: ethernet-phy: add desciption for ethernet-phy-id1234.d400
  ipv6: remove redundant assignment to variable err
  net/rds: Use ERR_PTR for rds_message_alloc_sgs()
  net: mscc: ocelot: fix untagged packet drops when enslaving to vlan aware bridge
  selftests/bpf: Check for correct program attach/detach in xdp_attach test
  libbpf: Fix type of old_fd in bpf_xdp_set_link_opts
  libbpf: Always specify expected_attach_type on program load if supported
  xsk: Add missing check on user supplied headroom size
  mac80211: fix channel switch trigger from unknown mesh peer
  mac80211: fix race in ieee80211_register_hw()
  net: marvell10g: soft-reset the PHY when coming out of low power
  net: marvell10g: report firmware version
  net/cxgb4: Check the return from t4_query_params properly
  ...
2020-04-16 14:52:29 -07:00
Marcel Holtmann
7edc907954 Bluetooth: Enhanced Connection Complete event belongs to LL Privacy
The Enhanced Connection Complete event is use in conjunction with LL
Privacy and not Extended Advertising.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-16 12:43:03 +03:00
Tuong Lien
edadedf1c5 tipc: fix incorrect increasing of link window
In commit 16ad3f4022 ("tipc: introduce variable window congestion
control"), we allow link window to change with the congestion avoidance
algorithm. However, there is a bug that during the slow-start if packet
retransmission occurs, the link will enter the fast-recovery phase, set
its window to the 'ssthresh' which is never less than 300, so the link
window suddenly increases to that limit instead of decreasing.

Consequently, two issues have been observed:

- For broadcast-link: it can leave a gap between the link queues that a
new packet will be inserted and sent before the previous ones, i.e. not
in-order.

- For unicast: the algorithm does not work as expected, the link window
jumps to the slow-start threshold whereas packet retransmission occurs.

This commit fixes the issues by avoiding such the link window increase,
but still decreasing if the 'ssthresh' is lowered.

Fixes: 16ad3f4022 ("tipc: introduce variable window congestion control")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-15 16:23:33 -07:00
Colin Ian King
672e24772a ipv6: remove redundant assignment to variable err
The variable err is being initialized with a value that is never read
and it is being updated later with a new value.  The initialization is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-15 16:20:22 -07:00
Will Deacon
9a8939490d net: tls: Avoid assigning 'const' pointer to non-const pointer
tls_build_proto() uses WRITE_ONCE() to assign a 'const' pointer to a
'non-const' pointer. Cleanups to the implementation of WRITE_ONCE() mean
that this will give rise to a compiler warning, just like a plain old
assignment would do:

  | net/tls/tls_main.c: In function ‘tls_build_proto’:
  | ./include/linux/compiler.h:229:30: warning: assignment discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
  | net/tls/tls_main.c:640:4: note: in expansion of macro ‘smp_store_release’
  |   640 |    smp_store_release(&saved_tcpv6_prot, prot);
  |       |    ^~~~~~~~~~~~~~~~~

Drop the const qualifier from the local 'prot' variable, as it isn't
needed.

Cc: Boris Pismenny <borisp@mellanox.com>
Cc: Aviad Yehezkel <aviadye@mellanox.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Will Deacon <will@kernel.org>
2020-04-15 21:36:41 +01:00
Will Deacon
514cc55b01 netfilter: Avoid assigning 'const' pointer to non-const pointer
nf_remove_net_hook() uses WRITE_ONCE() to assign a 'const' pointer to a
'non-const' pointer. Cleanups to the implementation of WRITE_ONCE() mean
that this will give rise to a compiler warning, just like a plain old
assignment would do:

  | In file included from ./include/linux/export.h:43,
  |                  from ./include/linux/linkage.h:7,
  |                  from ./include/linux/kernel.h:8,
  |                  from net/netfilter/core.c:9:
  | net/netfilter/core.c: In function ‘nf_remove_net_hook’:
  | ./include/linux/compiler.h:216:30: warning: assignment discards ‘const’ qualifier from pointer target type [-Wdiscarded-qualifiers]
  |   *(volatile typeof(x) *)&(x) = (val);  \
  |                               ^
  | net/netfilter/core.c:379:3: note: in expansion of macro ‘WRITE_ONCE’
  |    WRITE_ONCE(orig_ops[i], &dummy_ops);
  |    ^~~~~~~~~~

Follow the pattern used elsewhere in this file and add a cast to 'void *'
to squash the warning.

Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Cc: Florian Westphal <fw@strlen.de>
Cc: "David S. Miller" <davem@davemloft.net>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Will Deacon <will@kernel.org>
2020-04-15 21:36:41 +01:00
Jason Gunthorpe
7dba92037b net/rds: Use ERR_PTR for rds_message_alloc_sgs()
Returning the error code via a 'int *ret' when the function returns a
pointer is very un-kernely and causes gcc 10's static analysis to choke:

net/rds/message.c: In function ‘rds_message_map_pages’:
net/rds/message.c:358:10: warning: ‘ret’ may be used uninitialized in this function [-Wmaybe-uninitialized]
  358 |   return ERR_PTR(ret);

Use a typical ERR_PTR return instead.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-15 12:33:29 -07:00
David S. Miller
78b877113f Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-04-15

The following pull-request contains BPF updates for your *net* tree.

We've added 10 non-merge commits during the last 3 day(s) which contain
a total of 11 files changed, 238 insertions(+), 95 deletions(-).

The main changes are:

1) Fix offset overflow for BPF_MEM BPF_DW insn mapping on arm32 JIT,
   from Luke Nelson and Xi Wang.

2) Prevent mprotect() to make frozen & mmap()'ed BPF map writeable
   again, from Andrii Nakryiko and Jann Horn.

3) Fix type of old_fd in bpf_xdp_set_link_opts to int in libbpf and add
   selftests, from Toke Høiland-Jørgensen.

4) Fix AF_XDP to check that headroom cannot be larger than the available
   space in the chunk, from Magnus Karlsson.

5) Fix reset of XDP prog when expected_fd is set, from David Ahern.

6) Fix a segfault in bpftool's struct_ops command when BTF is not
   available, from Daniel T. Lee.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-15 11:35:42 -07:00
David S. Miller
6058ee09ec A couple of fixes:
* FTM responder policy netlink validation fix
    (but the only user validates again later)
  * kernel-doc fixes
  * a fix for a race in mac80211 radio registration vs. userspace
  * a mesh channel switch fix
  * a fix for a syzbot reported kasprintf() issue
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl6WyIAACgkQB8qZga/f
 l8TqyBAAnR+HrstmY0g8pIfsN925axUP4blRuS4U55olmEvS4xFOKhIQ5ksMceD0
 ndvTxXOjJ3EA6Zc5icGG+hcPihhcO/V6UzE83KHqVEa0LP7o6Nmjyh5kiyl1HS6y
 naB6/g3AQcUGqD1zsT8EZ6HeePXDVMxnT0CORC2JtXJCPmBeLpfAROEhKmLuuonp
 WugB8monGfWQuNJTNILTg4Ya8/LR4GW7ABTNrHOnAqCpexBhrGcxdbiiqTm8ionF
 sLcXkfzGd7KudTxS0qTJZnR3Bja7o96FjPxy1If5eQFUiDbgVYr0C8I69X6fTzyP
 l4O5b1DRcLLX/1N8H5DmUsRMk4froUtVY9hqTIVE956d1HWqOSfXXMx4yxpvqCp+
 dzOM6mHzfWp29aKSNqtbkrDjGJB7EviR8Fan6Sngh2EgvUVhpdwpAaKepdcZHUkW
 th7ukrh/KTTaied8zu8CS5RVJPn/m4afsxl05yGSDh7kJ+VN9We+0fytOb+zvyUi
 HO9ru0w++zqdXU82l4qaQvLJXISpEyf53IqM2PnHIS5Zvo/3HnGC2vogHGiwAccY
 yNbD3IYueGIgzRrSN2CaAmoFY1CzC5rK73y9Qh1JIJqWOKv66MhBBPjm3kBraiqM
 hj5BH3Qz9PYnaNk1uDwHZN9dLPYQr0GPHPJEMx3QsRUVau95F0Q=
 =1yy8
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2020-04-15' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
A couple of fixes:
 * FTM responder policy netlink validation fix
   (but the only user validates again later)
 * kernel-doc fixes
 * a fix for a race in mac80211 radio registration vs. userspace
 * a mesh channel switch fix
 * a fix for a syzbot reported kasprintf() issue
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-15 11:27:23 -07:00
Marcel Holtmann
ff3b8df2bd Bluetooth: Enable LE Enhanced Connection Complete event.
In case LL Privacy is supported by the controller, it is also a good
idea to use the LE Enhanced Connection Complete event for getting all
information about the new connection and its addresses.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-15 16:51:05 +03:00
Marcel Holtmann
849c9c35e8 Bluetooth: Use extra variable to make code more readable
When starting active scanning for discovery the whitelist is not needed
to be used. So the filter_policy is 0x00. To make the core more readable
use a variable name instead of just setting 0 as paramter.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-15 16:51:02 +03:00
Magnus Karlsson
99e3a236dd xsk: Add missing check on user supplied headroom size
Add a check that the headroom cannot be larger than the available
space in the chunk. In the current code, a malicious user can set the
headroom to a value larger than the chunk size minus the fixed XDP
headroom. That way packets with a length larger than the supported
size in the umem could get accepted and result in an out-of-bounds
write.

Fixes: c0c77d8fb7 ("xsk: add user memory registration support sockopt")
Reported-by: Bui Quang Minh <minhquangbui99@gmail.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://bugzilla.kernel.org/show_bug.cgi?id=207225
Link: https://lore.kernel.org/bpf/1586849715-23490-1-git-send-email-magnus.karlsson@intel.com
2020-04-15 13:07:18 +02:00
Tamizh chelvam
93e2d04a18 mac80211: fix channel switch trigger from unknown mesh peer
Previously mesh channel switch happens if beacon contains
CSA IE without checking the mesh peer info. Due to that
channel switch happens even if the beacon is not from
its own mesh peer. Fixing that by checking if the CSA
originated from the same mesh network before proceeding
for channel switch.

Signed-off-by: Tamizh chelvam <tamizhr@codeaurora.org>
Link: https://lore.kernel.org/r/1585403604-29274-1-git-send-email-tamizhr@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-15 09:54:26 +02:00
Xin Long
3c96ec5682 esp6: get the right proto for transport mode in esp6_gso_encap
For transport mode, when ipv6 nexthdr is set, the packet format might
be like:

    ----------------------------------------------------
    |        | dest |     |     |      |  ESP    | ESP |
    | IP6 hdr| opts.| ESP | TCP | Data | Trailer | ICV |
    ----------------------------------------------------

What it wants to get for x-proto in esp6_gso_encap() is the proto that
will be set in ESP nexthdr. So it should skip all ipv6 nexthdrs and
get the real transport protocol. Othersize, the wrong proto number
will be set into ESP nexthdr.

This patch is to skip all ipv6 nexthdrs by calling ipv6_skip_exthdr()
in esp6_gso_encap().

Fixes: 7862b4058b ("esp: Add gso handlers for esp4 and esp6")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-04-15 09:50:03 +02:00
Xin Long
06a0afcfe2 xfrm: do pskb_pull properly in __xfrm_transport_prep
For transport mode, when ipv6 nexthdr is set, the packet format might
be like:

    ----------------------------------------------------
    |        | dest |     |     |      |  ESP    | ESP |
    | IP6 hdr| opts.| ESP | TCP | Data | Trailer | ICV |
    ----------------------------------------------------

and in __xfrm_transport_prep():

  pskb_pull(skb, skb->mac_len + sizeof(ip6hdr) + x->props.header_len);

it will pull the data pointer to the wrong position, as it missed the
nexthdrs/dest opts.

This patch is to fix it by using:

  pskb_pull(skb, skb_transport_offset(skb) + x->props.header_len);

as we can be sure transport_header points to ESP header at that moment.

It also fixes a panic when packets with ipv6 nexthdr are sent over
esp6 transport mode:

  [  100.473845] kernel BUG at net/core/skbuff.c:4325!
  [  100.478517] RIP: 0010:__skb_to_sgvec+0x252/0x260
  [  100.494355] Call Trace:
  [  100.494829]  skb_to_sgvec+0x11/0x40
  [  100.495492]  esp6_output_tail+0x12e/0x550 [esp6]
  [  100.496358]  esp6_xmit+0x1d5/0x260 [esp6_offload]
  [  100.498029]  validate_xmit_xfrm+0x22f/0x2e0
  [  100.499604]  __dev_queue_xmit+0x589/0x910
  [  100.502928]  ip6_finish_output2+0x2a5/0x5a0
  [  100.503718]  ip6_output+0x6c/0x120
  [  100.505198]  xfrm_output_resume+0x4bf/0x530
  [  100.508683]  xfrm6_output+0x3a/0xc0
  [  100.513446]  inet6_csk_xmit+0xa1/0xf0
  [  100.517335]  tcp_sendmsg+0x27/0x40
  [  100.517977]  sock_sendmsg+0x3e/0x60
  [  100.518648]  __sys_sendto+0xee/0x160

Fixes: c35fe4106b ("xfrm: Add mode handlers for IPsec on layer 2")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-04-15 09:50:03 +02:00
Xin Long
afcaf61be9 xfrm: allow to accept packets with ipv6 NEXTHDR_HOP in xfrm_input
For beet mode, when it's ipv6 inner address with nexthdrs set,
the packet format might be:

    ----------------------------------------------------
    | outer  |     | dest |     |      |  ESP    | ESP |
    | IP hdr | ESP | opts.| TCP | Data | Trailer | ICV |
    ----------------------------------------------------

The nexthdr from ESP could be NEXTHDR_HOP(0), so it should
continue processing the packet when nexthdr returns 0 in
xfrm_input(). Otherwise, when ipv6 nexthdr is set, the
packet will be dropped.

I don't see any error cases that nexthdr may return 0. So
fix it by removing the check for nexthdr == 0.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-04-15 09:50:03 +02:00
Sumit Garg
52e04b4ce5 mac80211: fix race in ieee80211_register_hw()
A race condition leading to a kernel crash is observed during invocation
of ieee80211_register_hw() on a dragonboard410c device having wcn36xx
driver built as a loadable module along with a wifi manager in user-space
waiting for a wifi device (wlanX) to be active.

Sequence diagram for a particular kernel crash scenario:

    user-space  ieee80211_register_hw()  ieee80211_tasklet_handler()
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
       |                    |                 |
       |<---phy0----wiphy_register()          |
       |-----iwd if_add---->|                 |
       |                    |<---IRQ----(RX packet)
       |              Kernel crash            |
       |              due to unallocated      |
       |              workqueue.              |
       |                    |                 |
       |       alloc_ordered_workqueue()      |
       |                    |                 |
       |              Misc wiphy init.        |
       |                    |                 |
       |            ieee80211_if_add()        |
       |                    |                 |

As evident from above sequence diagram, this race condition isn't specific
to a particular wifi driver but rather the initialization sequence in
ieee80211_register_hw() needs to be fixed. So re-order the initialization
sequence and the updated sequence diagram would look like:

    user-space  ieee80211_register_hw()  ieee80211_tasklet_handler()
    ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
       |                    |                 |
       |       alloc_ordered_workqueue()      |
       |                    |                 |
       |              Misc wiphy init.        |
       |                    |                 |
       |<---phy0----wiphy_register()          |
       |-----iwd if_add---->|                 |
       |                    |<---IRQ----(RX packet)
       |                    |                 |
       |            ieee80211_if_add()        |
       |                    |                 |

Cc: stable@vger.kernel.org
Signed-off-by: Sumit Garg <sumit.garg@linaro.org>
Link: https://lore.kernel.org/r/1586254255-28713-1-git-send-email-sumit.garg@linaro.org
[Johannes: fix rtnl imbalances]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-15 09:31:58 +02:00
Andrew Lunn
3be98b2d5f net: dsa: Down cpu/dsa ports phylink will control
DSA and CPU ports can be configured in two ways. By default, the
driver should configure such ports to there maximum bandwidth. For
most use cases, this is sufficient. When this default is insufficient,
a phylink instance can be bound to such ports, and phylink will
configure the port, e.g. based on fixed-link properties. phylink
assumes the port is initially down. Given that the driver should have
already configured it to its maximum speed, ask the driver to down
the port before instantiating the phylink instance.

Fixes: 30c4a5b0aa ("net: mv88e6xxx: use resolved link config in mac_link_up()")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-14 16:33:26 -07:00
David Howells
0e631eee17 rxrpc: Fix DATA Tx to disable nofrag for UDP on AF_INET6 socket
Fix the DATA packet transmission to disable nofrag for UDPv4 on an AF_INET6
socket as well as UDPv6 when trying to transmit fragmentably.

Without this, packets filled to the normal size used by the kernel AFS
client of 1412 bytes be rejected by udp_sendmsg() with EMSGSIZE
immediately.  The ->sk_error_report() notification hook is called, but
rxrpc doesn't generate a trace for it.

This is a temporary fix; a more permanent solution needs to involve
changing the size of the packets being filled in accordance with the MTU,
which isn't currently done in AF_RXRPC.  The reason for not doing so was
that, barring the last packet in an rx jumbo packet, jumbos can only be
assembled out of 1412-byte packets - and the plan was to construct jumbos
on the fly at transmission time.

Also, there's no point turning on IPV6_MTU_DISCOVER, since IPv6 has to
engage in this anyway since fragmentation is only done by the sender.  We
can then condense the switch-statement in rxrpc_send_data_packet().

Fixes: 75b54cb57c ("rxrpc: Add IPv6 support")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-14 16:26:47 -07:00
Roi Dayan
bc8e71314e netfilter: flowtable: Free block_cb when being deleted
Free block_cb memory when asked to be deleted.

Fixes: 978703f425 ("netfilter: flowtable: Add API for registering to flow table events")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-15 00:55:33 +02:00
David Ahern
dfa74909cb xdp: Reset prog in dev_change_xdp_fd when fd is negative
The commit mentioned in the Fixes tag reuses the local prog variable
when looking up an expected_fd. The variable is not reset when fd < 0
causing a detach with the expected_fd set to actually call
dev_xdp_install for the existing program. The end result is that the
detach does not happen.

Fixes: 92234c8f15 ("xdp: Support specifying expected existing program when attaching XDP")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200412133204.43847-1-dsahern@kernel.org
2020-04-14 21:38:02 +02:00
Jason Gunthorpe
1587982e70 RDMA: Remove a few extra calls to ib_get_client_data()
These four places already have easy access to the client data, just use
that instead.

Link: https://lore.kernel.org/r/0-v1-fae83f600b4a+68-less_get_client_data%25jgg@mellanox.com
Acked-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2020-04-14 16:05:08 -03:00
Johannes Berg
0e012b4e4b nl80211: fix NL80211_ATTR_FTM_RESPONDER policy
The nested policy here should be established using the
NLA_POLICY_NESTED() macro so the length is properly
filled in.

Cc: stable@vger.kernel.org
Fixes: 81e54d08d9 ("cfg80211: support FTM responder configuration/statistics")
Link: https://lore.kernel.org/r/20200412004029.9d0722bb56c8.Ie690bfcc4a1a61ff8d8ca7e475d59fcaa52fb2da@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-14 12:28:48 +02:00
Yihao Wu
43e33924c3 SUNRPC/cache: Fix unsafe traverse caused double-free in cache_purge
Deleting list entry within hlist_for_each_entry_safe is not safe unless
next pointer (tmp) is protected too. It's not, because once hash_lock
is released, cache_clean may delete the entry that tmp points to. Then
cache_purge can walk to a deleted entry and tries to double free it.

Fix this bug by holding only the deleted entry's reference.

Suggested-by: NeilBrown <neilb@suse.de>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Reviewed-by: NeilBrown <neilb@suse.de>
[ cel: removed unused variable ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-04-13 10:28:21 -04:00
Florian Westphal
e154659ba3 mptcp: fix double-unlock in mptcp_poll
mptcp_connect/28740 is trying to release lock (sk_lock-AF_INET) at:
[<ffffffff82c15869>] mptcp_poll+0xb9/0x550
but there are no more locks to release!
Call Trace:
 lock_release+0x50f/0x750
 release_sock+0x171/0x1b0
 mptcp_poll+0xb9/0x550
 sock_poll+0x157/0x470
 ? get_net_ns+0xb0/0xb0
 do_sys_poll+0x63c/0xdd0

Problem is that __mptcp_tcp_fallback() releases the mptcp socket lock,
but after recent change it doesn't do this in all of its return paths.

To fix this, remove the unlock from __mptcp_tcp_fallback() and
always do the unlock in the caller.

Also add a small comment as to why we have this
__mptcp_needs_tcp_fallback().

Fixes: 0b4f33def7 ("mptcp: fix tcp fallback crash")
Reported-by: syzbot+e56606435b7bfeea8cf5@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-12 21:04:08 -07:00
Linus Torvalds
b753101a4a Kbuild updates for v5.7 (2nd)
- raise minimum supported binutils version to 2.23
 
  - remove old CONFIG_AS_* macros that we know binutils >= 2.23 supports
 
  - move remaining CONFIG_AS_* tests to Kconfig from Makefile
 
  - enable -Wtautological-compare warnings to catch more issues
 
  - do not support GCC plugins for GCC <= 4.7
 
  - fix various breakages of 'make xconfig'
 
  - include the linker version used for linking the kernel into
    LINUX_COMPILER, which is used for the banner, and also exposed to
    /proc/version
 
  - link lib-y objects to vmlinux forcibly when CONFIG_MODULES=y,
    which allows us to remove the lib-ksyms.o workaround, and to
    solve the last known issue of the LLVM linker
 
  - add dummy tools in scripts/dummy-tools/ to enable all compiler
    tests in Kconfig, which will be useful for distro maintainers
 
  - support the single switch, LLVM=1 to use Clang and all LLVM utilities
    instead of GCC and Binutils.
 
  - support LLVM_IAS=1 to enable the integrated assembler, which is still
    experimental
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl6RNqEVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGZPEP/3affmzIWJuKGF1RErOHK3KCe/uX
 PmLjoRZ7im7V+J4b3W+p+re6BOXIXhW+rtKoP/Ijuys9g80WeeAb2nB4h0ESOtff
 3NgN97v28mh4tVtbluJambFDXItei+UwDp1sgg2sZ7ehaSBVny9hgNmPRn5YcyoS
 O3Juy85q70l8awBWThjEHgSxEw2Rzh9PLE6YmMt40rHTxVEDjMOPSuBlp/+TWj3X
 ugF/wInp+J5mCAKCwJI4L6PavdwIwf9hg3Cv/DpoOw60TxwH+7Rq6RueDKBgHhe3
 UEPHrXyPCsF/JQwwSFxN7k481RV2PjkXFwA3U5vH+3WIRb4ETX0+fmBIrLPSAX4z
 6rZiEvdrGS4TVvW2i8mrkJUrLPHNyQ90q/FU0V18A1k77Cv7mWJjSebTAVYNvz/v
 f/DxApaepwprdtHcNYJMN/TVnwxNexJK+U+bkuXsmDggvZYCxwLQUjtI3Sab1Rv9
 C6Y8WgqKx8yP6NbqVtUMkwXdEhBiHgybVxkl9hseUEbhUElIViuq5rlrHa0FVt2Q
 w4orgFXOd7k5iuDr7ka+wa3p20KLQQuB+vwLaCpi35+4vepQ7P0i2tFNwSclo7lO
 +iNy4Bq20W0/cmQeUJIzctJGibwro1I3HPN1UJ7gp0fZ2WVGzV0SKpwQ0tLOVuuU
 y9yPsL1ciDpKQKMh
 =jpyF
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull more Kbuild updates from Masahiro Yamada:

 - raise minimum supported binutils version to 2.23

 - remove old CONFIG_AS_* macros that we know binutils >= 2.23 supports

 - move remaining CONFIG_AS_* tests to Kconfig from Makefile

 - enable -Wtautological-compare warnings to catch more issues

 - do not support GCC plugins for GCC <= 4.7

 - fix various breakages of 'make xconfig'

 - include the linker version used for linking the kernel into
   LINUX_COMPILER, which is used for the banner, and also exposed to
   /proc/version

 - link lib-y objects to vmlinux forcibly when CONFIG_MODULES=y, which
   allows us to remove the lib-ksyms.o workaround, and to solve the last
   known issue of the LLVM linker

 - add dummy tools in scripts/dummy-tools/ to enable all compiler tests
   in Kconfig, which will be useful for distro maintainers

 - support the single switch, LLVM=1 to use Clang and all LLVM utilities
   instead of GCC and Binutils.

 - support LLVM_IAS=1 to enable the integrated assembler, which is still
   experimental

* tag 'kbuild-v5.7-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (36 commits)
  kbuild: fix comment about missing include guard detection
  kbuild: support LLVM=1 to switch the default tools to Clang/LLVM
  kbuild: replace AS=clang with LLVM_IAS=1
  kbuild: add dummy toolchains to enable all cc-option etc. in Kconfig
  kbuild: link lib-y objects to vmlinux forcibly when CONFIG_MODULES=y
  MIPS: fw: arc: add __weak to prom_meminit and prom_free_prom_memory
  kbuild: remove -I$(srctree)/tools/include from scripts/Makefile
  kbuild: do not pass $(KBUILD_CFLAGS) to scripts/mkcompile_h
  Documentation/llvm: fix the name of llvm-size
  kbuild: mkcompile_h: Include $LD version in /proc/version
  kconfig: qconf: Fix a few alignment issues
  kconfig: qconf: remove some old bogus TODOs
  kconfig: qconf: fix support for the split view mode
  kconfig: qconf: fix the content of the main widget
  kconfig: qconf: Change title for the item window
  kconfig: qconf: clean deprecated warnings
  gcc-plugins: drop support for GCC <= 4.7
  kbuild: Enable -Wtautological-compare
  x86: update AS_* macros to binutils >=2.23, supporting ADX and AVX2
  crypto: x86 - clean up poly1305-x86_64-cryptogams.S by 'make clean'
  ...
2020-04-11 09:46:12 -07:00
David S. Miller
40fc7ad2c8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-04-10

The following pull-request contains BPF updates for your *net* tree.

We've added 13 non-merge commits during the last 7 day(s) which contain
a total of 13 files changed, 137 insertions(+), 43 deletions(-).

The main changes are:

1) JIT code emission fixes for riscv and arm32, from Luke Nelson and Xi Wang.

2) Disable vmlinux BTF info if GCC_PLUGIN_RANDSTRUCT is used, from Slava Bacherikov.

3) Fix oob write in AF_XDP when meta data is used, from Li RongQing.

4) Fix bpf_get_link_xdp_id() handling on single prog when flags are specified,
   from Andrey Ignatov.

5) Fix sk_assign() BPF helper for request sockets that can have sk_reuseport
   field uninitialized, from Joe Stringer.

6) Fix mprotect() test case for the BPF LSM, from KP Singh.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-09 17:39:22 -07:00
Daniels Umanovskis
943d5d92c5 Bluetooth: log advertisement packet length if it gets corrected
The error could indicate a problem with the Bluetooth device. It
is easier to investigate if the packet's actual length gets logged,
not just the fact that a discrepancy occurred.

Signed-off-by: Daniels Umanovskis <du@axentia.se>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-04-09 19:57:28 +02:00
Taras Chornyi
690cc86321 net: ipv4: devinet: Fix crash when add/del multicast IP with autojoin
When CONFIG_IP_MULTICAST is not set and multicast ip is added to the device
with autojoin flag or when multicast ip is deleted kernel will crash.

steps to reproduce:

ip addr add 224.0.0.0/32 dev eth0
ip addr del 224.0.0.0/32 dev eth0

or

ip addr add 224.0.0.0/32 dev eth0 autojoin

Unable to handle kernel NULL pointer dereference at virtual address 0000000000000088
 pc : _raw_write_lock_irqsave+0x1e0/0x2ac
 lr : lock_sock_nested+0x1c/0x60
 Call trace:
  _raw_write_lock_irqsave+0x1e0/0x2ac
  lock_sock_nested+0x1c/0x60
  ip_mc_config.isra.28+0x50/0xe0
  inet_rtm_deladdr+0x1a8/0x1f0
  rtnetlink_rcv_msg+0x120/0x350
  netlink_rcv_skb+0x58/0x120
  rtnetlink_rcv+0x14/0x20
  netlink_unicast+0x1b8/0x270
  netlink_sendmsg+0x1a0/0x3b0
  ____sys_sendmsg+0x248/0x290
  ___sys_sendmsg+0x80/0xc0
  __sys_sendmsg+0x68/0xc0
  __arm64_sys_sendmsg+0x20/0x30
  el0_svc_common.constprop.2+0x88/0x150
  do_el0_svc+0x20/0x80
 el0_sync_handler+0x118/0x190
  el0_sync+0x140/0x180

Fixes: 93a714d6b5 ("multicast: Extend ip address command to enable multicast group join/leave on")
Signed-off-by: Taras Chornyi <taras.chornyi@plvision.eu>
Signed-off-by: Vadym Kochan <vadym.kochan@plvision.eu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-09 10:27:23 -07:00
Ka-Cheong Poon
2fabef4f65 net/rds: Fix MR reference counting problem
In rds_free_mr(), it calls rds_destroy_mr(mr) directly.  But this
defeats the purpose of reference counting and makes MR free handling
impossible.  It means that holding a reference does not guarantee that
it is safe to access some fields.  For example, In
rds_cmsg_rdma_dest(), it increases the ref count, unlocks and then
calls mr->r_trans->sync_mr().  But if rds_free_mr() (and
rds_destroy_mr()) is called in between (there is no lock preventing
this to happen), r_trans_private is set to NULL, causing a panic.
Similar issue is in rds_rdma_unuse().

Reported-by: zerons <sironhide0null@gmail.com>
Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-09 10:22:00 -07:00
Ka-Cheong Poon
e228a5d05e net/rds: Replace struct rds_mr's r_refcount with struct kref
And removed rds_mr_put().

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-09 10:22:00 -07:00
Colin Ian King
5f0224a6dd net-sysfs: remove redundant assignment to variable ret
The variable ret is being initialized with a value that is never read
and it is being updated later with a new value.  The initialization is
redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-09 10:14:53 -07:00
Wang Wenhu
6dbf02acef net: qrtr: send msgs from local of same id as broadcast
If the local node id(qrtr_local_nid) is not modified after its
initialization, it equals to the broadcast node id(QRTR_NODE_BCAST).
So the messages from local node should not be taken as broadcast
and keep the process going to send them out anyway.

The definitions are as follow:
static unsigned int qrtr_local_nid = NUMA_NO_NODE;

Fixes: fdf5fd3975 ("net: qrtr: Broadcast messages only from control port")
Signed-off-by: Wang Wenhu <wenhu.wang@vivo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-09 10:08:31 -07:00
Linus Torvalds
fcc95f0640 The main items are:
- support for asynchronous create and unlink (Jeff Layton).  Creates
   and unlinks are satisfied locally, without waiting for a reply from
   the MDS, provided the client has been granted appropriate caps (new
   in v15.y.z ("Octopus") release).  This can be a big help for metadata
   heavy workloads such as tar and rsync.  Opt-in with the new nowsync
   mount option.
 
 - multiple blk-mq queues for rbd (Hannes Reinecke and myself).  When
   the driver was converted to blk-mq, we settled on a single blk-mq
   queue because of a global lock in libceph and some other technical
   debt.  These have since been addressed, so allocate a queue per CPU
   to enhance parallelism.
 
 - don't hold onto caps that aren't actually needed (Zheng Yan).  This
   has been our long-standing behavior, but it causes issues with some
   active/standby applications (synchronous I/O, stalls if the standby
   goes down, etc).
 
 - .snap directory timestamps consistent with ceph-fuse (Luis Henriques)
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl6OEO4THGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi0XNB/wItYipkjlL5fIUBqiRWzYai72DWdPp
 CnOZo8LB+O0MQDomPT6DdpU1OMlWZi5HF7zklrZ35LTm21UkRNC9zvccjs9l66PJ
 qo9cKJbxxju+hgzIvgvK9PjlDlaiFAc/pkF8lZ/NaOnSsM1vvsFL9IuY2LXS38MY
 A/uUTZNUnFy5udam8TPuN+gWwZcUIH48lRWQLWe2I/hNJSweX1l8OHvecOBg+cYH
 G+8vb7mLU2V9ky0YT5JJmVxUV3CWA5wH6ZrWWy1ofVDdeSFLPrhgWX6IMjaNq+Gd
 xPfxmly47uBviSqON9dMkiThgy0Qj7yi0Pvx+1sAZbD7aj/6A4qg3LX5
 =GIX0
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.7-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The main items are:

   - support for asynchronous create and unlink (Jeff Layton).

     Creates and unlinks are satisfied locally, without waiting for a
     reply from the MDS, provided the client has been granted
     appropriate caps (new in v15.y.z ("Octopus") release). This can be
     a big help for metadata heavy workloads such as tar and rsync.
     Opt-in with the new nowsync mount option.

   - multiple blk-mq queues for rbd (Hannes Reinecke and myself).

     When the driver was converted to blk-mq, we settled on a single
     blk-mq queue because of a global lock in libceph and some other
     technical debt. These have since been addressed, so allocate a
     queue per CPU to enhance parallelism.

   - don't hold onto caps that aren't actually needed (Zheng Yan).

     This has been our long-standing behavior, but it causes issues with
     some active/standby applications (synchronous I/O, stalls if the
     standby goes down, etc).

   - .snap directory timestamps consistent with ceph-fuse (Luis
     Henriques)"

* tag 'ceph-for-5.7-rc1' of git://github.com/ceph/ceph-client: (49 commits)
  ceph: fix snapshot directory timestamps
  ceph: wait for async creating inode before requesting new max size
  ceph: don't skip updating wanted caps when cap is stale
  ceph: request new max size only when there is auth cap
  ceph: cleanup return error of try_get_cap_refs()
  ceph: return ceph_mdsc_do_request() errors from __get_parent()
  ceph: check all mds' caps after page writeback
  ceph: update i_requested_max_size only when sending cap msg to auth mds
  ceph: simplify calling of ceph_get_fmode()
  ceph: remove delay check logic from ceph_check_caps()
  ceph: consider inode's last read/write when calculating wanted caps
  ceph: always renew caps if mds_wanted is insufficient
  ceph: update dentry lease for async create
  ceph: attempt to do async create when possible
  ceph: cache layout in parent dir on first sync create
  ceph: add new MDS req field to hold delegated inode number
  ceph: decode interval_sets for delegated inos
  ceph: make ceph_fill_inode non-static
  ceph: perform asynchronous unlink if we have sufficient caps
  ceph: don't take refs to want mask unless we have all bits
  ...
2020-04-08 21:44:05 -07:00
Joe Stringer
8e368dc72e bpf: Fix use of sk->sk_reuseport from sk_assign
In testing, we found that for request sockets the sk->sk_reuseport field
may yet be uninitialized, which caused bpf_sk_assign() to randomly
succeed or return -ESOCKTNOSUPPORT when handling the forward ACK in a
three-way handshake.

Fix it by only applying the reuseport check for full sockets.

Fixes: cf7fbe660f ("bpf: Add socket assign support")
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200408033540.10339-1-joe@wand.net.nz
2020-04-09 01:02:14 +02:00
Arnd Bergmann
f691a25ce5 net/tls: fix const assignment warning
Building with some experimental patches, I came across a warning
in the tls code:

include/linux/compiler.h:215:30: warning: assignment discards 'const' qualifier from pointer target type [-Wdiscarded-qualifiers]
  215 |  *(volatile typeof(x) *)&(x) = (val);  \
      |                              ^
net/tls/tls_main.c:650:4: note: in expansion of macro 'smp_store_release'
  650 |    smp_store_release(&saved_tcpv4_prot, prot);

This appears to be a legitimate warning about assigning a const pointer
into the non-const 'saved_tcpv4_prot' global. Annotate both the ipv4 and
ipv6 pointers 'const' to make the code internally consistent.

Fixes: 5bb4c45d46 ("net/tls: Read sk_prot once when building tls proto ops")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-08 14:34:02 -07:00
Michael Weiß
2abe05234f l2tp: Allow management of tunnels and session in user namespace
Creation and management of L2TPv3 tunnels and session through netlink
requires CAP_NET_ADMIN. However, a process with CAP_NET_ADMIN in a
non-initial user namespace gets an EPERM due to the use of the
genetlink GENL_ADMIN_PERM flag. Thus, management of L2TP VPNs inside
an unprivileged container won't work.

We replaced the GENL_ADMIN_PERM by the GENL_UNS_ADMIN_PERM flag
similar to other network modules which also had this problem, e.g.,
openvswitch (commit 4a92602aa1 "openvswitch: allow management from
inside user namespaces") and nl80211 (commit 5617c6cd6f "nl80211:
Allow privileged operations from user namespaces").

I tested this in the container runtime trustm3 (trustm3.github.io)
and was able to create l2tp tunnels and sessions in unpriviliged
(user namespaced) containers using a private network namespace.
For other runtimes such as docker or lxc this should work, too.

Signed-off-by: Michael Weiß <michael.weiss@aisec.fraunhofer.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-08 14:30:46 -07:00
Sonny Sasaka
ffee202a78 Bluetooth: Always request for user confirmation for Just Works (LE SC)
To improve security, always give the user-space daemon a chance to
accept or reject a Just Works pairing (LE). The daemon may decide to
auto-accept based on the user's intent.

This patch is similar to the previous patch but applies for LE Secure
Connections (SC).

Signed-off-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-04-08 22:18:32 +02:00
Jason A. Donenfeld
e6abef610c x86: update AS_* macros to binutils >=2.23, supporting ADX and AVX2
Now that the kernel specifies binutils 2.23 as the minimum version, we
can remove ifdefs for AVX2 and ADX throughout.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-04-09 00:12:48 +09:00
Jason A. Donenfeld
5e8ebd841a x86: probe assembler capabilities via kconfig instead of makefile
Doing this probing inside of the Makefiles means we have a maze of
ifdefs inside the source code and child Makefiles that need to make
proper decisions on this too. Instead, we do it at Kconfig time, like
many other compiler and assembler options, which allows us to set up the
dependencies normally for full compilation units. In the process, the
ADX test changes to use %eax instead of %r10 so that it's valid in both
32-bit and 64-bit mode.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-04-09 00:01:59 +09:00
Taehee Yoo
4faab8c446 hsr: check protocol version in hsr_newlink()
In the current hsr code, only 0 and 1 protocol versions are valid.
But current hsr code doesn't check the version, which is received by
userspace.

Test commands:
    ip link add dummy0 type dummy
    ip link add dummy1 type dummy
    ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1 version 4

In the test commands, version 4 is invalid.
So, the command should be failed.

After this patch, following error will occur.
"Error: hsr: Only versions 0..1 are supported."

Fixes: ee1c279772 ("net/hsr: Added support for HSR v1")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-07 18:34:18 -07:00
Paul Blakey
a080da6ac7 net: sched: Fix setting last executed chain on skb extension
After driver sets the missed chain on the tc skb extension it is
consumed (deleted) by tc_classify_ingress and tc jumps to that chain.
If tc now misses on this chain (either no match, or no goto action),
then last executed chain remains 0, and the skb extension is not re-added,
and the next datapath (ovs) will start from 0.

Fix that by setting last executed chain to the chain read from the skb
extension, so if there is a miss, we set it back.

Fixes: af699626ee ("net: sched: Support specifying a starting chain via tc skb ext")
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-07 18:32:08 -07:00
Konstantin Khlebnikov
a4837980fd net: revert default NAPI poll timeout to 2 jiffies
For HZ < 1000 timeout 2000us rounds up to 1 jiffy but expires randomly
because next timer interrupt could come shortly after starting softirq.

For commonly used CONFIG_HZ=1000 nothing changes.

Fixes: 7acf8a1e8a ("Replace 2 jiffies with sysctl netdev_budget_usecs to enable softirq tuning")
Reported-by: Dmitry Yakunin <zeil@yandex-team.ru>
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-07 18:30:56 -07:00
Tim Stallard
b93cfb9cd3 net: icmp6: do not select saddr from iif when route has prefsrc set
Since commit fac6fce9bd ("net: icmp6: provide input address for
traceroute6") ICMPv6 errors have source addresses from the ingress
interface. However, this overrides when source address selection is
influenced by setting preferred source addresses on routes.

This can result in ICMP errors being lost to upstream BCP38 filters
when the wrong source addresses are used, breaking path MTU discovery
and traceroute.

This patch sets the modified source address selection to only take place
when the route used has no prefsrc set.

It can be tested with:

ip link add v1 type veth peer name v2
ip netns add test
ip netns exec test ip link set lo up
ip link set v2 netns test
ip link set v1 up
ip netns exec test ip link set v2 up
ip addr add 2001:db8::1/64 dev v1 nodad
ip addr add 2001:db8::3 dev v1 nodad
ip netns exec test ip addr add 2001:db8::2/64 dev v2 nodad
ip netns exec test ip route add unreachable 2001:db8:1::1
ip netns exec test ip addr add 2001:db8:100::1 dev lo
ip netns exec test ip route add 2001:db8::1 dev v2 src 2001:db8:100::1
ip route add 2001:db8:1000::1 via 2001:db8::2
traceroute6 -s 2001:db8::1 2001:db8:1000::1
traceroute6 -s 2001:db8::3 2001:db8:1000::1
ip netns delete test

Output before:
$ traceroute6 -s 2001:db8::1 2001:db8:1000::1
traceroute to 2001:db8:1000::1 (2001:db8:1000::1), 30 hops max, 80 byte packets
 1  2001:db8::2 (2001:db8::2)  0.843 ms !N  0.396 ms !N  0.257 ms !N
$ traceroute6 -s 2001:db8::3 2001:db8:1000::1
traceroute to 2001:db8:1000::1 (2001:db8:1000::1), 30 hops max, 80 byte packets
 1  2001:db8::2 (2001:db8::2)  0.772 ms !N  0.257 ms !N  0.357 ms !N

After:
$ traceroute6 -s 2001:db8::1 2001:db8:1000::1
traceroute to 2001:db8:1000::1 (2001:db8:1000::1), 30 hops max, 80 byte packets
 1  2001:db8:100::1 (2001:db8:100::1)  8.885 ms !N  0.310 ms !N  0.174 ms !N
$ traceroute6 -s 2001:db8::3 2001:db8:1000::1
traceroute to 2001:db8:1000::1 (2001:db8:1000::1), 30 hops max, 80 byte packets
 1  2001:db8::2 (2001:db8::2)  1.403 ms !N  0.205 ms !N  0.313 ms !N

Fixes: fac6fce9bd ("net: icmp6: provide input address for traceroute6")
Signed-off-by: Tim Stallard <code@timstallard.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-07 18:25:10 -07:00
David S. Miller
c2c1128902 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net, they are:

1) Fix spurious overlap condition in the rbtree tree, from Stefano Brivio.

2) Fix possible uninitialized pointer dereference in nft_lookup.

3) IDLETIMER v1 target matches the Android layout, from
   Maciej Zenczykowski.

4) Dangling pointer in nf_tables_set_alloc_name, from Eric Dumazet.

5) Fix RCU warning splat in ipset find_set_type(), from Amol Grover.

6) Report EOPNOTSUPP on unsupported set flags and object types in sets.

7) Add NFT_SET_CONCAT flag to provide consistent error reporting
   when users defines set with ranges in concatenations in old kernels.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-07 18:08:06 -07:00
Linus Torvalds
04de788e61 NFS client updates for Linux 5.7
Highlights include:
 
 Stable fixes:
 - Fix a page leak in nfs_destroy_unlinked_subrequests()
 - Fix use-after-free issues in nfs_pageio_add_request()
 - Fix new mount code constant_table array definitions
 - finish_automount() requires us to hold 2 refs to the mount record
 
 Features:
 - Improve the accuracy of telldir/seekdir by using 64-bit cookies when
   possible.
 - Allow one RDMA active connection and several zombie connections to
   prevent blocking if the remote server is unresponsive.
 - Limit the size of the NFS access cache by default
 - Reduce the number of references to credentials that are taken by NFS
 - pNFS files and flexfiles drivers now support per-layout segment
   COMMIT lists.
 - Enable partial-file layout segments in the pNFS/flexfiles driver.
 - Add support for CB_RECALL_ANY to the pNFS flexfiles layout type
 - pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from
   the DS using the layouterror mechanism.
 
 Bugfixes and cleanups:
 - SUNRPC: Fix krb5p regressions
 - Don't specify NFS version in "UDP not supported" error
 - nfsroot: set tcp as the default transport protocol
 - pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid()
 - alloc_nfs_open_context() must use the file cred when available
 - Fix locking when dereferencing the delegation cred
 - Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails
 - Various clean ups of the NFS O_DIRECT commit code
 - Clean up RDMA connect/disconnect
 - Replace zero-length arrays with C99-style flexible arrays
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl6LhhsACgkQZwvnipYK
 APIOJxAAiQOgmIg1CV4mrlcVhkwy09N5JAia6AENtoTmwm08nAYg5Y8REb9uX46a
 /MJsM2WG8hBCgI6eYmRY8LTr4Ft9rTQEJM9DRMuwQREXwMWwBhUv/QakCeqY1lHE
 lyB1z4hj5XKeUoN/OcfALC/GXFFf56A0UyN05nMzeCkBTdd3+qu+hW8Ge1wkAXcr
 f0pyLbzdFZlJuTmI4tr8F93g9p3ezuFBuEroT7XPIVJylAdZVumHqnOnz/Mvb99x
 rNTsX2dc44GhSAfRnTzPumU3MT6BOLvUzNH1xzdiqKzJrbOnG8WjFodrGr3JWpfp
 HkeyYQxJ+Hnfb2LiZBjvMQE8M7kVMZ1jVbrGJEbCxfSqgTly8lOHboqAeKsFaReK
 LStnusizdA1LHQVZxPdvn+oL49RDxnzm9dY+DkrXK1qT0GE+icN1CyTyLLfkSCp8
 tYvZSJ/qPk5BNZegqH1nBqXkMDkOJ4eEA7+luXDmajRkdRrZ3IWY2M1DpMEoueJ2
 j/zoj/NFr1oErU4o7PV9oolA1Euhn1L3wIDuzsbVtjySmbXJNQTtaVVRFpGw3SsZ
 7rbqi4BB0SzOooNhQ4q8mLNi4qT7bl/3D04eL8UVzEM73plexhQ8XiOEz/VrIRX7
 L9viXH49g4DHQ0rZIaWefxFueqpgbNvQwnlLZl2uQotG9hwhTts=
 =YUcP
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Highlights include:

  Stable fixes:
   - Fix a page leak in nfs_destroy_unlinked_subrequests()

   - Fix use-after-free issues in nfs_pageio_add_request()

   - Fix new mount code constant_table array definitions

   - finish_automount() requires us to hold 2 refs to the mount record

  Features:
   - Improve the accuracy of telldir/seekdir by using 64-bit cookies
     when possible.

   - Allow one RDMA active connection and several zombie connections to
     prevent blocking if the remote server is unresponsive.

   - Limit the size of the NFS access cache by default

   - Reduce the number of references to credentials that are taken by
     NFS

   - pNFS files and flexfiles drivers now support per-layout segment
     COMMIT lists.

   - Enable partial-file layout segments in the pNFS/flexfiles driver.

   - Add support for CB_RECALL_ANY to the pNFS flexfiles layout type

   - pNFS/flexfiles Report NFS4ERR_DELAY and NFS4ERR_GRACE errors from
     the DS using the layouterror mechanism.

  Bugfixes and cleanups:
   - SUNRPC: Fix krb5p regressions

   - Don't specify NFS version in "UDP not supported" error

   - nfsroot: set tcp as the default transport protocol

   - pnfs: Return valid stateids in nfs_layout_find_inode_by_stateid()

   - alloc_nfs_open_context() must use the file cred when available

   - Fix locking when dereferencing the delegation cred

   - Fix memory leaks in O_DIRECT when nfs_get_lock_context() fails

   - Various clean ups of the NFS O_DIRECT commit code

   - Clean up RDMA connect/disconnect

   - Replace zero-length arrays with C99-style flexible arrays"

* tag 'nfs-for-5.7-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (86 commits)
  NFS: Clean up process of marking inode stale.
  SUNRPC: Don't start a timer on an already queued rpc task
  NFS/pnfs: Reference the layout cred in pnfs_prepare_layoutreturn()
  NFS/pnfs: Fix dereference of layout cred in pnfs_layoutcommit_inode()
  NFS: Beware when dereferencing the delegation cred
  NFS: Add a module parameter to set nfs_mountpoint_expiry_timeout
  NFS: finish_automount() requires us to hold 2 refs to the mount record
  NFS: Fix a few constant_table array definitions
  NFS: Try to join page groups before an O_DIRECT retransmission
  NFS: Refactor nfs_lock_and_join_requests()
  NFS: Reverse the submission order of requests in __nfs_pageio_add_request()
  NFS: Clean up nfs_lock_and_join_requests()
  NFS: Remove the redundant function nfs_pgio_has_mirroring()
  NFS: Fix memory leaks in nfs_pageio_stop_mirroring()
  NFS: Fix a request reference leak in nfs_direct_write_clear_reqs()
  NFS: Fix use-after-free issues in nfs_pageio_add_request()
  NFS: Fix races nfs_page_group_destroy() vs nfs_destroy_unlinked_subrequests()
  NFS: Fix a page leak in nfs_destroy_unlinked_subrequests()
  NFS: Remove unused FLUSH_SYNC support in nfs_initiate_pgio()
  pNFS/flexfiles: Specify the layout segment range in LAYOUTGET
  ...
2020-04-07 13:51:39 -07:00
Linus Torvalds
479a72c0c6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Slave bond and team devices should not be assigned ipv6 link local
    addresses, from Jarod Wilson.

 2) Fix clock sink config on some at803x PHY devices, from Oleksij
    Rempel.

 3) Uninitialized stack space transmitted in slcan frames, fix from
    Richard Palethorpe.

 4) Guard HW VLAN ops properly in stmmac driver, from Jose Abreu.

 5) "=" --> "|=" fix in aquantia driver, from Colin Ian King.

 6) Fix TCP fallback in mptcp, from Florian Westphal. (accessing a plain
    tcp_sk as if it were an mptcp socket).

 7) Fix cavium driver in some configurations wrt. PTP, from Yue Haibing.

 8) Make ipv6 and ipv4 consistent in the lower bound allowed for
    neighbour entry retrans_time, from Hangbin Liu.

 9) Don't use private workqueue in pegasus usb driver, from Petko
    Manolov.

10) Fix integer overflow in mlxsw, from Colin Ian King.

11) Missing refcnt init in cls_tcindex, from Cong Wang.

12) One too many loop iterations when processing cmpri entries in ipv6
    rpl code, from Alexander Aring.

13) Disable SG and TSO by default in r8169, from Heiner Kallweit.

14) NULL deref in macsec, from Davide Caratti.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (42 commits)
  macsec: fix NULL dereference in macsec_upd_offload()
  skbuff.h: Improve the checksum related comments
  net: dsa: bcm_sf2: Ensure correct sub-node is parsed
  qed: remove redundant assignment to variable 'rc'
  wimax: remove some redundant assignments to variable result
  mlxsw: spectrum_flower: Do not stop at FLOW_ACTION_VLAN_MANGLE
  mlxsw: spectrum_flower: Do not stop at FLOW_ACTION_PRIORITY
  r8169: change back SG and TSO to be disabled by default
  net: dsa: bcm_sf2: Do not register slave MDIO bus with OF
  ipv6: rpl: fix loop iteration
  tun: Don't put_page() for all negative return values from XDP program
  net: dsa: mt7530: fix null pointer dereferencing in port5 setup
  mptcp: add some missing pr_fmt defines
  net: phy: micrel: kszphy_resume(): add delay after genphy_resume() before accessing PHY registers
  net_sched: fix a missing refcnt in tcindex_init()
  net: stmmac: dwmac1000: fix out-of-bounds mac address reg setting
  mlxsw: spectrum_trap: fix unintention integer overflow on left shift
  pegasus: Remove pegasus' own workqueue
  neigh: support smaller retrans_time settting
  net: openvswitch: use hlist_for_each_entry_rcu instead of hlist_for_each_entry
  ...
2020-04-07 12:03:32 -07:00
Guenter Roeck
d1d900f822 Bluetooth: Simplify / fix return values from tk_request
Some static checker run by 0day reports a variableScope warning.

net/bluetooth/smp.c:870:6: warning:
	The scope of the variable 'err' can be reduced. [variableScope]

There is no need for two separate variables holding return values.
Stick with the existing variable. While at it, don't pre-initialize
'ret' because it is set in each code path.

tk_request() is supposed to return a negative error code on errors,
not a bluetooth return code. The calling code converts the return
value to SMP_UNSPECIFIED if needed.

Fixes: 92516cd97f ("Bluetooth: Always request for user confirmation for Just Works")
Cc: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-04-07 18:34:53 +02:00
Archie Pusaka
c2aa30db74 Bluetooth: debugfs option to unset MITM flag
The BT qualification test SM/MAS/PKE/BV-01-C needs us to turn off
the MITM flag when pairing, and at the same time also set the io
capability to something other than no input no output.

Currently the MITM flag is only unset when the io capability is set
to no input no output, therefore the test cannot be executed.

This patch introduces a debugfs option to force MITM flag to be
turned off.

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-04-07 18:32:21 +02:00
Pablo Neira Ayuso
ef516e8625 netfilter: nf_tables: reintroduce the NFT_SET_CONCAT flag
Stefano originally proposed to introduce this flag, users hit EOPNOTSUPP
in new binaries with old kernels when defining a set with ranges in
a concatenation.

Fixes: f3a2181e16 ("netfilter: nf_tables: Support for sets with multiple ranged fields")
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-07 18:23:04 +02:00
Pablo Neira Ayuso
d9583cdf2f netfilter: nf_tables: report EOPNOTSUPP on unsupported flags/object type
EINVAL should be used for malformed netlink messages. New userspace
utility and old kernels might easily result in EINVAL when exercising
new set features, which is misleading.

Fixes: 8aeff920dc ("netfilter: nf_tables: add stateful object reference to set elements")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-07 18:22:46 +02:00
Li RongQing
db5c97f023 xsk: Fix out of boundary write in __xsk_rcv_memcpy
first_len is the remainder of the first page we're copying.
If this size is larger, then out of page boundary write will
otherwise happen.

Fixes: c05cd36458 ("xsk: add support to allow unaligned chunk placement")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1585813930-19712-1-git-send-email-lirongqing@baidu.com
2020-04-06 21:48:05 +02:00
Alexander Aring
a7f9a6f4cc ipv6: rpl: fix loop iteration
This patch fix the loop iteration by not walking over the last
iteration. The cmpri compressing value exempt the last segment. As the
code shows the last iteration will be overwritten by cmpre value
handling which is for the last segment.

I think this doesn't end in any bufferoverflows because we work on worst
case temporary buffer sizes but it ends in not best compression settings
in some cases.

Fixes: 8610c7c6e3 ("net: ipv6: add support for rpl sr exthdr")
Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-06 10:05:42 -07:00
Linus Torvalds
e14679b62d 9p pull request for inclusion in 5.7
- Fix read with O_NONBLOCK to allow incomplete read and return immediately
 - Rest is just cleanup (indent, unused field in struct, extra semicolon)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAl6LCyEACgkQq06b7GqY
 5nAAQg//ThE/rojrB74S8yaD3sJlfsT+BoMzGDeos9FGdsHpO8t1ISUiheAcUn/Y
 a6igfJxTefxEO5SWqvsGEousRl0sRTyLo8RuQ2CC7+zKN8Ou+eG+4s2QXfDxBD+p
 xJGJrsWD2SkQ98/hkxDY6d31BM9iQGvpvVp/c9qLwtvTlnvEMXll0JiACcnmiPRt
 z92bx8QkaPlt399qv7LltfHftxCHOMcieQlYAF3wDjLauiEdDWysiMn0NIC2izwY
 hPlsVCiV504yhvQFcXVnmbWVJe88THlBVVjuVAi9IwpU8D2+1UK1kM/yiyVkYml+
 U4JC7ZBQoqw2obSLPk5mZ6E7nGWfWnIBpt/B9dO7akP8pS9bv+uCjwg65tFEop5/
 z9nuyjoQT/dEoThveeoEoxpIbyd690n1vbhcTdGDFDX5nfglrxj0S6bNVW0k6e41
 9aPwWzAfU9Ip9GHwE6Ewf/y88mMI7rCP+o/pC8XkUUGiV+foFt3oO5wCcvGbxwgJ
 nnZ5oY+Ren/QXOmq/cN3RUWJvTRbc8TDBm6Fa0iGkk7d1OUBuRTWuBUXvxHMpO3A
 xHapDRddPcrZ9QQSJZGvMSYwG0ZUZ6MRL2qTU4X/3sIi0giApFkRMpmmo0HjvMAf
 PIFl2MG2Ok4A8yWQN98AMXy8vgs6ql6+Wb3W0b5dNFsAGEk7f9U=
 =wAe5
 -----END PGP SIGNATURE-----

Merge tag '9p-for-5.7' of git://github.com/martinetd/linux

Pull 9p updates from Dominique Martinet:
 "Not much new, but a few patches for this cycle:

   - Fix read with O_NONBLOCK to allow incomplete read and return
     immediately

   - Rest is just cleanup (indent, unused field in struct, extra
     semicolon)"

* tag '9p-for-5.7' of git://github.com/martinetd/linux:
  net/9p: remove unused p9_req_t aux field
  9p: read only once on O_NONBLOCK
  9pnet: allow making incomplete read requests
  9p: Remove unneeded semicolon
  9p: Fix Kconfig indentation
2020-04-06 08:46:59 -07:00
Amol Grover
b135fc0801 netfilter: ipset: Pass lockdep expression to RCU lists
ip_set_type_list is traversed using list_for_each_entry_rcu
outside an RCU read-side critical section but under the protection
of ip_set_type_mutex.

Hence, add corresponding lockdep expression to silence false-positive
warnings, and harden RCU lists.

Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-06 14:31:25 +02:00
Eric Dumazet
7fb6f78df7 netfilter: nf_tables: do not leave dangling pointer in nf_tables_set_alloc_name
If nf_tables_set_alloc_name() frees set->name, we better
clear set->name to avoid a future use-after-free or invalid-free.

BUG: KASAN: double-free or invalid-free in nf_tables_newset+0x1ed6/0x2560 net/netfilter/nf_tables_api.c:4148

CPU: 0 PID: 28233 Comm: syz-executor.0 Not tainted 5.6.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x188/0x20d lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0xd3/0x315 mm/kasan/report.c:374
 kasan_report_invalid_free+0x61/0xa0 mm/kasan/report.c:468
 __kasan_slab_free+0x129/0x140 mm/kasan/common.c:455
 __cache_free mm/slab.c:3426 [inline]
 kfree+0x109/0x2b0 mm/slab.c:3757
 nf_tables_newset+0x1ed6/0x2560 net/netfilter/nf_tables_api.c:4148
 nfnetlink_rcv_batch+0x83a/0x1610 net/netfilter/nfnetlink.c:433
 nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:543 [inline]
 nfnetlink_rcv+0x3af/0x420 net/netfilter/nfnetlink.c:561
 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
 netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329
 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 ____sys_sendmsg+0x6b9/0x7d0 net/socket.c:2345
 ___sys_sendmsg+0x100/0x170 net/socket.c:2399
 __sys_sendmsg+0xec/0x1b0 net/socket.c:2432
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x45c849
Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fe5ca21dc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007fe5ca21e6d4 RCX: 000000000045c849
RDX: 0000000000000000 RSI: 0000000020000c40 RDI: 0000000000000003
RBP: 000000000076bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 000000000000095b R14: 00000000004cc0e9 R15: 000000000076bf0c

Allocated by task 28233:
 save_stack+0x1b/0x80 mm/kasan/common.c:72
 set_track mm/kasan/common.c:80 [inline]
 __kasan_kmalloc mm/kasan/common.c:515 [inline]
 __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:488
 __do_kmalloc mm/slab.c:3656 [inline]
 __kmalloc_track_caller+0x159/0x790 mm/slab.c:3671
 kvasprintf+0xb5/0x150 lib/kasprintf.c:25
 kasprintf+0xbb/0xf0 lib/kasprintf.c:59
 nf_tables_set_alloc_name net/netfilter/nf_tables_api.c:3536 [inline]
 nf_tables_newset+0x1543/0x2560 net/netfilter/nf_tables_api.c:4088
 nfnetlink_rcv_batch+0x83a/0x1610 net/netfilter/nfnetlink.c:433
 nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:543 [inline]
 nfnetlink_rcv+0x3af/0x420 net/netfilter/nfnetlink.c:561
 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
 netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329
 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 ____sys_sendmsg+0x6b9/0x7d0 net/socket.c:2345
 ___sys_sendmsg+0x100/0x170 net/socket.c:2399
 __sys_sendmsg+0xec/0x1b0 net/socket.c:2432
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 28233:
 save_stack+0x1b/0x80 mm/kasan/common.c:72
 set_track mm/kasan/common.c:80 [inline]
 kasan_set_free_info mm/kasan/common.c:337 [inline]
 __kasan_slab_free+0xf7/0x140 mm/kasan/common.c:476
 __cache_free mm/slab.c:3426 [inline]
 kfree+0x109/0x2b0 mm/slab.c:3757
 nf_tables_set_alloc_name net/netfilter/nf_tables_api.c:3544 [inline]
 nf_tables_newset+0x1f73/0x2560 net/netfilter/nf_tables_api.c:4088
 nfnetlink_rcv_batch+0x83a/0x1610 net/netfilter/nfnetlink.c:433
 nfnetlink_rcv_skb_batch net/netfilter/nfnetlink.c:543 [inline]
 nfnetlink_rcv+0x3af/0x420 net/netfilter/nfnetlink.c:561
 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
 netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329
 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 ____sys_sendmsg+0x6b9/0x7d0 net/socket.c:2345
 ___sys_sendmsg+0x100/0x170 net/socket.c:2399
 __sys_sendmsg+0xec/0x1b0 net/socket.c:2432
 do_syscall_64+0xf6/0x7d0 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8880a6032d00
 which belongs to the cache kmalloc-32 of size 32
The buggy address is located 0 bytes inside of
 32-byte region [ffff8880a6032d00, ffff8880a6032d20)
The buggy address belongs to the page:
page:ffffea0002980c80 refcount:1 mapcount:0 mapping:ffff8880aa0001c0 index:0xffff8880a6032fc1
flags: 0xfffe0000000200(slab)
raw: 00fffe0000000200 ffffea0002a3be88 ffffea00029b1908 ffff8880aa0001c0
raw: ffff8880a6032fc1 ffff8880a6032000 000000010000003e 0000000000000000
page dumped because: kasan: bad access detected

Fixes: 65038428b2 ("netfilter: nf_tables: allow to specify stateful expression in set definition")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-05 23:26:38 +02:00
Maciej Żenczykowski
bc9fe6143d netfilter: xt_IDLETIMER: target v1 - match Android layout
Android has long had an extension to IDLETIMER to send netlink
messages to userspace, see:
  https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline/include/uapi/linux/netfilter/xt_IDLETIMER.h#42
Note: this is idletimer target rev 1, there is no rev 0 in
the Android common kernel sources, see registration at:
  https://android.googlesource.com/kernel/common/+/refs/heads/android-mainline/net/netfilter/xt_IDLETIMER.c#483

When we compare that to upstream's new idletimer target rev 1:
  https://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next.git/tree/include/uapi/linux/netfilter/xt_IDLETIMER.h#n46

We immediately notice that these two rev 1 structs are the
same size and layout, and that while timer_type and send_nl_msg
are differently named and serve a different purpose, they're
at the same offset.

This makes them impossible to tell apart - and thus one cannot
know in a mixed Android/vanilla environment whether one means
timer_type or send_nl_msg.

Since this is iptables/netfilter uapi it introduces a problem
between iptables (vanilla vs Android) userspace and kernel
(vanilla vs Android) if the two don't match each other.

Additionally when at some point in the future Android picks up
5.7+ it's not at all clear how to resolve the resulting merge
conflict.

Furthermore, since upgrading the kernel on old Android phones
is pretty much impossible there does not seem to be an easy way
out of this predicament.

The only thing I've been able to come up with is some super
disgusting kernel version >= 5.7 check in the iptables binary
to flip between different struct layouts.

By adding a dummy field to the vanilla Linux kernel header file
we can force the two structs to be compatible with each other.

Long term I think I would like to deprecate send_nl_msg out of
Android entirely, but I haven't quite been able to figure out
exactly how we depend on it.  It seems to be very similar to
sysfs notifications but with some extra info.

Currently it's actually always enabled whenever Android uses
the IDLETIMER target, so we could also probably entirely
remove it from the uapi in favour of just always enabling it,
but again we can't upgrade old kernels already in the field.

(Also note that this doesn't change the structure's size,
as it is simply fitting into the pre-existing padding, and
that since 5.7 hasn't been released yet, there's still time
to make this uapi visible change)

Cc: Manoj Basapathi <manojbm@codeaurora.org>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-05 23:26:37 +02:00
Pablo Neira Ayuso
a26c1e49c8 netfilter: nf_tables: do not update stateful expressions if lookup is inverted
Initialize set lookup matching element to NULL. Otherwise, the
NFT_LOOKUP_F_INV flag reverses the matching logic and it leads to
deference an uninitialized pointer to the matching element. Make sure
element data area and stateful expression are accessed if there is a
matching set element.

This patch undoes 24791b9aa1 ("netfilter: nft_set_bitmap: initialize set
element extension in lookups") which is not required anymore.

Fixes: 339706bc21 ("netfilter: nft_lookup: update element stateful expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-05 23:26:36 +02:00
Stefano Brivio
72239f2795 netfilter: nft_set_rbtree: Drop spurious condition for overlap detection on insertion
Case a1. for overlap detection in __nft_rbtree_insert() is not a valid
one: start-after-start is not needed to detect any type of interval
overlap and it actually results in a false positive if, while
descending the tree, this is the only step we hit after starting from
the root.

This introduced a regression, as reported by Pablo, in Python tests
cases ip/ip.t and ip/numgen.t:

  ip/ip.t: ERROR: line 124: add rule ip test-ip4 input ip hdrlength vmap { 0-4 : drop, 5 : accept, 6 : continue } counter: This rule should not have failed.
  ip/numgen.t: ERROR: line 7: add rule ip test-ip4 pre dnat to numgen inc mod 10 map { 0-5 : 192.168.10.100, 6-9 : 192.168.20.200}: This rule should not have failed.

Drop case a1. and renumber others, so that they are a bit clearer. In
order for these diagrams to be readily understandable, a bigger rework
is probably needed, such as an ASCII art of the actual rbtree (instead
of a flattened version).

Shell script test sets/0044interval_overlap_0 should cover all
possible cases for false negatives, so I consider that test case still
sufficient after this change.

v2: Fix comments for cases a3. and b3.

Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: 7c84d41416 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-05 23:26:36 +02:00
Marcel Holtmann
3679fe7d43 Bluetooth: Increment management interface revision
Increment the mgmt revision due to the recently added new commands.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-05 14:53:15 +03:00
Marcel Holtmann
bc292258c5 Bluetooth: Add support for reading security information
To allow userspace to make correcty security policy decision, the kernel
needs to export a few details of the supported security features and
encryption key size information. This command exports this information
and also allows future extensions if needed.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-05 14:53:13 +03:00
Marcel Holtmann
a479036041 Bluetooth: Add support for Read Local Simple Pairing Options
With the Read Local Simple Pairing Options command it is possible to
retrieve the support for max encryption key size supported by the
controller and also if the controller correctly verifies the ECDH public
key during pairing.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-05 14:53:11 +03:00
Miao-chen Chou
145373cb1b Bluetooth: Add framework for Microsoft vendor extension
Micrsoft defined a set for HCI vendor extensions. Check the following
link for details:

https://docs.microsoft.com/en-us/windows-hardware/drivers/bluetooth/microsoft-defined-bluetooth-hci-commands-and-events

This provides the basic framework to enable the extension and read its
supported features. Drivers still have to declare support for this
extension before it can be utilized by the host stack.

Signed-off-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-05 14:53:05 +03:00
Marcel Holtmann
3d2336042a Bluetooth: Move debugfs configuration above the selftests
This is just a cosmetic clean to move the selftests configuration option
to the bottom of the list of options.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-05 14:53:02 +03:00
Sathish Narsimman
1f8330ea16 Bluetooth: add support to notify using SCO air mode
notifying using HCI_NOTIFY_CONN_ADD for SCO connection is generic in
case of mSBC audio. To differntiate SCO air mode introducing
HCI_NOTIFY_ENABLE_SCO_CVSD and HCI_NOTIFY_ENABLE_SCO_TRANSP.

Signed-off-by: Sathish Narsimman <sathish.narasimman@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-05 14:52:38 +03:00
Trond Myklebust
1fab7dc477 SUNRPC: Don't start a timer on an already queued rpc task
Move the test for whether a task is already queued to prevent
corruption of the timer list in __rpc_sleep_on_priority_timeout().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-04-04 19:59:27 -04:00
Linus Torvalds
4c205c84e2 Keyrings fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl6AiV4ACgkQ+7dXa6fL
 C2uCDg/+NevZHjgevGpFS9WByBtzFXKbewSSO0VZqS8RcFcy7+jRdTkcP66HJkGD
 3JxfI4wQm2LOiaX8tHDlmWIPfp3G5Tnjae8peOEVBrCZ5WZMcD3CzquMv18kd8KK
 5iiFzsTYDCywP/BwHCJgVQjPNpSp9drNCL5T+oql0nWeEUVAmiVziTIZgM9cAiyj
 S/sfe76KxdNzaEEvpL5Mg/ieq1es/ssGA1jZ8jZI+YfN8mtBBH8KebckskKSgVTK
 OjBLQAEanCbq3UsEqSsvEqbBpK7JkQJPOE153VRr6Nq/0MDSniwZjqIYOkgeB9pR
 YStIrK9LyL/D0aMe8A52I7Ml1zuUUXb9zVAo3yLgubWPDYXvLs8n8Cgl8ZCQBFXy
 0t86rlSq9SooGf3M2ket8Gk/XgymcRNP9LHr6MUGEO23l2ELBoO8hsWvUTAHoKVx
 kn27S4YceW4+5UYfYm87ZQpUTbKPNATuBkts+QxiSrZMCnk82G6keA8JqNgKCe4d
 xvUJew/JEGx2J0T8vYiBfpB1zEbtYdluglYyyzHpl0wkafYGRj1tTvwRZwQNhxw5
 IFQvxEK2kVFTKcoLBGq909udb4QlQOfMAYal0u/4iCNiipNgxZawcWSfIMGN4qbG
 tPvvBL2ocRmycmoMS7EHwWUSWIzwggi7zosDsM8jfLHBshBLeAE=
 =vvN4
 -----END PGP SIGNATURE-----

Merge tag 'keys-fixes-20200329' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull keyrings fixes from David Howells:
 "Here's a couple of patches that fix a circular dependency between
  holding key->sem and mm->mmap_sem when reading data from a key.

  One potential issue is that a filesystem looking to use a key inside,
  say, ->readpages() could deadlock if the key being read is the key
  that's required and the buffer the key is being read into is on a page
  that needs to be fetched.

  The case actually detected is a bit more involved - with a filesystem
  calling request_key() and locking the target keyring for write - which
  could be being read"

* tag 'keys-fixes-20200329' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  KEYS: Avoid false positive ENOMEM error on key read
  KEYS: Don't write out to userspace while holding key semaphore
2020-04-04 12:24:47 -07:00
Linus Torvalds
b3d8e42282 Highlights:
- Fix EXCHANGE_ID response when NFSD runs in a container
 - A battery of new static trace points
 - Socket transports now use bio_vec to send Replies
 - NFS/RDMA now supports filesystems with no .splice_read method
 - Favor memcpy() over DMA mapping for small RPC/RDMA Replies
 - Add pre-requisites for supporting multiple Write chunks
 - Numerous minor fixes and clean-ups
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2.0.22 (GNU/Linux)
 
 iQIcBAABAgAGBQJegj9pAAoJEDNqszNvZn+XNGgP/RsRul/UGe70YoPS6AwxI+c1
 2JVni5LV83aVGSN1df/xRNdugWh4j8e8stBIJPCnWFzUERFvrzVeVyW0/dlIy37l
 SRL1L62EzFUejAL45O+CkF5+KI2kAWMgDCv+rPnFnIuXVa/sThx63F1AJikVMPjB
 7We3vd5Kh/CrMeMflebJYuY12xE6di2b3ifkZRO0/yuMaAuqJrDreYf4L6xpA4rC
 QnKQcNl7LGlOwGSI2WvDrCLE056PJFhTuzTawI80NKnkXMMFNc6/7NoXJqasVlHG
 fiki2mHbJrbYd8isIm3Vl/QkFsM8QjijtpVxC9gd151w0P7DfpMYmSzlZL7nvq/R
 Nt6IIqbaxWSS1VULsuS7rDtBwwZpW/LRWaUhEvMwimR2jeOxcwtlDVTX/dRH2mxq
 Ume64Hn8xMEhhx9tHCPQ+Rgjqv5m+ZEAvmV6B7RM9nT2z2MSzQQESeMB14VZZmF/
 2oH1dDCVdCmb4ZOcD5yxL6Y1hijn45s+YHdts9uIsCudKYPI906vPhogFC+PVJv+
 MrOiUf8d40H0ra8VAUFCjAceOulkv90aLhBjoHbPsP4SQOTsRuUXnsKESZpSHY72
 nT/uPM23ULv4kQ6tHB8yQ3ordjCBRgb4zIKtotc3Wpi7dhO8u6ptPj4soiflRShO
 8/3N5dYfqdt9FRyr7Z8/
 =o5G0
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.7' of git://git.linux-nfs.org/projects/cel/cel-2.6

Pull nfsd updates from Chuck Lever:

 - Fix EXCHANGE_ID response when NFSD runs in a container

 - A battery of new static trace points

 - Socket transports now use bio_vec to send Replies

 - NFS/RDMA now supports filesystems with no .splice_read method

 - Favor memcpy() over DMA mapping for small RPC/RDMA Replies

 - Add pre-requisites for supporting multiple Write chunks

 - Numerous minor fixes and clean-ups

[ Chuck is filling in for Bruce this time while he and his family settle
  into a new house ]

* tag 'nfsd-5.7' of git://git.linux-nfs.org/projects/cel/cel-2.6: (39 commits)
  svcrdma: Fix leak of transport addresses
  SUNRPC: Fix a potential buffer overflow in 'svc_print_xprts()'
  SUNRPC/cache: don't allow invalid entries to be flushed
  nfsd: fsnotify on rmdir under nfsd/clients/
  nfsd4: kill warnings on testing stateids with mismatched clientids
  nfsd: remove read permission bit for ctl sysctl
  NFSD: Fix NFS server build errors
  sunrpc: Add tracing for cache events
  SUNRPC/cache: Allow garbage collection of invalid cache entries
  nfsd: export upcalls must not return ESTALE when mountd is down
  nfsd: Add tracepoints for update of the expkey and export cache entries
  nfsd: Add tracepoints for exp_find_key() and exp_get_by_name()
  nfsd: Add tracing to nfsd_set_fh_dentry()
  nfsd: Don't add locks to closed or closing open stateids
  SUNRPC: Teach server to use xprt_sock_sendmsg for socket sends
  SUNRPC: Refactor xs_sendpages()
  svcrdma: Avoid DMA mapping small RPC Replies
  svcrdma: Fix double sync of transport header buffer
  svcrdma: Refactor chunk list encoders
  SUNRPC: Add encoders for list item discriminators
  ...
2020-04-04 11:13:51 -07:00
Geliang Tang
c85adced95 mptcp: add some missing pr_fmt defines
Some of the mptcp logs didn't print out the format string:

[  185.651493] DSS
[  185.651494] data_fin=0 dsn64=0 use_map=0 ack64=1 use_ack=1
[  185.651494] data_ack=13792750332298763796
[  185.651495] MPTCP: msk=00000000c4b81cfc ssk=000000009743af53 data_avail=0 skb=0000000063dc595d
[  185.651495] MPTCP: msk=00000000c4b81cfc ssk=000000009743af53 status=0
[  185.651495] MPTCP: msk ack_seq=9bbc894565aa2f9a subflow ack_seq=9bbc894565aa2f9a
[  185.651496] MPTCP: msk=00000000c4b81cfc ssk=000000009743af53 data_avail=1 skb=0000000012e809e1

So this patch added these missing pr_fmt defines. Then we can get the same
format string "MPTCP" in all mptcp logs like this:

[  142.795829] MPTCP: DSS
[  142.795829] MPTCP: data_fin=0 dsn64=0 use_map=0 ack64=1 use_ack=1
[  142.795829] MPTCP: data_ack=8089704603109242421
[  142.795830] MPTCP: msk=00000000133a24e0 ssk=000000002e508c64 data_avail=0 skb=00000000d5f230df
[  142.795830] MPTCP: msk=00000000133a24e0 ssk=000000002e508c64 status=0
[  142.795831] MPTCP: msk ack_seq=66790290f1199d9b subflow ack_seq=66790290f1199d9b
[  142.795831] MPTCP: msk=00000000133a24e0 ssk=000000002e508c64 data_avail=1 skb=00000000de5aca2e

Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-03 16:06:32 -07:00
Cong Wang
a8eab6d35e net_sched: fix a missing refcnt in tcindex_init()
The initial refcnt of struct tcindex_data should be 1,
it is clear that I forgot to set it to 1 in tcindex_init().
This leads to a dec-after-zero warning.

Reported-by: syzbot+8325e509a1bf83ec741d@syzkaller.appspotmail.com
Fixes: 304e024216 ("net_sched: add a temporary refcnt for struct tcindex_data")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-03 16:00:08 -07:00
Linus Torvalds
ff2ae607c6 SPDX patches for 5.7-rc1.
Here are 3 SPDX patches for 5.7-rc1.
 
 One fixes up the SPDX tag for a single driver, while the other two go
 through the tree and add SPDX tags for all of the .gitignore files as
 needed.
 
 Nothing too complex, but you will get a merge conflict with your current
 tree, that should be trivial to handle (one file modified by two things,
 one file deleted.)
 
 All 3 of these have been in linux-next for a while, with no reported
 issues other than the merge conflict.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXodg5A8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykySQCgy9YDrkz7nWq6v3Gohl6+lW/L+rMAnRM4uTZm
 m5AuCzO3Azt9KBi7NL+L
 =2Lm5
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx

Pull SPDX updates from Greg KH:
 "Here are three SPDX patches for 5.7-rc1.

  One fixes up the SPDX tag for a single driver, while the other two go
  through the tree and add SPDX tags for all of the .gitignore files as
  needed.

  Nothing too complex, but you will get a merge conflict with your
  current tree, that should be trivial to handle (one file modified by
  two things, one file deleted.)

  All three of these have been in linux-next for a while, with no
  reported issues other than the merge conflict"

* tag 'spdx-5.7-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx:
  ASoC: MT6660: make spdxcheck.py happy
  .gitignore: add SPDX License Identifier
  .gitignore: remove too obvious comments
2020-04-03 13:12:26 -07:00
Alain Michaud
1e5479be46 Bluetooth: fixing minor typo in comment
This changes a simple typo in hci_event.c

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-04-03 15:49:56 +02:00
Abhishek Pandit-Subedi
7fedd3bb6b Bluetooth: Prioritize SCO traffic
When scheduling TX packets, send all SCO/eSCO packets first, check for
pending SCO/eSCO packets after every ACL/LE packet and send them if any
are pending.  This is done to make sure that we can meet SCO deadlines
on slow interfaces like UART.

If we were to queue up multiple ACL packets without checking for a SCO
packet, we might miss the SCO timing. For example:

The time it takes to send a maximum size ACL packet (1024 bytes):
t = 10/8 * 1024 bytes * 8 bits/byte * 1 packet / baudrate
        where 10/8 is uart overhead due to start/stop bits per byte

Replace t = 3.75ms (SCO deadline), which gives us a baudrate of 2730666.

At a baudrate of 3000000, if we didn't check for SCO packets within 1024
bytes, we would miss the 3.75ms timing window.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-04-03 08:47:09 +02:00
Hangbin Liu
19e16d220f neigh: support smaller retrans_time settting
Currently, we limited the retrans_time to be greater than HZ/2. i.e.
setting retrans_time less than 500ms will not work. This makes the user
unable to achieve a more accurate control for bonding arp fast failover.

Update the sanity check to HZ/100, which is 10ms, to let users have more
ability on the retrans_time control.

v3: sync the behavior with IPv6 and update all the timer handler
v2: use HZ instead of hard code number

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-02 17:55:26 -07:00
Tonghao Zhang
64948427a6 net: openvswitch: use hlist_for_each_entry_rcu instead of hlist_for_each_entry
The struct sw_flow is protected by RCU, when traversing them,
use hlist_for_each_entry_rcu.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Tested-by: Greg Rose <gvrose8192@gmail.com>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-02 17:48:12 -07:00
Vincent Bernat
c427bfec18 net: core: enable SO_BINDTODEVICE for non-root users
Currently, SO_BINDTODEVICE requires CAP_NET_RAW. This change allows a
non-root user to bind a socket to an interface if it is not already
bound. This is useful to allow an application to bind itself to a
specific VRF for outgoing or incoming connections. Currently, an
application wanting to manage connections through several VRF need to
be privileged.

Previously, IP_UNICAST_IF and IPV6_UNICAST_IF were added for
Wine (76e21053b5 and c4062dfc42) specifically for use by
non-root processes. However, they are restricted to sendmsg() and not
usable with TCP. Allowing SO_BINDTODEVICE would allow TCP clients to
get the same privilege. As for TCP servers, outside the VRF use case,
SO_BINDTODEVICE would only further restrict connections a server could
accept.

When an application is restricted to a VRF (with `ip vrf exec`), the
socket is bound to an interface at creation and therefore, a
non-privileged call to SO_BINDTODEVICE to escape the VRF fails.

When an application bound a socket to SO_BINDTODEVICE and transmit it
to a non-privileged process through a Unix socket, a tentative to
change the bound device also fails.

Before:

    >>> import socket
    >>> s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    >>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    PermissionError: [Errno 1] Operation not permitted

After:

    >>> import socket
    >>> s=socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    >>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0")
    >>> s.setsockopt(socket.SOL_SOCKET, socket.SO_BINDTODEVICE, b"dummy0")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    PermissionError: [Errno 1] Operation not permitted

Signed-off-by: Vincent Bernat <vincent@bernat.ch>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-02 17:46:43 -07:00
Jakub Sitnicki
7a1ca97269 net, sk_msg: Don't use RCU_INIT_POINTER on sk_user_data
sparse reports an error due to use of RCU_INIT_POINTER helper to assign to
sk_user_data pointer, which is not tagged with __rcu:

net/core/sock.c:1875:25: error: incompatible types in comparison expression (different address spaces):
net/core/sock.c:1875:25:    void [noderef] <asn:4> *
net/core/sock.c:1875:25:    void *

... and rightfully so. sk_user_data is not always treated as a pointer to
an RCU-protected data. When it is used to point at an RCU-protected object,
we access it with __sk_user_data to inform sparse about it.

In this case, when the child socket does not inherit sk_user_data from the
parent, there is no reason to treat it as an RCU-protected pointer.

Use a regular assignment to clear the pointer value.

Fixes: f1ff5ce2cd ("net, sk_msg: Clear sk_user_data pointer on clone if tagged")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200402125524.851439-1-jakub@cloudflare.com
2020-04-03 00:36:37 +02:00
Matthieu Baerts
564cf2f395 mptcp: fix "fn parameter not described" warnings
Obtained with:

  $ make W=1 net/mptcp/token.o
  net/mptcp/token.c:53: warning: Function parameter or member 'req' not described in 'mptcp_token_new_request'
  net/mptcp/token.c:98: warning: Function parameter or member 'sk' not described in 'mptcp_token_new_connect'
  net/mptcp/token.c:133: warning: Function parameter or member 'conn' not described in 'mptcp_token_new_accept'
  net/mptcp/token.c:178: warning: Function parameter or member 'token' not described in 'mptcp_token_destroy_request'
  net/mptcp/token.c:191: warning: Function parameter or member 'token' not described in 'mptcp_token_destroy'

Fixes: 79c0949e9a (mptcp: Add key generation and token tree)
Fixes: 58b0991962 (mptcp: create msk early)
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-02 06:59:21 -07:00
Florian Westphal
de06f57392 mptcp: re-check dsn before reading from subflow
mptcp_subflow_data_available() is commonly called via
ssk->sk_data_ready(), in this case the mptcp socket lock
cannot be acquired.

Therefore, while we can safely discard subflow data that
was already received up to msk->ack_seq, we cannot be sure
that 'subflow->data_avail' will still be valid at the time
userspace wants to read the data -- a previous read on a
different subflow might have carried this data already.

In that (unlikely) event, msk->ack_seq will have been updated
and will be ahead of the subflow dsn.

We can check for this condition and skip/resync to the expected
sequence number.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-02 06:59:21 -07:00
Florian Westphal
59832e2465 mptcp: subflow: check parent mptcp socket on subflow state change
This is needed at least until proper MPTCP-Level fin/reset
signalling gets added:

We wake parent when a subflow changes, but we should do this only
when all subflows have closed, not just one.

Schedule the mptcp worker and tell it to check eof state on all
subflows.

Only flag mptcp socket as closed and wake userspace processes blocking
in poll if all subflows have closed.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-02 06:59:21 -07:00
Florian Westphal
0b4f33def7 mptcp: fix tcp fallback crash
Christoph Paasch reports following crash:

general protection fault [..]
CPU: 0 PID: 2874 Comm: syz-executor072 Not tainted 5.6.0-rc5 #62
RIP: 0010:__pv_queued_spin_lock_slowpath kernel/locking/qspinlock.c:471
[..]
 queued_spin_lock_slowpath arch/x86/include/asm/qspinlock.h:50 [inline]
 do_raw_spin_lock include/linux/spinlock.h:181 [inline]
 spin_lock_bh include/linux/spinlock.h:343 [inline]
 __mptcp_flush_join_list+0x44/0xb0 net/mptcp/protocol.c:278
 mptcp_shutdown+0xb3/0x230 net/mptcp/protocol.c:1882
[..]

Problem is that mptcp_shutdown() socket isn't an mptcp socket,
its a plain tcp_sk.  Thus, trying to access mptcp_sk specific
members accesses garbage.

Root cause is that accept() returns a fallback (tcp) socket, not an mptcp
one.  There is code in getpeername to detect this and override the sockets
stream_ops.  But this will only run when accept() caller provided a
sockaddr struct.  "accept(fd, NULL, 0)" will therefore result in
mptcp stream ops, but with sock->sk pointing at a tcp_sk.

Update the existing fallback handling to detect this as well.

Moreover, mptcp_shutdown did not have fallback handling, and
mptcp_poll did it too late so add that there as well.

Reported-by: Christoph Paasch <cpaasch@apple.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-02 06:59:21 -07:00
Colin Ian King
d16fa75925 net: ipv6: rpl_iptunnel: remove redundant assignments to variable err
The variable err is being initialized with a value that is never
read and it is being updated later with a new value.  The initialization
is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-02 06:57:34 -07:00
kbuild test robot
bf88dc327d net: dsa: dsa_bridge_mtu_normalization() can be static
Fixes: f41071407c85 ("net: dsa: implement auto-normalization of MTU for bridge hardware datapath")
Signed-off-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-02 06:51:56 -07:00
Sonny Sasaka
92516cd97f Bluetooth: Always request for user confirmation for Just Works
To improve security, always give the user-space daemon a chance to
accept or reject a Just Works pairing (LE). The daemon may decide to
auto-accept based on the user's intent.

Signed-off-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-04-02 08:25:19 +02:00
Luiz Augusto von Dentz
3ee7b7cd83 Bluetooth: Add BT_MODE socket option
This adds BT_MODE socket option which can be used to set L2CAP modes,
including modes only supported over LE which were not supported using
the L2CAP_OPTIONS.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-04-02 08:25:19 +02:00
Luiz Augusto von Dentz
b86b0b150f Bluetooth: L2CAP: Fix handling LE modes by L2CAP_OPTIONS
L2CAP_OPTIONS shall only be used with BR/EDR modes.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-04-02 08:25:19 +02:00
Jarod Wilson
744fdc8233 ipv6: don't auto-add link-local address to lag ports
Bonding slave and team port devices should not have link-local addresses
automatically added to them, as it can interfere with openvswitch being
able to properly add tc ingress.

Basic reproducer, courtesy of Marcelo:

$ ip link add name bond0 type bond
$ ip link set dev ens2f0np0 master bond0
$ ip link set dev ens2f1np2 master bond0
$ ip link set dev bond0 up
$ ip a s
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN
group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host
       valid_lft forever preferred_lft forever
2: ens2f0np0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc
mq master bond0 state UP group default qlen 1000
    link/ether 00:0f:53:2f:ea:40 brd ff:ff:ff:ff:ff:ff
5: ens2f1np2: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc
mq master bond0 state DOWN group default qlen 1000
    link/ether 00:0f:53:2f:ea:40 brd ff:ff:ff:ff:ff:ff
11: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 1500 qdisc
noqueue state UP group default qlen 1000
    link/ether 00:0f:53:2f:ea:40 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::20f:53ff:fe2f:ea40/64 scope link
       valid_lft forever preferred_lft forever

(above trimmed to relevant entries, obviously)

$ sysctl net.ipv6.conf.ens2f0np0.addr_gen_mode=0
net.ipv6.conf.ens2f0np0.addr_gen_mode = 0
$ sysctl net.ipv6.conf.ens2f1np2.addr_gen_mode=0
net.ipv6.conf.ens2f1np2.addr_gen_mode = 0

$ ip a l ens2f0np0
2: ens2f0np0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 1500 qdisc
mq master bond0 state UP group default qlen 1000
    link/ether 00:0f:53:2f:ea:40 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::20f:53ff:fe2f:ea40/64 scope link tentative
       valid_lft forever preferred_lft forever
$ ip a l ens2f1np2
5: ens2f1np2: <NO-CARRIER,BROADCAST,MULTICAST,SLAVE,UP> mtu 1500 qdisc
mq master bond0 state DOWN group default qlen 1000
    link/ether 00:0f:53:2f:ea:40 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::20f:53ff:fe2f:ea40/64 scope link tentative
       valid_lft forever preferred_lft forever

Looks like addrconf_sysctl_addr_gen_mode() bypasses the original "is
this a slave interface?" check added by commit c2edacf80e, and
results in an address getting added, while w/the proposed patch added,
no address gets added. This simply adds the same gating check to another
code path, and thus should prevent the same devices from erroneously
obtaining an ipv6 link-local address.

Fixes: d35a00b8e3 ("net/ipv6: allow sysctl to change link-local address generation mode")
Reported-by: Moshe Levi <moshele@mellanox.com>
CC: Stephen Hemminger <stephen@networkplumber.org>
CC: Marcelo Ricardo Leitner <mleitner@redhat.com>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-01 11:13:09 -07:00
Cong Wang
304e024216 net_sched: add a temporary refcnt for struct tcindex_data
Although we intentionally use an ordered workqueue for all tc
filter works, the ordering is not guaranteed by RCU work,
given that tcf_queue_work() is esstenially a call_rcu().

This problem is demostrated by Thomas:

  CPU 0:
    tcf_queue_work()
      tcf_queue_work(&r->rwork, tcindex_destroy_rexts_work);

  -> Migration to CPU 1

  CPU 1:
     tcf_queue_work(&p->rwork, tcindex_destroy_work);

so the 2nd work could be queued before the 1st one, which leads
to a free-after-free.

Enforcing this order in RCU work is hard as it requires to change
RCU code too. Fortunately we can workaround this problem in tcindex
filter by taking a temporary refcnt, we only refcnt it right before
we begin to destroy it. This simplifies the code a lot as a full
refcnt requires much more changes in tcindex_set_parms().

Reported-by: syzbot+46f513c3033d592409d2@syzkaller.appspotmail.com
Fixes: 3d210534cc ("net_sched: fix a race condition in tcindex_destroy()")
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Paul E. McKenney <paulmck@kernel.org>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-01 11:06:23 -07:00
Linus Torvalds
29d9f30d4c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Fix the iwlwifi regression, from Johannes Berg.

   2) Support BSS coloring and 802.11 encapsulation offloading in
      hardware, from John Crispin.

   3) Fix some potential Spectre issues in qtnfmac, from Sergey
      Matyukevich.

   4) Add TTL decrement action to openvswitch, from Matteo Croce.

   5) Allow paralleization through flow_action setup by not taking the
      RTNL mutex, from Vlad Buslov.

   6) A lot of zero-length array to flexible-array conversions, from
      Gustavo A. R. Silva.

   7) Align XDP statistics names across several drivers for consistency,
      from Lorenzo Bianconi.

   8) Add various pieces of infrastructure for offloading conntrack, and
      make use of it in mlx5 driver, from Paul Blakey.

   9) Allow using listening sockets in BPF sockmap, from Jakub Sitnicki.

  10) Lots of parallelization improvements during configuration changes
      in mlxsw driver, from Ido Schimmel.

  11) Add support to devlink for generic packet traps, which report
      packets dropped during ACL processing. And use them in mlxsw
      driver. From Jiri Pirko.

  12) Support bcmgenet on ACPI, from Jeremy Linton.

  13) Make BPF compatible with RT, from Thomas Gleixnet, Alexei
      Starovoitov, and your's truly.

  14) Support XDP meta-data in virtio_net, from Yuya Kusakabe.

  15) Fix sysfs permissions when network devices change namespaces, from
      Christian Brauner.

  16) Add a flags element to ethtool_ops so that drivers can more simply
      indicate which coalescing parameters they actually support, and
      therefore the generic layer can validate the user's ethtool
      request. Use this in all drivers, from Jakub Kicinski.

  17) Offload FIFO qdisc in mlxsw, from Petr Machata.

  18) Support UDP sockets in sockmap, from Lorenz Bauer.

  19) Fix stretch ACK bugs in several TCP congestion control modules,
      from Pengcheng Yang.

  20) Support virtual functiosn in octeontx2 driver, from Tomasz
      Duszynski.

  21) Add region operations for devlink and use it in ice driver to dump
      NVM contents, from Jacob Keller.

  22) Add support for hw offload of MACSEC, from Antoine Tenart.

  23) Add support for BPF programs that can be attached to LSM hooks,
      from KP Singh.

  24) Support for multiple paths, path managers, and counters in MPTCP.
      From Peter Krystad, Paolo Abeni, Florian Westphal, Davide Caratti,
      and others.

  25) More progress on adding the netlink interface to ethtool, from
      Michal Kubecek"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2121 commits)
  net: ipv6: rpl_iptunnel: Fix potential memory leak in rpl_do_srh_inline
  cxgb4/chcr: nic-tls stats in ethtool
  net: dsa: fix oops while probing Marvell DSA switches
  net/bpfilter: remove superfluous testing message
  net: macb: Fix handling of fixed-link node
  net: dsa: ksz: Select KSZ protocol tag
  netdevsim: dev: Fix memory leak in nsim_dev_take_snapshot_write
  net: stmmac: add EHL 2.5Gbps PCI info and PCI ID
  net: stmmac: add EHL PSE0 & PSE1 1Gbps PCI info and PCI ID
  net: stmmac: create dwmac-intel.c to contain all Intel platform
  net: dsa: bcm_sf2: Support specifying VLAN tag egress rule
  net: dsa: bcm_sf2: Add support for matching VLAN TCI
  net: dsa: bcm_sf2: Move writing of CFP_DATA(5) into slicing functions
  net: dsa: bcm_sf2: Check earlier for FLOW_EXT and FLOW_MAC_EXT
  net: dsa: bcm_sf2: Disable learning for ASP port
  net: dsa: b53: Deny enslaving port 7 for 7278 into a bridge
  net: dsa: b53: Prevent tagged VLAN on port 7 for 7278
  net: dsa: b53: Restore VLAN entries upon (re)configuration
  net: dsa: bcm_sf2: Fix overflow checks
  hv_netvsc: Remove unnecessary round_up for recv_completion_cnt
  ...
2020-03-31 17:29:33 -07:00
Gustavo A. R. Silva
7f80ccfe99 net: ipv6: rpl_iptunnel: Fix potential memory leak in rpl_do_srh_inline
In case memory resources for buf were allocated, release them before
return.

Addresses-Coverity-ID: 1492011 ("Resource leak")
Fixes: a7a29f9c36 ("net: ipv6: add rpl sr tunnel")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-31 10:12:51 -07:00
Russell King
765bda93d0 net: dsa: fix oops while probing Marvell DSA switches
Fix an oops in dsa_port_phylink_mac_change() caused by a combination
of a20f997010 ("net: dsa: Don't instantiate phylink for CPU/DSA
ports unless needed") and the net-dsa-improve-serdes-integration
series of patches 65b7a2c8e3 ("Merge branch
'net-dsa-improve-serdes-integration'").

Unable to handle kernel NULL pointer dereference at virtual address 00000124
pgd = c0004000
[00000124] *pgd=00000000
Internal error: Oops: 805 [#1] SMP ARM
Modules linked in: tag_edsa spi_nor mtd xhci_plat_hcd mv88e6xxx(+) xhci_hcd armada_thermal marvell_cesa dsa_core ehci_orion libdes phy_armada38x_comphy at24 mcp3021 sfp evbug spi_orion sff mdio_i2c
CPU: 1 PID: 214 Comm: irq/55-mv88e6xx Not tainted 5.6.0+ #470
Hardware name: Marvell Armada 380/385 (Device Tree)
PC is at phylink_mac_change+0x10/0x88
LR is at mv88e6352_serdes_irq_status+0x74/0x94 [mv88e6xxx]

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-31 10:09:07 -07:00
Bruno Meneguele
41c55ea6c2 net/bpfilter: remove superfluous testing message
A testing message was brought by 13d0f7b814 ("net/bpfilter: fix dprintf
usage for /dev/kmsg") but should've been deleted before patch submission.
Although it doesn't cause any harm to the code or functionality itself, it's
totally unpleasant to have it displayed on every loop iteration with no real
use case. Thus remove it unconditionally.

Fixes: 13d0f7b814 ("net/bpfilter: fix dprintf usage for /dev/kmsg")
Signed-off-by: Bruno Meneguele <bmeneg@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-31 10:07:51 -07:00
David S. Miller
5a470b1a63 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-03-30 20:48:43 -07:00
David S. Miller
ed52f2c608 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 19:52:37 -07:00
David S. Miller
d9679cd985 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next:

1) Add support to specify a stateful expression in set definitions,
   this allows users to specify e.g. counters per set elements.

2) Flowtable software counter support.

3) Flowtable hardware offload counter support, from wenxu.

3) Parallelize flowtable hardware offload requests, from Paul Blakey.
   This includes a patch to add one work entry per offload command.

4) Several patches to rework nf_queue refcount handling, from Florian
   Westphal.

4) A few fixes for the flowtable tunnel offload: Fix crash if tunneling
   information is missing and set up indirect flow block as TC_SETUP_FT,
   patch from wenxu.

5) Stricter netlink attribute sanity check on filters, from Romain Bellan
   and Florent Fourcot.

5) Annotations to make sparse happy, from Jules Irenge.

6) Improve icmp errors in debugging information, from Haishuang Yan.

7) Fix warning in IPVS icmp error debugging, from Haishuang Yan.

8) Fix endianess issue in tcp extension header, from Sergey Marinkevich.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 19:40:46 -07:00
Ido Schimmel
c064875a63 devlink: Allow setting of packet trap group parameters
The previous patch allowed device drivers to publish their default
binding between packet trap policers and packet trap groups. However,
some users might not be content with this binding and would like to
change it.

In case user space passed a packet trap policer identifier when setting
a packet trap group, invoke the appropriate device driver callback and
pass the new policer identifier.

v2:
* Check for presence of 'DEVLINK_ATTR_TRAP_POLICER_ID' in
  devlink_trap_group_set() and bail if not present
* Add extack error message in case trap group was partially modified

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 17:54:58 -07:00
Ido Schimmel
f9f54392d2 devlink: Add packet trap group parameters support
Packet trap groups are used to aggregate logically related packet traps.
Currently, these groups allow user space to batch operations such as
setting the trap action of all member traps.

In order to prevent the CPU from being overwhelmed by too many trapped
packets, it is desirable to bind a packet trap policer to these groups.
For example, to limit all the packets that encountered an exception
during routing to 10Kpps.

Allow device drivers to bind default packet trap policers to packet trap
groups when the latter are registered with devlink.

The next patch will enable user space to change this default binding.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 17:54:58 -07:00
Ido Schimmel
1e8c661989 devlink: Add packet trap policers support
Devices capable of offloading the kernel's datapath and perform
functions such as bridging and routing must also be able to send (trap)
specific packets to the kernel (i.e., the CPU) for processing.

For example, a device acting as a multicast-aware bridge must be able to
trap IGMP membership reports to the kernel for processing by the bridge
module.

In most cases, the underlying device is capable of handling packet rates
that are several orders of magnitude higher compared to those that can
be handled by the CPU.

Therefore, in order to prevent the underlying device from overwhelming
the CPU, devices usually include packet trap policers that are able to
police the trapped packets to rates that can be handled by the CPU.

This patch allows capable device drivers to register their supported
packet trap policers with devlink. User space can then tune the
parameters of these policer (currently, rate and burst size) and read
from the device the number of packets that were dropped by the policer,
if supported.

Subsequent patches in the series will allow device drivers to create
default binding between these policers and packet trap groups and allow
user space to change the binding.

v2:
* Add 'strict_start_type' in devlink policy
* Have device drivers provide max/min rate/burst size for each policer.
  Use them to check validity of user provided parameters

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 17:54:58 -07:00
Joe Stringer
7ae215d23c bpf: Don't refcount LISTEN sockets in sk_assign()
Avoid taking a reference on listen sockets by checking the socket type
in the sk_assign and in the corresponding skb_steal_sock() code in the
the transport layer, and by ensuring that the prefetch free (sock_pfree)
function uses the same logic to check whether the socket is refcounted.

Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200329225342.16317-4-joe@wand.net.nz
2020-03-30 13:45:05 -07:00
Joe Stringer
71489e21d7 net: Track socket refcounts in skb_steal_sock()
Refactor the UDP/TCP handlers slightly to allow skb_steal_sock() to make
the determination of whether the socket is reference counted in the case
where it is prefetched by earlier logic such as early_demux.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200329225342.16317-3-joe@wand.net.nz
2020-03-30 13:45:04 -07:00
Joe Stringer
cf7fbe660f bpf: Add socket assign support
Add support for TPROXY via a new bpf helper, bpf_sk_assign().

This helper requires the BPF program to discover the socket via a call
to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
helper takes its own reference to the socket in addition to any existing
reference that may or may not currently be obtained for the duration of
BPF processing. For the destination socket to receive the traffic, the
traffic must be routed towards that socket via local route. The
simplest example route is below, but in practice you may want to route
traffic more narrowly (eg by CIDR):

  $ ip route add local default dev lo

This patch avoids trying to introduce an extra bit into the skb->sk, as
that would require more invasive changes to all code interacting with
the socket to ensure that the bit is handled correctly, such as all
error-handling cases along the path from the helper in BPF through to
the orphan path in the input. Instead, we opt to use the destructor
variable to switch on the prefetch of the socket.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
2020-03-30 13:45:04 -07:00
Linus Torvalds
481ed297d9 This has been a busy cycle for documentation work. Highlights include:
- Lots of RST conversion work by Mauro, Daniel ALmeida, and others.
     Maybe someday we'll get to the end of this stuff...maybe...
 
   - Some organizational work to bring some order to the core-api manual.
 
   - Various new docs and additions to the existing documentation.
 
   - Typo fixes, warning fixes, ...
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl6BLf4PHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5YLhkIAIhcg6gxp0oZZ3KDfQyhvej0EWQGVDNkmloQ
 O1VOSV3RJsZL9HwN9xSNnNfN5+hw5RUYVbn1s201uj6kovZY9qcTpHP2LCizUeGb
 eFkSTmzkyAuAbJjuVLgMPDerJPEew0HnudiToeSpQeoIL1WB6YGd4/5H/cN1KLex
 8ggjllcY0wOgbiFffmK6+tavDv7vT0lKTdwKRYh2nxu7zrPVVd1ZnW+RtntdTVQt
 i+xwV6/YdWtg5C53IwBPpeyubX40vqaIjU8rzpLq5SCVbsZN14sSR709m1AYCOK0
 i4VDWEhfA2XBi6Nycl5U0czuGziwoHrTgSCkS1mmSDujnpgfKM8=
 =6YOS
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.7' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "This has been a busy cycle for documentation work.

  Highlights include:

   - Lots of RST conversion work by Mauro, Daniel ALmeida, and others.
     Maybe someday we'll get to the end of this stuff...maybe...

   - Some organizational work to bring some order to the core-api
     manual.

   - Various new docs and additions to the existing documentation.

   - Typo fixes, warning fixes, ..."

* tag 'docs-5.7' of git://git.lwn.net/linux: (123 commits)
  Documentation: x86: exception-tables: document CONFIG_BUILDTIME_TABLE_SORT
  MAINTAINERS: adjust to filesystem doc ReST conversion
  docs: deprecated.rst: Add BUG()-family
  doc: zh_CN: add translation for virtiofs
  doc: zh_CN: index files in filesystems subdirectory
  docs: locking: Drop :c:func: throughout
  docs: locking: Add 'need' to hardirq section
  docs: conf.py: avoid thousands of duplicate label warning on Sphinx
  docs: prevent warnings due to autosectionlabel
  docs: fix reference to core-api/namespaces.rst
  docs: fix pointers to io-mapping.rst and io_ordering.rst files
  Documentation: Better document the softlockup_panic sysctl
  docs: hw-vuln: tsx_async_abort.rst: get rid of an unused ref
  docs: perf: imx-ddr.rst: get rid of a warning
  docs: filesystems: fuse.rst: supress a Sphinx warning
  docs: translations: it: avoid duplicate refs at programming-language.rst
  docs: driver.rst: supress two ReSt warnings
  docs: trace: events.rst: convert some new stuff to ReST format
  Documentation: Add io_ordering.rst to driver-api manual
  Documentation: Add io-mapping.rst to driver-api manual
  ...
2020-03-30 12:45:23 -07:00
Linus Torvalds
e59cd88028 for-5.7/io_uring-2020-03-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6BJEMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpie7D/9gN4zhykYDfcgamfxMtTbpla2PdTnWoJxP
 fjy/Nx2FySakmccaiCGQSQ1rzD1L67UQkJgEH6hPTomJvA4FaOmJ+ZSaExMy55LH
 ZT+nD3zQ9SCuA0DEpfxbsCP1tbnoXSMQNt8Tyh0x8PAoxp5bI0eRczOju1QWLWTS
 tjBEMZNipN6krrV9RPWT0S5Z31/yGr/sXprCSHFV9Ypzwrx58Tj2i6F9gR7FVbLs
 nV2/O8taEn0sMQIz8TVHKol/TBalluGrC4M/bOeS3faP3BPN4TT24Gtc0LAKEibk
 F49/SX7FzwhOdl43Bdkbe2bbL86p+zOLSf0IMBwMm0DJl4aiOljRUYTSYRolgGgm
 Ebw9QhemTwbxxeD2nEriA4EAeYvTx69RDlN2eVilwwfJ48Xz9fVm3GNYG7LISeON
 k3/TyZOBQH2SZ2Hc3oF2Mq9j1UPHXZHUUsUNlNcN+aM9SFHcWkRi6xZWemTJHJZ4
 zFss5RZHo0+RLBa8rrx8xaO8iWrc73+FuRhr9eSsmyPIj+OZ4ezEFRRRHwtk2fgv
 dZvD413AyCI1c+3LlBusESMsrtXyY8p9O9buNTzHy3ZUtHe0ERmYV2m/a83A5pXo
 Kia/5aJbPIC61bAkCCkiVo+W9OASJ6o5+3CXl5sM9lGTbDXjcofzewmd+RHPestx
 xVbzeR9UIw==
 =bYLJ
 -----END PGP SIGNATURE-----

Merge tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "Here are the io_uring changes for this merge window. Light on new
  features this time around (just splice + buffer selection), lots of
  cleanups, fixes, and improvements to existing support. In particular,
  this contains:

   - Cleanup fixed file update handling for stack fallback (Hillf)

   - Re-work of how pollable async IO is handled, we no longer require
     thread offload to handle that. Instead we rely using poll to drive
     this, with task_work execution.

   - In conjunction with the above, allow expendable buffer selection,
     so that poll+recv (for example) no longer has to be a split
     operation.

   - Make sure we honor RLIMIT_FSIZE for buffered writes

   - Add support for splice (Pavel)

   - Linked work inheritance fixes and optimizations (Pavel)

   - Async work fixes and cleanups (Pavel)

   - Improve io-wq locking (Pavel)

   - Hashed link write improvements (Pavel)

   - SETUP_IOPOLL|SETUP_SQPOLL improvements (Xiaoguang)"

* tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block: (54 commits)
  io_uring: cleanup io_alloc_async_ctx()
  io_uring: fix missing 'return' in comment
  io-wq: handle hashed writes in chains
  io-uring: drop 'free_pfile' in struct io_file_put
  io-uring: drop completion when removing file
  io_uring: Fix ->data corruption on re-enqueue
  io-wq: close cancel gap for hashed linked work
  io_uring: make spdxcheck.py happy
  io_uring: honor original task RLIMIT_FSIZE
  io-wq: hash dependent work
  io-wq: split hashing and enqueueing
  io-wq: don't resched if there is no work
  io-wq: remove duplicated cancel code
  io_uring: fix truncated async read/readv and write/writev retry
  io_uring: dual license io_uring.h uapi header
  io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled
  io_uring: Fix unused function warnings
  io_uring: add end-of-bits marker and build time verify it
  io_uring: provide means of removing buffers
  io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG
  ...
2020-03-30 12:18:49 -07:00
Haishuang Yan
e19680f834 ipvs: fix uninitialized variable warning
If outer_proto is not set, GCC warning as following:

In file included from net/netfilter/ipvs/ip_vs_core.c:52:
net/netfilter/ipvs/ip_vs_core.c: In function 'ip_vs_in_icmp':
include/net/ip_vs.h:233:4: warning: 'outer_proto' may be used uninitialized in this function [-Wmaybe-uninitialized]
 233 |    printk(KERN_DEBUG pr_fmt(msg), ##__VA_ARGS__); \
     |    ^~~~~~
net/netfilter/ipvs/ip_vs_core.c:1666:8: note: 'outer_proto' was declared here
1666 |  char *outer_proto;
     |        ^~~~~~~~~~~

Fixes: 73348fed35 ("ipvs: optimize tunnel dumps for icmp errors")
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30 21:17:53 +02:00
Sergey Marinkevich
2e34328b39 netfilter: nft_exthdr: fix endianness of tcp option cast
I got a problem on MIPS with Big-Endian is turned on: every time when
NF trying to change TCP MSS it returns because of new.v16 was greater
than old.v16. But real MSS was 1460 and my rule was like this:

	add rule table chain tcp option maxseg size set 1400

And 1400 is lesser that 1460, not greater.

Later I founded that main causer is cast from u32 to __be16.

Debugging:

In example MSS = 1400(HEX: 0x578). Here is representation of each byte
like it is in memory by addresses from left to right(e.g. [0x0 0x1 0x2
0x3]). LE — Little-Endian system, BE — Big-Endian, left column is type.

	     LE               BE
	u32: [78 05 00 00]    [00 00 05 78]

As you can see, u32 representation will be casted to u16 from different
half of 4-byte address range. But actually nf_tables uses registers and
store data of various size. Actually TCP MSS stored in 2 bytes. But
registers are still u32 in definition:

	struct nft_regs {
		union {
			u32			data[20];
			struct nft_verdict	verdict;
		};
	};

So, access like regs->data[priv->sreg] exactly u32. So, according to
table presents above, per-byte representation of stored TCP MSS in
register will be:

	                     LE               BE
	(u32)regs->data[]:   [78 05 00 00]    [05 78 00 00]
	                                       ^^ ^^

We see that register uses just half of u32 and other 2 bytes may be
used for some another data. But in nft_exthdr_tcp_set_eval() it casted
just like u32 -> __be16:

	new.v16 = src

But u32 overfill __be16, so it get 2 low bytes. For clarity draw
one more table(<xx xx> means that bytes will be used for cast).

	                     LE                 BE
	u32:                 [<78 05> 00 00]    [00 00 <05 78>]
	(u32)regs->data[]:   [<78 05> 00 00]    [05 78 <00 00>]

As you can see, for Little-Endian nothing changes, but for Big-endian we
take the wrong half. In my case there is some other data instead of
zeros, so new MSS was wrongly greater.

For shooting this bug I used solution for ports ranges. Applying of this
patch does not affect Little-Endian systems.

Signed-off-by: Sergey Marinkevich <sergey.marinkevich@eltex-co.ru>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30 21:17:53 +02:00
David S. Miller
033c6f3b78 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2020-03-29

Here are a few more Bluetooth patches for the 5.7 kernel:

 - Fix assumption of encryption key size when reading fails
 - Add support for DEFER_SETUP with L2CAP Enhanced Credit Based Mode
 - Fix issue with auto-connected devices
 - Fix suspend handling when entering the state fails
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 11:49:14 -07:00
Vladimir Oltean
342971766c net: dsa: add port policers
The approach taken to pass the port policer methods on to drivers is
pragmatic. It is similar to the port mirroring implementation (in that
the DSA core does all of the filter block interaction and only passes
simple operations for the driver to implement) and dissimilar to how
flow-based policers are going to be implemented (where the driver has
full control over the flow_cls_offload data structure).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 11:44:00 -07:00
Vladimir Oltean
e13c207528 net: dsa: refactor matchall mirred action to separate function
Make room for other actions for the matchall filter by keeping the
mirred argument parsing self-contained in its own function.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 11:44:00 -07:00
Eran Ben Elisha
48bb52c80b devlink: Add auto dump flag to health reporter
On low memory system, run time dumps can consume too much memory. Add
administrator ability to disable auto dumps per reporter as part of the
error flow handle routine.

This attribute is not relevant while executing
DEVLINK_CMD_HEALTH_REPORTER_DUMP_GET.

By default, auto dump is activated for any reporter that has a dump method,
as part of the reporter registration to devlink.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 11:17:34 -07:00
Eran Ben Elisha
ba7d16c779 devlink: Implicitly set auto recover flag when registering health reporter
When health reporter is registered to devlink, devlink will implicitly set
auto recover if and only if the reporter has a recover method. No reason
to explicitly get the auto recover flag from the driver.

Remove this flag from all drivers that called
devlink_health_reporter_create.

All existing health reporters set auto recovery to true if they have a
recover method.

Yet, administrator can unset auto recover via netlink command as prior to
this patch.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 11:17:34 -07:00
Jiri Pirko
054eae8253 net: devlink: use NL_SET_ERR_MSG_MOD instead of NL_SET_ERR_MSG
The rest of the devlink code sets the extack message using
NL_SET_ERR_MSG_MOD. Change the existing appearances of NL_SET_ERR_MSG
to NL_SET_ERR_MSG_MOD.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 11:07:39 -07:00
Jiri Pirko
93a129eb8c net: sched: expose HW stats types per action used by drivers
It may be up to the driver (in case ANY HW stats is passed) to select
which type of HW stats he is going to use. Add an infrastructure to
expose this information to user.

$ tc filter add dev enp3s0np1 ingress proto ip handle 1 pref 1 flower dst_ip 192.168.1.1 action drop
$ tc -s filter show dev enp3s0np1 ingress
filter protocol ip pref 1 flower chain 0
filter protocol ip pref 1 flower chain 0 handle 0x1
  eth_type ipv4
  dst_ip 192.168.1.1
  in_hw in_hw_count 2
        action order 1: gact action drop
         random type none pass val 0
         index 1 ref 1 bind 1 installed 10 sec used 10 sec
        Action statistics:
        Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0
        used_hw_stats immediate     <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 11:06:49 -07:00
Jiri Pirko
8953b0770f net: introduce nla_put_bitfield32() helper and use it
Introduce a helper to pass value and selector to. The helper packs them
into struct and puts them into netlink message.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 11:06:49 -07:00
David S. Miller
acc086bfb9 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2020-03-28

1) Use kmem_cache_zalloc() instead of kmem_cache_alloc()
   in xfrm_state_alloc(). From Huang Zijiang.

2) esp_output_fill_trailer() is the same in IPv4 and IPv6,
   so share this function to avoide code duplcation.
   From Raed Salem.

3) Add offload support for esp beet mode.
   From Xin Long.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 10:59:20 -07:00
Christophe JAILLET
ee91a83e08 net: dsa: Simplify 'dsa_tag_protocol_to_str()'
There is no point in preparing the module name in a buffer. The format
string can be passed diectly to 'request_module()'.

This axes a few lines of code and cleans a few things:
   - max len for a driver name is MODULE_NAME_LEN wich is ~ 60 chars,
     not 128. It would be down-sized in 'request_module()'
   - we should pass the total size of the buffer to 'snprintf()', not the
     size minus 1

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 10:54:57 -07:00
Florian Westphal
cf673ed0e0 net: fix fraglist segmentation reference count leak
Xin Long says:
 On udp rx path udp_rcv_segment() may do segment where the frag skbs
 will get the header copied from the head skb in skb_segment_list()
 by calling __copy_skb_header(), which could overwrite the frag skbs'
 extensions by __skb_ext_copy() and cause a leak.

 This issue was found after loading esp_offload where a sec path ext
 is set in the skb.

Fix this by discarding head state of the fraglist skb before replacing
its contents.

Fixes: 3a1296a38d ("net: Support GRO/GSO fraglist chaining.")
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Reported-by: Xiumei Mu <xmu@redhat.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 10:41:58 -07:00
Xin Long
bde1b56f89 udp: initialize is_flist with 0 in udp_gro_receive
Without NAPI_GRO_CB(skb)->is_flist initialized, when the dev doesn't
support NETIF_F_GRO_FRAGLIST, is_flist can still be set and fraglist
will be used in udp_gro_receive().

So fix it by initializing is_flist with 0 in udp_gro_receive.

Fixes: 9fd1ff5d2a ("udp: Support UDP fraglist GRO/GSO.")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 10:35:03 -07:00
Ilya Dryomov
bb0e681dda libceph: directly skip to the end of redirect reply
Coverity complains about a double write to *p.  Don't bother with
osd_instructions and directly skip to the end of redirect reply.

Reported-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Ilya Dryomov
4d8b8fb494 libceph: simplify ceph_monc_handle_map()
ceph_monc_handle_map() confuses static checkers which report a
false use-after-free on monc->monmap, missing that monc->monmap and
client->monc.monmap is the same pointer.

Use monc->monmap consistently and get rid of "old", which is redundant.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:41 +02:00
Xiubo Li
5107d7d505 ceph: move ceph_osdc_{read,write}pages to ceph.ko
Since these helpers are only used by ceph.ko, move them there and
rename them with _sync_ qualifiers.

Signed-off-by: Xiubo Li <xiubli@redhat.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-03-30 12:42:40 +02:00
Ilya Dryomov
072eaf3c0f libceph: drop CEPH_DEFINE_SHOW_FUNC
Although CEPH_DEFINE_SHOW_FUNC is much older, it now duplicates
DEFINE_SHOW_ATTRIBUTE from linux/seq_file.h.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
2020-03-30 12:42:40 +02:00
Michal Kubecek
5b071c59ed ethtool: provide timestamping information with TSINFO_GET request
Implement TSINFO_GET request to get timestamping information for a network
device. This is traditionally available via ETHTOOL_GET_TS_INFO ioctl
request.

Move part of ethtool_get_ts_info() into common.c so that ioctl and netlink
code use the same logic to get timestamping information from the device.

v3: use "TSINFO" rather than "TIMESTAMP", suggested by Richard Cochran

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:32:37 -07:00
Michal Kubecek
f76510b458 ethtool: add timestamping related string sets
Add three string sets related to timestamping information:

  ETH_SS_SOF_TIMESTAMPING: SOF_TIMESTAMPING_* flags
  ETH_SS_TS_TX_TYPES:      timestamping Tx types
  ETH_SS_TS_RX_FILTERS:    timestamping Rx filters

These will be used for TIMESTAMP_GET request.

v2: avoid compiler warning ("enumeration value not handled in switch")
    in net_hwtstamp_validate()

v3: omit dash in Tx type names ("one-step-*" -> "onestep-*"), suggested by
    Richard Cochran

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:32:36 -07:00
Michal Kubecek
6c5bc8fe4e ethtool: add EEE_NTF notification
Send ETHTOOL_MSG_EEE_NTF notification whenever EEE settings of a network
device are modified using ETHTOOL_MSG_EEE_SET netlink message or
ETHTOOL_SEEE ioctl request.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:32:36 -07:00
Michal Kubecek
fd77be7bd4 ethtool: set EEE settings with EEE_SET request
Implement EEE_SET netlink request to set EEE settings of a network device.
These are traditionally set with ETHTOOL_SEEE ioctl request.

The netlink interface allows setting the EEE status for all link modes
supported by kernel but only first 32 link modes can be set at the moment
as only those are supported by the ethtool_ops callback.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:32:36 -07:00
Michal Kubecek
b7eeefe72e ethtool: provide EEE settings with EEE_GET request
Implement EEE_GET request to get EEE settings of a network device. These
are traditionally available via ETHTOOL_GEEE ioctl request.

The netlink interface allows reporting EEE status for all link modes
supported by kernel but only first 32 link modes are provided at the moment
as only those are reported by the ethtool_ops callback and drivers.

v2: fix alignment (whitespace only)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:32:36 -07:00
Michal Kubecek
bf37faa386 ethtool: add PAUSE_NTF notification
Send ETHTOOL_MSG_PAUSE_NTF notification whenever pause parameters of
a network device are modified using ETHTOOL_MSG_PAUSE_SET netlink message
or ETHTOOL_SPAUSEPARAM ioctl request.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:32:36 -07:00
Michal Kubecek
3ab879933d ethtool: set pause parameters with PAUSE_SET request
Implement PAUSE_SET netlink request to set pause parameters of a network
device. Thease are traditionally set with ETHTOOL_SPAUSEPARAM ioctl
request.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:32:36 -07:00
Michal Kubecek
7f59fb32b0 ethtool: provide pause parameters with PAUSE_GET request
Implement PAUSE_GET request to get pause parameters of a network device.
These are traditionally available via ETHTOOL_GPAUSEPARAM ioctl request.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:32:36 -07:00
Michal Kubecek
0cf3eac8c9 ethtool: add COALESCE_NTF notification
Send ETHTOOL_MSG_COALESCE_NTF notification whenever coalescing parameters
of a network device are modified using ETHTOOL_MSG_COALESCE_SET netlink
message or ETHTOOL_SCOALESCE ioctl request.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:32:36 -07:00
Michal Kubecek
9881418c75 ethtool: set coalescing parameters with COALESCE_SET request
Implement COALESCE_SET netlink request to set coalescing parameters of
a network device. These are traditionally set with ETHTOOL_SCOALESCE ioctl
request. This commit adds only support for device coalescing parameters,
not per queue coalescing parameters.

Like the ioctl implementation, the generic ethtool code checks if only
supported parameters are modified; if not, first offending attribute is
reported using extack.

v2: fix alignment (whitespace only)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:32:36 -07:00
Michal Kubecek
217275453b ethtool: provide coalescing parameters with COALESCE_GET request
Implement COALESCE_GET request to get coalescing parameters of a network
device. These are traditionally available via ETHTOOL_GCOALESCE ioctl
request. This commit adds only support for device coalescing parameters,
not per queue coalescing parameters.

Omit attributes with zero values unless they are declared as supported
(i.e. the corresponding bit in ethtool_ops::supported_coalesce_params is
set).

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:32:36 -07:00
Michal Kubecek
b51fb7711a ethtool: fix reference leak in ethnl_set_privflags()
Andrew noticed that some handlers for *_SET commands leak a netdev
reference if required ethtool_ops callbacks do not exist. One of them is
ethnl_set_privflags(), a simple reproducer would be e.g.

  ip link add veth1 type veth peer name veth2
  ethtool --set-priv-flags veth1 foo on
  ip link del veth1

Make sure dev_put() is called when ethtool_ops check fails.

Fixes: f265d79959 ("ethtool: set device private flags with PRIVFLAGS_SET request")
Reported-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:32:36 -07:00
Alexander Aring
a7a29f9c36 net: ipv6: add rpl sr tunnel
This patch adds functionality to configure routes for RPL source routing
functionality. There is no IPIP functionality yet implemented which can
be added later when the cases when to use IPv6 encapuslation comes more
clear.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:30:57 -07:00
Alexander Aring
faee676944 net: add net available in build_state
The build_state callback of lwtunnel doesn't contain the net namespace
structure yet. This patch will add it so we can check on specific
address configuration at creation time of rpl source routes.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:30:57 -07:00
Alexander Aring
8610c7c6e3 net: ipv6: add support for rpl sr exthdr
This patch adds rpl source routing receive handling. Everything works
only if sysconf "rpl_seg_enabled" and source routing is enabled. Mostly
the same behaviour as IPv6 segmentation routing. To handle compression
and uncompression a rpl.c file is created which contains the necessary
functionality. The receive handling will also care about IPv6
encapsulated so far it's specified as possible nexthdr in RFC 6554.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:30:57 -07:00
Alexander Aring
f37c605936 addrconf: add functionality to check on rpl requirements
This patch adds a functionality to addrconf to check on a specific RPL
address configuration. According to RFC 6554:

To detect loops in the SRH, a router MUST determine if the SRH
includes multiple addresses assigned to any interface on that
router. If such addresses appear more than once and are separated by
at least one address not assigned to that router.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:30:57 -07:00
William Dauchy
25629fdaff net, ip_tunnel: fix interface lookup with no key
when creating a new ipip interface with no local/remote configuration,
the lookup is done with TUNNEL_NO_KEY flag, making it impossible to
match the new interface (only possible match being fallback or metada
case interface); e.g: `ip link add tunl1 type ipip dev eth0`

To fix this case, adding a flag check before the key comparison so we
permit to match an interface with no local/remote config; it also avoids
breaking possible userland tools relying on TUNNEL_NO_KEY flag and
uninitialised key.

context being on my side, I'm creating an extra ipip interface attached
to the physical one, and moving it to a dedicated namespace.

Fixes: c544193214 ("GRE: Refactor GRE tunneling code.")
Signed-off-by: William Dauchy <w.dauchy@criteo.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:16:27 -07:00
Paolo Abeni
01cacb00b3 mptcp: add netlink-based PM
Expose a new netlink family to userspace to control the PM, setting:

 - list of local addresses to be signalled.
 - list of local addresses used to created subflows.
 - maximum number of add_addr option to react

When the msk is fully established, the PM netlink attempts to
announce the 'signal' list via the ADD_ADDR option. Since we
currently lack the ADD_ADDR echo (and related event) only the
first addr is sent.

After exhausting the 'announce' list, the PM tries to create
subflow for each addr in 'local' list, waiting for each
connection to be completed before attempting the next one.

Idea is to add an additional PM hook for ADD_ADDR echo, to allow
the PM netlink announcing multiple addresses, in sequence.

Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:49 -07:00
Florian Westphal
fc518953bc mptcp: add and use MIB counter infrastructure
Exported via same /proc file as the Linux TCP MIB counters, so "netstat -s"
or "nstat" will show them automatically.

The MPTCP MIB counters are allocated in a distinct pcpu area in order to
avoid bloating/wasting TCP pcpu memory.

Counters are allocated once the first MPTCP socket is created in a
network namespace and free'd on exit.

If no sockets have been allocated, all-zero mptcp counters are shown.

The MIB counter list is taken from the multipath-tcp.org kernel, but
only a few counters have been picked up so far.  The counter list can
be increased at any time later on.

v2 -> v3:
 - remove 'inline' in foo.c files (David S. Miller)

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:49 -07:00
Davide Caratti
5147dfb508 mptcp: allow dumping subflow context to userspace
add ulp-specific diagnostic functions, so that subflow information can be
dumped to userspace programs like 'ss'.

v2 -> v3:
- uapi: use bit macros appropriate for userspace

Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
Paolo Abeni
3b1d6210a9 mptcp: implement and use MPTCP-level retransmission
On timeout event, schedule a work queue to do the retransmission.
Retransmission code closely resembles the sendmsg() implementation and
re-uses mptcp_sendmsg_frag, providing a dummy msghdr - for flags'
sake - and peeking the relevant dfrag from the rtx head.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
Paolo Abeni
3f8e0aae17 mptcp: rework mptcp_sendmsg_frag to accept optional dfrag
This will simplify mptcp-level retransmission implementation
in the next patch. If dfrag is provided by the caller, skip
kernel space memory allocation and use data and metadata
provided by the dfrag itself.

Because a peer could ack data at TCP level but refrain from
sending mptcp-level ACKs, we could grow the mptcp socket
backlog indefinitely.

We should thus block mptcp_sendmsg until the peer has acked some of the
sent data.

In order to be able to do so, increment the mptcp socket wmem_queued
counter on memory allocation and decrement it when releasing the memory
on mptcp-level ack reception.

Because TCP performns sndbuf auto-tuning up to tcp_wmem_max[2], make
this the mptcp sk_sndbuf limit.

In the future we could add experiment with autotuning as TCP does in
tcp_sndbuf_expand().

v2 -> v3:
 - remove 'inline' in foo.c files (David S. Miller)

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
Florian Westphal
7948f6cc99 mptcp: allow partial cleaning of rtx head dfrag
After adding wmem accounting for the mptcp socket we could get
into a situation where the mptcp socket can't transmit more data,
and mptcp_clean_una doesn't reduce wmem even if snd_una has advanced
because it currently will only remove entire dfrags.

Allow advancing the dfrag head sequence and reduce wmem,
even though this isn't correct (as we can't release the page).

Because we will soon block on mptcp sk in case wmem is too large,
call sk_stream_write_space() in case we reduced the backlog so
userspace task blocked in sendmsg or poll will be woken up.

This isn't an issue if the send buffer is large, but it is when
SO_SNDBUF is used to reduce it to a lower value.

Note we can still get a deadlock for low SO_SNDBUF values in
case both sides of the connection write to the socket: both could
be blocked due to wmem being too small -- and current mptcp stack
will only increment mptcp ack_seq on recv.

This doesn't happen with the selftest as it uses poll() and
will always call recv if there is data to read.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
Paolo Abeni
d027236c41 mptcp: implement memory accounting for mptcp rtx queue
Charge the data on the rtx queue to the master MPTCP socket, too.
Such memory in uncharged when the data is acked/dequeued.

Also account mptcp sockets inuse via a protocol specific pcpu
counter.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
Paolo Abeni
b51f9b80c0 mptcp: introduce MPTCP retransmission timer
The timer will be used to schedule retransmission. It's
frequency is based on the current subflow RTO estimation and
is reset on every una_seq update

The timer is clearer for good by __mptcp_clear_xmit()

Also clean MPTCP rtx queue before each transmission.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
Paolo Abeni
18b683bff8 mptcp: queue data for mptcp level retransmission
Keep the send page fragment on an MPTCP level retransmission queue.
The queue entries are allocated inside the page frag allocator,
acquiring an additional reference to the page for each list entry.

Also switch to a custom page frag refill function, to ensure that
the current page fragment can always host an MPTCP rtx queue entry.

The MPTCP rtx queue is flushed at disconnect() and close() time

Note that now we need to call __mptcp_init_sock() regardless of mptcp
enable status, as the destructor will try to walk the rtx_queue.

v2 -> v3:
 - remove 'inline' in foo.c files (David S. Miller)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
Paolo Abeni
cc9d256698 mptcp: update per unacked sequence on pkt reception
So that we keep per unacked sequence number consistent; since
we update per msk data, use an atomic64 cmpxchg() to protect
against concurrent updates from multiple subflows.

Initialize the snd_una at connect()/accept() time.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
Peter Krystad
926bdeab55 mptcp: Implement path manager interface commands
Fill in more path manager functionality by adding a worker function and
modifying the related stub functions to schedule the worker.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
Peter Krystad
ec3edaa7ca mptcp: Add handling of outgoing MP_JOIN requests
Subflow creation may be initiated by the path manager when
the primary connection is fully established and a remote
address has been received via ADD_ADDR.

Create an in-kernel sock and use kernel_connect() to
initiate connection.

Passive sockets can't acquire the mptcp socket lock at
subflow creation time, so an additional list protected by
a new spinlock is used to track the MPJ subflows.

Such list is spliced into conn_list tail every time the msk
socket lock is acquired, so that it will not interfere
with data flow on the original connection.

Data flow and connection failover not addressed by this commit.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
Peter Krystad
f296234c98 mptcp: Add handling of incoming MP_JOIN requests
Process the MP_JOIN option in a SYN packet with the same flow
as MP_CAPABLE but when the third ACK is received add the
subflow to the MPTCP socket subflow list instead of adding it to
the TCP socket accept queue.

The subflow is added at the end of the subflow list so it will not
interfere with the existing subflows operation and no data is
expected to be transmitted on it.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
Peter Krystad
1b1c7a0ef7 mptcp: Add path manager interface
Add enough of a path manager interface to allow sending of ADD_ADDR
when an incoming MPTCP connection is created. Capable of sending only
a single IPv4 ADD_ADDR option. The 'pm_data' element of the connection
sock will need to be expanded to handle multiple interfaces and IPv6.
Partial processing of the incoming ADD_ADDR is included so the path
manager notification of that event happens at the proper time, which
involves validating the incoming address information.

This is a skeleton interface definition for events generated by
MPTCP.

Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
Peter Krystad
3df523ab58 mptcp: Add ADD_ADDR handling
Add handling for sending and receiving the ADD_ADDR, ADD_ADDR6,
and RM_ADDR suboptions.

Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
Marcelo Ricardo Leitner
582eea2305 sctp: fix possibly using a bad saddr with a given dst
Under certain circumstances, depending on the order of addresses on the
interfaces, it could be that sctp_v[46]_get_dst() would return a dst
with a mismatched struct flowi.

For example, if when walking through the bind addresses and the first
one is not a match, it saves the dst as a fallback (added in
410f03831c), but not the flowi. Then if the next one is also not a
match, the previous dst will be returned but with the flowi information
for the 2nd address, which is wrong.

The fix is to use a locally stored flowi that can be used for such
attempts, and copy it to the parameter only in case it is a possible
match, together with the corresponding dst entry.

The patch updates IPv6 code mostly just to be in sync. Even though the issue
is also present there, it fallback is not expected to work with IPv6.

Fixes: 410f03831c ("sctp: add routing output fallback")
Reported-by: Jin Meng <meng.a.jin@nokia-sbell.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Tested-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:01:43 -07:00
Qiujun Huang
5c3e82fe15 sctp: fix refcount bug in sctp_wfree
We should iterate over the datamsgs to move
all chunks(skbs) to newsk.

The following case cause the bug:
for the trouble SKB, it was in outq->transmitted list

sctp_outq_sack
        sctp_check_transmitted
                SKB was moved to outq->sacked list
        then throw away the sack queue
                SKB was deleted from outq->sacked
(but it was held by datamsg at sctp_datamsg_to_asoc
So, sctp_wfree was not called here)

then migrate happened

        sctp_for_each_tx_datachunk(
        sctp_clear_owner_w);
        sctp_assoc_migrate();
        sctp_for_each_tx_datachunk(
        sctp_set_owner_w);
SKB was not in the outq, and was not changed to newsk

finally

__sctp_outq_teardown
        sctp_chunk_put (for another skb)
                sctp_datamsg_put
                        __kfree_skb(msg->frag_list)
                                sctp_wfree (for SKB)
	SKB->sk was still oldsk (skb->sk != asoc->base.sk).

Reported-and-tested-by: syzbot+cea71eec5d6de256d54d@syzkaller.appspotmail.com
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Acked-by: Marcelo Ricardo Leitner <mleitner@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 21:57:30 -07:00
Cambda Zhu
a08e7fd912 net: Fix typo of SKB_SGO_CB_OFFSET
The SKB_SGO_CB_OFFSET should be SKB_GSO_CB_OFFSET which means the
offset of the GSO in skb cb. This patch fixes the typo.

Fixes: 9207f9d45b ("net: preserve IP control block during GSO segmentation")
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 21:53:18 -07:00
Qian Cai
fbe4e0c1b2 ipv4: fix a RCU-list lock in fib_triestat_seq_show
fib_triestat_seq_show() calls hlist_for_each_entry_rcu(tb, head,
tb_hlist) without rcu_read_lock() will trigger a warning,

 net/ipv4/fib_trie.c:2579 RCU-list traversed in non-reader section!!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 1 lock held by proc01/115277:
  #0: c0000014507acf00 (&p->lock){+.+.}-{3:3}, at: seq_read+0x58/0x670

 Call Trace:
  dump_stack+0xf4/0x164 (unreliable)
  lockdep_rcu_suspicious+0x140/0x164
  fib_triestat_seq_show+0x750/0x880
  seq_read+0x1a0/0x670
  proc_reg_read+0x10c/0x1b0
  __vfs_read+0x3c/0x70
  vfs_read+0xac/0x170
  ksys_read+0x7c/0x140
  system_call+0x5c/0x68

Fix it by adding a pair of rcu_read_lock/unlock() and use
cond_resched_rcu() to avoid the situation where walking of a large
number of items  may prevent scheduling for a long time.

Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 21:51:28 -07:00
Denis Kirjanov
798dda818a net: page pool: allow to pass zero flags to page_pool_init()
page pool API can be useful for non-DMA cases like
xen-netfront driver so let's allow to pass zero flags to
page pool flags.

v2: check DMA direction only if PP_FLAG_DMA_MAP is set

Signed-off-by: Denis Kirjanov <kda@linux-powerpc.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 21:49:20 -07:00
Johannes Berg
be8c827f50 mac80211: fix authentication with iwlwifi/mvm
The original patch didn't copy the ieee80211_is_data() condition
because on most drivers the management frames don't go through
this path. However, they do on iwlwifi/mvm, so we do need to keep
the condition here.

Cc: stable@vger.kernel.org
Fixes: ce2e1ca703 ("mac80211: Check port authorization in the ieee80211_tx_dequeue() case")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 21:27:44 -07:00
David S. Miller
f0b5989745 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor comment conflict in mac80211.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 21:25:29 -07:00
wenxu
ef803b3cf9 netfilter: flowtable: add counter support in HW offload
Store the conntrack counters to the conntrack entry in the
HW flowtable offload.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30 02:05:40 +02:00
wenxu
9312eabab4 netfilter: conntrack: add nf_ct_acct_add()
Add nf_ct_acct_add function to update the conntrack counter
with packets and bytes.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30 02:05:39 +02:00
Pablo Neira Ayuso
d56aab2625 netfilter: nf_tables: skip set types that do not support for expressions
The bitmap set does not support for expressions, skip it from the
estimation step.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30 02:05:39 +02:00
Pablo Neira Ayuso
8548bde989 netfilter: nft_dynset: validate set expression definition
If the global set expression definition mismatches the dynset
expression, then bail out.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30 02:05:38 +02:00
Pablo Neira Ayuso
24791b9aa1 netfilter: nft_set_bitmap: initialize set element extension in lookups
Otherwise, nft_lookup might dereference an uninitialized pointer to the
element extension.

Fixes: 665153ff57 ("netfilter: nf_tables: add bitmap set type")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30 02:05:37 +02:00
Romain Bellan
7c6b412162 netfilter: ctnetlink: be more strict when NF_CONNTRACK_MARK is not set
When CONFIG_NF_CONNTRACK_MARK is not set, any CTA_MARK or CTA_MARK_MASK
in netlink message are not supported. We should return an error when one
of them is set, not both

Fixes: 9306425b70 ("netfilter: ctnetlink: must check mark attributes vs NULL")
Signed-off-by: Romain Bellan <romain.bellan@wifirst.fr>
Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30 02:05:36 +02:00
Florian Westphal
28f715b9e6 netfilter: nf_queue: prefer nf_queue_entry_free
Instead of dropping refs+kfree, use the helper added in previous patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-29 16:28:29 +02:00
Florian Westphal
af370ab36f netfilter: nf_queue: do not release refcouts until nf_reinject is done
nf_queue is problematic when another NF_QUEUE invocation happens
from nf_reinject().

1. nf_queue is invoked, increments state->sk refcount.
2. skb is queued, waiting for verdict.
3. sk is closed/released.
3. verdict comes back, nf_reinject is called.
4. nf_reinject drops the reference -- refcount can now drop to 0

Instead of get_ref/release_ref pattern, we need to nest the get_ref calls:
    get_ref
       get_ref
       release_ref
     release_ref

So that when we invoke the next processing stage (another netfilter
or the okfn()), we hold at least one reference count on the
devices/socket.

After previous patch, it is now safe to put the entry even after okfn()
has potentially free'd the skb.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-29 16:28:29 +02:00
Florian Westphal
119e52e664 netfilter: nf_queue: place bridge physports into queue_entry struct
The refcount is done via entry->skb, which does work fine.
Major problem: When putting the refcount of the bridge ports, we
must always put the references while the skb is still around.

However, we will need to put the references after okfn() to avoid
a possible 1 -> 0 -> 1 refcount transition, so we cannot use the
skb pointer anymore.

Place the physports in the queue entry structure instead to allow
for refcounting changes in the next patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-29 16:28:29 +02:00
Florian Westphal
dd3cc111f2 netfilter: nf_queue: make nf_queue_entry_release_refs static
This is a preparation patch, no logical changes.
Move free_entry into core and rename it to something more sensible.

Will ease followup patches which will complicate the refcount handling.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-29 16:28:29 +02:00
Waiman Long
d3ec10aa95 KEYS: Don't write out to userspace while holding key semaphore
A lockdep circular locking dependency report was seen when running a
keyutils test:

[12537.027242] ======================================================
[12537.059309] WARNING: possible circular locking dependency detected
[12537.088148] 4.18.0-147.7.1.el8_1.x86_64+debug #1 Tainted: G OE    --------- -  -
[12537.125253] ------------------------------------------------------
[12537.153189] keyctl/25598 is trying to acquire lock:
[12537.175087] 000000007c39f96c (&mm->mmap_sem){++++}, at: __might_fault+0xc4/0x1b0
[12537.208365]
[12537.208365] but task is already holding lock:
[12537.234507] 000000003de5b58d (&type->lock_class){++++}, at: keyctl_read_key+0x15a/0x220
[12537.270476]
[12537.270476] which lock already depends on the new lock.
[12537.270476]
[12537.307209]
[12537.307209] the existing dependency chain (in reverse order) is:
[12537.340754]
[12537.340754] -> #3 (&type->lock_class){++++}:
[12537.367434]        down_write+0x4d/0x110
[12537.385202]        __key_link_begin+0x87/0x280
[12537.405232]        request_key_and_link+0x483/0xf70
[12537.427221]        request_key+0x3c/0x80
[12537.444839]        dns_query+0x1db/0x5a5 [dns_resolver]
[12537.468445]        dns_resolve_server_name_to_ip+0x1e1/0x4d0 [cifs]
[12537.496731]        cifs_reconnect+0xe04/0x2500 [cifs]
[12537.519418]        cifs_readv_from_socket+0x461/0x690 [cifs]
[12537.546263]        cifs_read_from_socket+0xa0/0xe0 [cifs]
[12537.573551]        cifs_demultiplex_thread+0x311/0x2db0 [cifs]
[12537.601045]        kthread+0x30c/0x3d0
[12537.617906]        ret_from_fork+0x3a/0x50
[12537.636225]
[12537.636225] -> #2 (root_key_user.cons_lock){+.+.}:
[12537.664525]        __mutex_lock+0x105/0x11f0
[12537.683734]        request_key_and_link+0x35a/0xf70
[12537.705640]        request_key+0x3c/0x80
[12537.723304]        dns_query+0x1db/0x5a5 [dns_resolver]
[12537.746773]        dns_resolve_server_name_to_ip+0x1e1/0x4d0 [cifs]
[12537.775607]        cifs_reconnect+0xe04/0x2500 [cifs]
[12537.798322]        cifs_readv_from_socket+0x461/0x690 [cifs]
[12537.823369]        cifs_read_from_socket+0xa0/0xe0 [cifs]
[12537.847262]        cifs_demultiplex_thread+0x311/0x2db0 [cifs]
[12537.873477]        kthread+0x30c/0x3d0
[12537.890281]        ret_from_fork+0x3a/0x50
[12537.908649]
[12537.908649] -> #1 (&tcp_ses->srv_mutex){+.+.}:
[12537.935225]        __mutex_lock+0x105/0x11f0
[12537.954450]        cifs_call_async+0x102/0x7f0 [cifs]
[12537.977250]        smb2_async_readv+0x6c3/0xc90 [cifs]
[12538.000659]        cifs_readpages+0x120a/0x1e50 [cifs]
[12538.023920]        read_pages+0xf5/0x560
[12538.041583]        __do_page_cache_readahead+0x41d/0x4b0
[12538.067047]        ondemand_readahead+0x44c/0xc10
[12538.092069]        filemap_fault+0xec1/0x1830
[12538.111637]        __do_fault+0x82/0x260
[12538.129216]        do_fault+0x419/0xfb0
[12538.146390]        __handle_mm_fault+0x862/0xdf0
[12538.167408]        handle_mm_fault+0x154/0x550
[12538.187401]        __do_page_fault+0x42f/0xa60
[12538.207395]        do_page_fault+0x38/0x5e0
[12538.225777]        page_fault+0x1e/0x30
[12538.243010]
[12538.243010] -> #0 (&mm->mmap_sem){++++}:
[12538.267875]        lock_acquire+0x14c/0x420
[12538.286848]        __might_fault+0x119/0x1b0
[12538.306006]        keyring_read_iterator+0x7e/0x170
[12538.327936]        assoc_array_subtree_iterate+0x97/0x280
[12538.352154]        keyring_read+0xe9/0x110
[12538.370558]        keyctl_read_key+0x1b9/0x220
[12538.391470]        do_syscall_64+0xa5/0x4b0
[12538.410511]        entry_SYSCALL_64_after_hwframe+0x6a/0xdf
[12538.435535]
[12538.435535] other info that might help us debug this:
[12538.435535]
[12538.472829] Chain exists of:
[12538.472829]   &mm->mmap_sem --> root_key_user.cons_lock --> &type->lock_class
[12538.472829]
[12538.524820]  Possible unsafe locking scenario:
[12538.524820]
[12538.551431]        CPU0                    CPU1
[12538.572654]        ----                    ----
[12538.595865]   lock(&type->lock_class);
[12538.613737]                                lock(root_key_user.cons_lock);
[12538.644234]                                lock(&type->lock_class);
[12538.672410]   lock(&mm->mmap_sem);
[12538.687758]
[12538.687758]  *** DEADLOCK ***
[12538.687758]
[12538.714455] 1 lock held by keyctl/25598:
[12538.732097]  #0: 000000003de5b58d (&type->lock_class){++++}, at: keyctl_read_key+0x15a/0x220
[12538.770573]
[12538.770573] stack backtrace:
[12538.790136] CPU: 2 PID: 25598 Comm: keyctl Kdump: loaded Tainted: G
[12538.844855] Hardware name: HP ProLiant DL360 Gen9/ProLiant DL360 Gen9, BIOS P89 12/27/2015
[12538.881963] Call Trace:
[12538.892897]  dump_stack+0x9a/0xf0
[12538.907908]  print_circular_bug.isra.25.cold.50+0x1bc/0x279
[12538.932891]  ? save_trace+0xd6/0x250
[12538.948979]  check_prev_add.constprop.32+0xc36/0x14f0
[12538.971643]  ? keyring_compare_object+0x104/0x190
[12538.992738]  ? check_usage+0x550/0x550
[12539.009845]  ? sched_clock+0x5/0x10
[12539.025484]  ? sched_clock_cpu+0x18/0x1e0
[12539.043555]  __lock_acquire+0x1f12/0x38d0
[12539.061551]  ? trace_hardirqs_on+0x10/0x10
[12539.080554]  lock_acquire+0x14c/0x420
[12539.100330]  ? __might_fault+0xc4/0x1b0
[12539.119079]  __might_fault+0x119/0x1b0
[12539.135869]  ? __might_fault+0xc4/0x1b0
[12539.153234]  keyring_read_iterator+0x7e/0x170
[12539.172787]  ? keyring_read+0x110/0x110
[12539.190059]  assoc_array_subtree_iterate+0x97/0x280
[12539.211526]  keyring_read+0xe9/0x110
[12539.227561]  ? keyring_gc_check_iterator+0xc0/0xc0
[12539.249076]  keyctl_read_key+0x1b9/0x220
[12539.266660]  do_syscall_64+0xa5/0x4b0
[12539.283091]  entry_SYSCALL_64_after_hwframe+0x6a/0xdf

One way to prevent this deadlock scenario from happening is to not
allow writing to userspace while holding the key semaphore. Instead,
an internal buffer is allocated for getting the keys out from the
read method first before copying them out to userspace without holding
the lock.

That requires taking out the __user modifier from all the relevant
read methods as well as additional changes to not use any userspace
write helpers. That is,

  1) The put_user() call is replaced by a direct copy.
  2) The copy_to_user() call is replaced by memcpy().
  3) All the fault handling code is removed.

Compiling on a x86-64 system, the size of the rxrpc_read() function is
reduced from 3795 bytes to 2384 bytes with this patch.

Fixes: ^1da177e4c3f4 ("Linux-2.6.12-rc2")
Reviewed-by: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-29 12:40:41 +01:00
Linus Torvalds
e595dd9451 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix memory leak in vti6, from Torsten Hilbrich.

 2) Fix double free in xfrm_policy_timer, from YueHaibing.

 3) NL80211_ATTR_CHANNEL_WIDTH attribute is put with wrong type, from
    Johannes Berg.

 4) Wrong allocation failure check in qlcnic driver, from Xu Wang.

 5) Get ks8851-ml IO operations right, for real this time, from Marek
    Vasut.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (22 commits)
  r8169: fix PHY driver check on platforms w/o module softdeps
  net: ks8851-ml: Fix IO operations, again
  mlxsw: spectrum_mr: Fix list iteration in error path
  qlcnic: Fix bad kzalloc null test
  mac80211: set IEEE80211_TX_CTRL_PORT_CTRL_PROTO for nl80211 TX
  mac80211: mark station unauthorized before key removal
  mac80211: Check port authorization in the ieee80211_tx_dequeue() case
  cfg80211: Do not warn on same channel at the end of CSA
  mac80211: drop data frames without key on encrypted links
  ieee80211: fix HE SPR size calculation
  nl80211: fix NL80211_ATTR_CHANNEL_WIDTH attribute type
  xfrm: policy: Fix doulbe free in xfrm_policy_timer
  bpf: Explicitly memset some bpf info structures declared on the stack
  bpf: Explicitly memset the bpf_attr structure
  bpf: Sanitize the bpf_struct_ops tcp-cc name
  vti6: Fix memory leak of skb if input policy check fails
  esp: remove the skb from the chain when it's enqueued in cryptd_wq
  ipv6: xfrm6_tunnel.c: Use built-in RCU list checking
  xfrm: add the missing verify_sec_ctx_len check in xfrm_add_acquire
  xfrm: fix uctx len check in verify_sec_ctx_len
  ...
2020-03-28 18:55:15 -07:00
Toke Høiland-Jørgensen
92234c8f15 xdp: Support specifying expected existing program when attaching XDP
While it is currently possible for userspace to specify that an existing
XDP program should not be replaced when attaching to an interface, there is
no mechanism to safely replace a specific XDP program with another.

This patch adds a new netlink attribute, IFLA_XDP_EXPECTED_FD, which can be
set along with IFLA_XDP_FD. If set, the kernel will check that the program
currently loaded on the interface matches the expected one, and fail the
operation if it does not. This corresponds to a 'cmpxchg' memory operation.
Setting the new attribute with a negative value means that no program is
expected to be attached, which corresponds to setting the UPDATE_IF_NOEXIST
flag.

A new companion flag, XDP_FLAGS_REPLACE, is also added to explicitly
request checking of the EXPECTED_FD attribute. This is needed for userspace
to discover whether the kernel supports the new attribute.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/bpf/158515700640.92963.3551295145441017022.stgit@toke.dk
2020-03-28 14:24:41 -07:00
Jean-Philippe Menil
e9ff9d5254 bpf: Fix build warning regarding missing prototypes
Fix build warnings when building net/bpf/test_run.o with W=1 due
to missing prototype for bpf_fentry_test{1..6}.

Instead of declaring prototypes, turn off warnings with
__diag_{push,ignore,pop} as pointed out by Alexei.

Signed-off-by: Jean-Philippe Menil <jpmenil@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200327204713.28050-1-jpmenil@gmail.com
2020-03-28 18:13:18 +01:00
Trond Myklebust
f764a1e1cb NFSoRDMA Client Updates for Linux 5.7
New Features:
 - Allow one active connection and several zombie connections to prevent
   blocking if the remote server is unresponsive.
 
 Bugfixes and Cleanups:
 - Enhance MR-related trace points
 - Refactor connection set-up and disconnect functions
 - Make Protection Domains per-connection instead of per-transport
 - Merge struct rpcrdma_ia into rpcrdma_ep
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl5+GKEACgkQ18tUv7Cl
 QOvJVBAA4MNSvzo0856X414uUMe1YK1HXblqYbNWSe0bRsfaakp8mscMFmd4/IuG
 YSOlQPYcTErdfeEelS1ayMZA3qy5Ke62hKFFfv0fk11lN3VoT4XPiIqY6Ry2OJaR
 7XoYyH3YKJxgqOYhQwSo4rvFRNtPKbvEtcQn8uLXnMclomlZ1mli4F9rYs23SvLL
 Jbq1q/SKx0w+GAOUj418dYLjMnQU7L+TotplTBGzE7mkhBR0JarBiKZfLxbg2/md
 Fva1EwY4Tgef+4ofVr5KP8raGVGl6zy8TIZezlijeUv4rmk9JudrYjeUNqyCWufw
 ffkrM+wPr8kQViQoei5ByEvVG2pQK9nfxccx6rxiIOoG8Mb1aK8Wh75e2kjfZZeY
 bsEx8wSfRvI54ukk/NfGpP21OP1DDKU4ET6hsW+b6onqmDfYPZ/LinUvrvIHietQ
 IsMOG+QE+G1mXFiQR1P9dZcIwTNvE/SyUoaykbeslcYCkQt5yEk25UK+38TornbV
 hBxrm9KBd25RUu8prFJadmLBjPx3tLLd451xxppCyrIlpw0b+Q4yHlF7Jr75DYBT
 xjW5Xpr6tJZuoybbSyT9B/tiV4gIfyJZCU5S95nFlluKZ0gQd2jIGj/JagNWppke
 XSEqRR/Ja0OnB7yDz60gQrrZnjw/Em2jx7xbHIXFK+rs45dkdFc=
 =piIM
 -----END PGP SIGNATURE-----

Merge tag 'nfs-rdma-for-5.7-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

NFSoRDMA Client Updates for Linux 5.7

New Features:
- Allow one active connection and several zombie connections to prevent
  blocking if the remote server is unresponsive.

Bugfixes and Cleanups:
- Enhance MR-related trace points
- Refactor connection set-up and disconnect functions
- Make Protection Domains per-connection instead of per-transport
- Merge struct rpcrdma_ia into rpcrdma_ep
2020-03-28 12:01:17 -04:00
Daniel Borkmann
834ebca845 bpf: Enable retrival of pid/tgid/comm from bpf cgroup hooks
We already have the bpf_get_current_uid_gid() helper enabled, and
given we now have perf event RB output available for connect(),
sendmsg(), recvmsg() and bind-related hooks, add a trivial change
to enable bpf_get_current_pid_tgid() and bpf_get_current_comm()
as well.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/18744744ed93c06343be8b41edcfd858706f39d7.1585323121.git.daniel@iogearbox.net
2020-03-27 19:40:39 -07:00
Daniel Borkmann
0f09abd105 bpf: Enable bpf cgroup hooks to retrieve cgroup v2 and ancestor id
Enable the bpf_get_current_cgroup_id() helper for connect(), sendmsg(),
recvmsg() and bind-related hooks in order to retrieve the cgroup v2
context which can then be used as part of the key for BPF map lookups,
for example. Given these hooks operate in process context 'current' is
always valid and pointing to the app that is performing mentioned
syscalls if it's subject to a v2 cgroup. Also with same motivation of
commit 7723628101 ("bpf: Introduce bpf_skb_ancestor_cgroup_id helper")
enable retrieval of ancestor from current so the cgroup id can be used
for policy lookups which can then forbid connect() / bind(), for example.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/d2a7ef42530ad299e3cbb245e6c12374b72145ef.1585323121.git.daniel@iogearbox.net
2020-03-27 19:40:39 -07:00
Daniel Borkmann
5a52ae4e32 bpf: Allow to retrieve cgroup v1 classid from v2 hooks
Today, Kubernetes is still operating on cgroups v1, however, it is
possible to retrieve the task's classid based on 'current' out of
connect(), sendmsg(), recvmsg() and bind-related hooks for orchestrators
which attach to the root cgroup v2 hook in a mixed env like in case
of Cilium, for example, in order to then correlate certain pod traffic
and use it as part of the key for BPF map lookups.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/555e1c69db7376c0947007b4951c260e1074efc3.1585323121.git.daniel@iogearbox.net
2020-03-27 19:40:38 -07:00
Daniel Borkmann
f318903c0b bpf: Add netns cookie and enable it for bpf cgroup hooks
In Cilium we're mainly using BPF cgroup hooks today in order to implement
kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
between Cilium managed nodes. While this works in its current shape and avoids
packet-level NAT for inter Cilium managed node traffic, there is one major
limitation we're facing today, that is, lack of netns awareness.

In Kubernetes, the concept of Pods (which hold one or multiple containers)
has been built around network namespaces, so while we can use the global scope
of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
NodePort ports on loopback addresses), we also have the need to differentiate
between initial network namespaces and non-initial one. For example, ExternalIP
services mandate that non-local service IPs are not to be translated from the
host (initial) network namespace as one example. Right now, we have an ugly
work-around in place where non-local service IPs for ExternalIP services are
not xlated from connect() and friends BPF hooks but instead via less efficient
packet-level NAT on the veth tc ingress hook for Pod traffic.

On top of determining whether we're in initial or non-initial network namespace
we also have a need for a socket-cookie like mechanism for network namespaces
scope. Socket cookies have the nice property that they can be combined as part
of the key structure e.g. for BPF LRU maps without having to worry that the
cookie could be recycled. We are planning to use this for our sessionAffinity
implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
provide the cookie for the initial network namespace while passing the context
instead of NULL would provide the cookie from the application's network namespace.
We're using a hole, so no size increase; the assignment happens only once.
Therefore this allows for a comparison on initial namespace as well as regular
cookie usage as we have today with socket cookies. We could later on enable
this helper for other program types as well as we would see need.

  (*) Both externalTrafficPolicy={Local|Cluster} types
  [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
2020-03-27 19:40:38 -07:00
Daniel Borkmann
fcf752ea8c bpf: Enable perf event rb output for bpf cgroup progs
Currently, connect(), sendmsg(), recvmsg() and bind-related hooks
are all lacking perf event rb output in order to push notifications
or monitoring events up to user space. Back in commit a5a3a828cd
("bpf: add perf event notificaton support for sock_ops"), I've worked
with Sowmini to enable them for sock_ops where the context part is
not used (as opposed to skbs for example where the packet data can
be appended). Make the bpf_sockopt_event_output() helper generic and
enable it for mentioned hooks.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/69c39daf87e076b31e52473c902e9bfd37559124.1585323121.git.daniel@iogearbox.net
2020-03-27 19:40:38 -07:00
Daniel Borkmann
0e53d9e5e8 bpf: Enable retrieval of socket cookie for bind/post-bind hook
We currently make heavy use of the socket cookie in BPF's connect(),
sendmsg() and recvmsg() hooks for load-balancing decisions. However,
it is currently not enabled/implemented in BPF {post-}bind hooks
where it can later be used in combination for correlation in the tc
egress path, for example.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/e9d71f310715332f12d238cc650c1edc5be55119.1585323121.git.daniel@iogearbox.net
2020-03-27 19:40:38 -07:00
David S. Miller
a0ba26f37e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-03-27

The following pull-request contains BPF updates for your *net* tree.

We've added 3 non-merge commits during the last 4 day(s) which contain
a total of 4 files changed, 25 insertions(+), 20 deletions(-).

The main changes are:

1) Explicitly memset the bpf_attr structure on bpf() syscall to avoid
   having to rely on compiler to do so. Issues have been noticed on
   some compilers with padding and other oddities where the request was
   then unexpectedly rejected, from Greg Kroah-Hartman.

2) Sanitize the bpf_struct_ops TCP congestion control name in order to
   avoid problematic characters such as whitespaces, from Martin KaFai Lau.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 16:18:51 -07:00
Vladimir Oltean
bff33f7e2a net: dsa: implement auto-normalization of MTU for bridge hardware datapath
Many switches don't have an explicit knob for configuring the MTU
(maximum transmission unit per interface).  Instead, they do the
length-based packet admission checks on the ingress interface, for
reasons that are easy to understand (why would you accept a packet in
the queuing subsystem if you know you're going to drop it anyway).

So it is actually the MRU that these switches permit configuring.

In Linux there only exists the IFLA_MTU netlink attribute and the
associated dev_set_mtu function. The comments like to play blind and say
that it's changing the "maximum transfer unit", which is to say that
there isn't any directionality in the meaning of the MTU word. So that
is the interpretation that this patch is giving to things: MTU == MRU.

When 2 interfaces having different MTUs are bridged, the bridge driver
MTU auto-adjustment logic kicks in: what br_mtu_auto_adjust() does is it
adjusts the MTU of the bridge net device itself (and not that of the
slave net devices) to the minimum value of all slave interfaces, in
order for forwarded packets to not exceed the MTU regardless of the
interface they are received and send on.

The idea behind this behavior, and why the slave MTUs are not adjusted,
is that normal termination from Linux over the L2 forwarding domain
should happen over the bridge net device, which _is_ properly limited by
the minimum MTU. And termination over individual slave devices is
possible even if those are bridged. But that is not "forwarding", so
there's no reason to do normalization there, since only a single
interface sees that packet.

The problem with those switches that can only control the MRU is with
the offloaded data path, where a packet received on an interface with
MRU 9000 would still be forwarded to an interface with MRU 1500. And the
br_mtu_auto_adjust() function does not really help, since the MTU
configured on the bridge net device is ignored.

In order to enforce the de-facto MTU == MRU rule for these switches, we
need to do MTU normalization, which means: in order for no packet larger
than the MTU configured on this port to be sent, then we need to limit
the MRU on all ports that this packet could possibly come from. AKA
since we are configuring the MRU via MTU, it means that all ports within
a bridge forwarding domain should have the same MTU.

And that is exactly what this patch is trying to do.

>From an implementation perspective, we try to follow the intent of the
user, otherwise there is a risk that we might livelock them (they try to
change the MTU on an already-bridged interface, but we just keep
changing it back in an attempt to keep the MTU normalized). So the MTU
that the bridge is normalized to is either:

 - The most recently changed one:

   ip link set dev swp0 master br0
   ip link set dev swp1 master br0
   ip link set dev swp0 mtu 1400

   This sequence will make swp1 inherit MTU 1400 from swp0.

 - The one of the most recently added interface to the bridge:

   ip link set dev swp0 master br0
   ip link set dev swp1 mtu 1400
   ip link set dev swp1 master br0

   The above sequence will make swp0 inherit MTU 1400 as well.

Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 16:07:25 -07:00
Vladimir Oltean
bfcb813203 net: dsa: configure the MTU for switch ports
It is useful be able to configure port policers on a switch to accept
frames of various sizes:

- Increase the MTU for better throughput from the default of 1500 if it
  is known that there is no 10/100 Mbps device in the network.
- Decrease the MTU to limit the latency of high-priority frames under
  congestion, or work around various network segments that add extra
  headers to packets which can't be fragmented.

For DSA slave ports, this is mostly a pass-through callback, called
through the regular ndo ops and at probe time (to ensure consistency
across all supported switches).

The CPU port is called with an MTU equal to the largest configured MTU
of the slave ports. The assumption is that the user might want to
sustain a bidirectional conversation with a partner over any switch
port.

The DSA master is configured the same as the CPU port, plus the tagger
overhead. Since the MTU is by definition L2 payload (sans Ethernet
header), it is up to each individual driver to figure out if it needs to
do anything special for its frame tags on the CPU port (it shouldn't
except in special cases). So the MTU does not contain the tagger
overhead on the CPU port.
However the MTU of the DSA master, minus the tagger overhead, is used as
a proxy for the MTU of the CPU port, which does not have a net device.
This is to avoid uselessly calling the .change_mtu function on the CPU
port when nothing should change.

So it is safe to assume that the DSA master and the CPU port MTUs are
apart by exactly the tagger's overhead in bytes.

Some changes were made around dsa_master_set_mtu(), function which was
now removed, for 2 reasons:
  - dev_set_mtu() already calls dev_validate_mtu(), so it's redundant to
    do the same thing in DSA
  - __dev_set_mtu() returns 0 if ops->ndo_change_mtu is an absent method
That is to say, there's no need for this function in DSA, we can safely
call dev_set_mtu() directly, take the rtnl lock when necessary, and just
propagate whatever errors get reported (since the user probably wants to
be informed).

Some inspiration (mainly in the MTU DSA notifier) was taken from a
vaguely similar patch from Murali and Florian, who are credited as
co-developers down below.

Co-developed-by: Murali Krishna Policharla <murali.policharla@broadcom.com>
Signed-off-by: Murali Krishna Policharla <murali.policharla@broadcom.com>
Co-developed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 16:07:24 -07:00
David S. Miller
e00dd941ff Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2020-03-27

1) Handle NETDEV_UNREGISTER for xfrm device to handle asynchronous
   unregister events cleanly. From Raed Salem.

2) Fix vti6 tunnel inter address family TX through bpf_redirect().
   From Nicolas Dichtel.

3) Fix lenght check in verify_sec_ctx_len() to avoid a
   slab-out-of-bounds. From Xin Long.

4) Add a missing verify_sec_ctx_len check in xfrm_add_acquire
   to avoid a possible out-of-bounds to access. From Xin Long.

5) Use built-in RCU list checking of hlist_for_each_entry_rcu
   to silence false lockdep warning in __xfrm6_tunnel_spi_lookup
   when CONFIG_PROVE_RCU_LIST is enabled. From Madhuparna Bhowmik.

6) Fix a panic on esp offload when crypto is done asynchronously.
   From Xin Long.

7) Fix a skb memory leak in an error path of vti6_rcv.
   From Torsten Hilbrich.

8) Fix a race that can lead to a doulbe free in xfrm_policy_timer.
   From Xin Long.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 14:56:55 -07:00
Paul Blakey
7da182a998 netfilter: flowtable: Use work entry per offload command
To allow offload commands to execute in parallel, create workqueue
for flow table offload, and use a work entry per offload command.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27 18:42:21 +01:00
Paul Blakey
422c032afc netfilter: flowtable: Use rw sem as flow block lock
Currently flow offload threads are synchronized by the flow block mutex.
Use rw lock instead to increase flow insertion (read) concurrency.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27 18:42:20 +01:00
Qian Cai
0a6a9515fe netfilter: nf_tables: silence a RCU-list warning in nft_table_lookup()
It is safe to traverse &net->nft.tables with &net->nft.commit_mutex
held using list_for_each_entry_rcu(). Silence the PROVE_RCU_LIST false
positive,

WARNING: suspicious RCU usage
net/netfilter/nf_tables_api.c:523 RCU-list traversed in non-reader section!!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
1 lock held by iptables/1384:
 #0: ffffffff9745c4a8 (&net->nft.commit_mutex){+.+.}, at: nf_tables_valid_genid+0x25/0x60 [nf_tables]

Call Trace:
 dump_stack+0xa1/0xea
 lockdep_rcu_suspicious+0x103/0x10d
 nft_table_lookup.part.0+0x116/0x120 [nf_tables]
 nf_tables_newtable+0x12c/0x7d0 [nf_tables]
 nfnetlink_rcv_batch+0x559/0x1190 [nfnetlink]
 nfnetlink_rcv+0x1da/0x210 [nfnetlink]
 netlink_unicast+0x306/0x460
 netlink_sendmsg+0x44b/0x770
 ____sys_sendmsg+0x46b/0x4a0
 ___sys_sendmsg+0x138/0x1a0
 __sys_sendmsg+0xb6/0x130
 __x64_sys_sendmsg+0x48/0x50
 do_syscall_64+0x69/0xf4
 entry_SYSCALL_64_after_hwframe+0x49/0xb3

Signed-off-by: Qian Cai <cai@lca.pw>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27 18:42:20 +01:00
wenxu
133a2fe594 netfilter: flowtable: Fix incorrect tc_setup_type type
The indirect block setup should use TC_SETUP_FT as the type instead of
TC_SETUP_BLOCK. Adjust existing users of the indirect flow block
infrastructure.

Fixes: b5140a36da ("netfilter: flowtable: add indr block setup support")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27 18:41:52 +01:00
Pablo Neira Ayuso
53c2b2899a netfilter: flowtable: add counter support
Add a new flag to turn on flowtable counters which are stored in the
conntrack entry.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27 18:32:37 +01:00
Pablo Neira Ayuso
cfbd1125fc netfilter: nf_tables: add enum nft_flowtable_flags to uapi
Expose the NFT_FLOWTABLE_HW_OFFLOAD flag through uapi.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27 18:32:36 +01:00
Pablo Neira Ayuso
8ac2bd3577 netfilter: conntrack: export nf_ct_acct_update()
This function allows you to update the conntrack counters.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27 18:32:36 +01:00
Haishuang Yan
73348fed35 ipvs: optimize tunnel dumps for icmp errors
After strip GRE/UDP tunnel header for icmp errors, it's better to show
"GRE/UDP" instead of "IPIP" in debug message.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27 18:31:01 +01:00
Jules Irenge
6b36d4829c netfilter: conntrack: Add missing annotations for nf_conntrack_all_lock() and nf_conntrack_all_unlock()
Sparse reports warnings at nf_conntrack_all_lock()
	and nf_conntrack_all_unlock()

warning: context imbalance in nf_conntrack_all_lock()
	- wrong count at exit
warning: context imbalance in nf_conntrack_all_unlock()
	- unexpected unlock

Add the missing __acquires(&nf_conntrack_locks_all_lock)
Add missing __releases(&nf_conntrack_locks_all_lock)

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27 18:22:06 +01:00
Jules Irenge
19f8f717f6 netfilter: ctnetlink: Add missing annotation for ctnetlink_parse_nat_setup()
Sparse reports a warning at ctnetlink_parse_nat_setup()

warning: context imbalance in ctnetlink_parse_nat_setup()
	- unexpected unlock

The root cause is the missing annotation at ctnetlink_parse_nat_setup()
Add the missing __must_hold(RCU) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27 18:21:09 +01:00
Chuck Lever
1a33d8a284 svcrdma: Fix leak of transport addresses
Kernel memory leak detected:

unreferenced object 0xffff888849cdf480 (size 8):
  comm "kworker/u8:3", pid 2086, jiffies 4297898756 (age 4269.856s)
  hex dump (first 8 bytes):
    30 00 cd 49 88 88 ff ff                          0..I....
  backtrace:
    [<00000000acfc370b>] __kmalloc_track_caller+0x137/0x183
    [<00000000a2724354>] kstrdup+0x2b/0x43
    [<0000000082964f84>] xprt_rdma_format_addresses+0x114/0x17d [rpcrdma]
    [<00000000dfa6ed00>] xprt_setup_rdma_bc+0xc0/0x10c [rpcrdma]
    [<0000000073051a83>] xprt_create_transport+0x3f/0x1a0 [sunrpc]
    [<0000000053531a8e>] rpc_create+0x118/0x1cd [sunrpc]
    [<000000003a51b5f8>] setup_callback_client+0x1a5/0x27d [nfsd]
    [<000000001bd410af>] nfsd4_process_cb_update.isra.7+0x16c/0x1ac [nfsd]
    [<000000007f4bbd56>] nfsd4_run_cb_work+0x4c/0xbd [nfsd]
    [<0000000055c5586b>] process_one_work+0x1b2/0x2fe
    [<00000000b1e3e8ef>] worker_thread+0x1a6/0x25a
    [<000000005205fb78>] kthread+0xf6/0xfb
    [<000000006d2dc057>] ret_from_fork+0x3a/0x50

Introduce a call to xprt_rdma_free_addresses() similar to the way
that the TCP backchannel releases a transport's peer address
strings.

Fixes: 5d252f90a8 ("svcrdma: Add class for RDMA backwards direction transport")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-27 12:25:06 -04:00
Christophe JAILLET
b25b60d7bf SUNRPC: Fix a potential buffer overflow in 'svc_print_xprts()'
'maxlen' is the total size of the destination buffer. There is only one
caller and this value is 256.

When we compute the size already used and what we would like to add in
the buffer, the trailling NULL character is not taken into account.
However, this trailling character will be added by the 'strcat' once we
have checked that we have enough place.

So, there is a off-by-one issue and 1 byte of the stack could be
erroneously overwridden.

Take into account the trailling NULL, when checking if there is enough
place in the destination buffer.


While at it, also replace a 'sprintf' by a safer 'snprintf', check for
output truncation and avoid a superfluous 'strlen'.

Fixes: dc9a16e49d ("svc: Add /proc/sys/sunrpc/transport files")
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
[ cel: very minor fix to documenting comment
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-27 12:22:57 -04:00
Chuck Lever
e28ce90083 xprtrdma: kmalloc rpcrdma_ep separate from rpcrdma_xprt
Change the rpcrdma_xprt_disconnect() function so that it no longer
waits for the DISCONNECTED event.  This prevents blocking if the
remote is unresponsive.

In rpcrdma_xprt_disconnect(), the transport's rpcrdma_ep is
detached. Upon return from rpcrdma_xprt_disconnect(), the transport
(r_xprt) is ready immediately for a new connection.

The RDMA_CM_DEVICE_REMOVAL and RDMA_CM_DISCONNECTED events are now
handled almost identically.

However, because the lifetimes of rpcrdma_xprt structures and
rpcrdma_ep structures are now independent, creating an rpcrdma_ep
needs to take a module ref count. The ep now owns most of the
hardware resources for a transport.

Also, a kref is needed to ensure that rpcrdma_ep sticks around
long enough for the cm_event_handler to finish.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:25 -04:00
Chuck Lever
745b734c9b xprtrdma: Extract sockaddr from struct rdma_cm_id
rpcrdma_cm_event_handler() is always passed an @id pointer that is
valid. However, in a subsequent patch, we won't be able to extract
an r_xprt in every case. So instead of using the r_xprt's
presentation address strings, extract them from struct rdma_cm_id.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:25 -04:00
Chuck Lever
93aa8e0a9d xprtrdma: Merge struct rpcrdma_ia into struct rpcrdma_ep
I eventually want to allocate rpcrdma_ep separately from struct
rpcrdma_xprt so that on occasion there can be more than one ep per
xprt.

The new struct rpcrdma_ep will contain all the fields currently in
rpcrdma_ia and in rpcrdma_ep. This is all the device and CM settings
for the connection, in addition to per-connection settings
negotiated with the remote.

Take this opportunity to rename the existing ep fields from rep_* to
re_* to disambiguate these from struct rpcrdma_rep.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:25 -04:00
Chuck Lever
d6ccebf956 xprtrdma: Disconnect on flushed completion
Completion errors after a disconnect often occur much sooner than a
CM_DISCONNECT event. Use this to try to detect connection loss more
quickly.

Note that other kernel ULPs do take care to disconnect explicitly
when a WR is flushed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:25 -04:00
Chuck Lever
897b7be9bc xprtrdma: Remove rpcrdma_ia::ri_flags
Clean up:
The upper layer serializes calls to xprt_rdma_close, so there is no
need for an atomic bit operation, saving 8 bytes in rpcrdma_ia.

This enables merging rpcrdma_ia_remove directly into the disconnect
logic.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:25 -04:00
Chuck Lever
81fe0c57f4 xprtrdma: Invoke rpcrdma_ia_open in the connect worker
Move rdma_cm_id creation into rpcrdma_ep_create() so that it is now
responsible for allocating all per-connection hardware resources.

With this clean-up, all three arms of the switch statement in
rpcrdma_ep_connect are exactly the same now, thus the switch can be
removed.

Because device removal behaves a little differently than
disconnection, there is a little more work to be done before
rpcrdma_ep_destroy() can release the connection's rdma_cm_id. So
it is not quite symmetrical with rpcrdma_ep_create() yet.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Chuck Lever
9ba373ee24 xprtrdma: Allocate Protection Domain in rpcrdma_ep_create()
Make a Protection Domain (PD) a per-connection resource rather than
a per-transport resource. In other words, when the connection
terminates, the PD is destroyed.

Thus there is one less HW resource that remains allocated to a
transport after a connection is closed.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Chuck Lever
9144a803df xprtrdma: Refactor rpcrdma_ep_connect() and rpcrdma_ep_disconnect()
Clean up: Simplify the synopses of functions in the connect and
disconnect paths in preparation for combining the rpcrdma_ia and
struct rpcrdma_ep structures.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Chuck Lever
97d0de8812 xprtrdma: Clean up the post_send path
Clean up: Simplify the synopses of functions in the post_send path
by combining the struct rpcrdma_ia and struct rpcrdma_ep arguments.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Chuck Lever
253a51622f xprtrdma: Refactor frwr_init_mr()
Clean up: prepare for combining the rpcrdma_ia and rpcrdma_ep
structures. Take the opportunity to rename the function to be
consistent with the "subsystem _ object _ verb" naming scheme.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Chuck Lever
85cd8e2b78 xprtrdma: Invoke rpcrdma_ep_create() in the connect worker
Refactor rpcrdma_ep_create(), rpcrdma_ep_disconnect(), and
rpcrdma_ep_destroy().

rpcrdma_ep_create will be invoked at connect time instead of at
transport set-up time. It will be responsible for allocating per-
connection resources. In this patch it allocates the CQs and
creates a QP. More to come.

rpcrdma_ep_destroy() is the inverse functionality that is
invoked at disconnect time. It will be responsible for releasing
the CQs and QP.

These changes should be safe to do because both connect and
disconnect is guaranteed to be serialized by the transport send
lock.

This takes us another step closer to resolving the address and route
only at connect time so that connection failover to another device
will work correctly.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Chuck Lever
62a89501a3 xprtrdma: Enhance MR-related trace points
Two changes:
- Show the number of SG entries that were mapped. This helps debug
  DMA-related problems.
- Record the MR's resource ID instead of its memory address. This
  groups each MR with its associated rdma-tool output, and reduces
  needless exposure of memory addresses.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-03-27 10:47:24 -04:00
Sergey Alirzaev
388f6966b0 9pnet: allow making incomplete read requests
A user doesn't necessarily want to wait for all the requested data to
be available, since the waiting time for each request is unbounded.

The new method permits sending one read request at a time and getting
the response ASAP, allowing to use 9pnet with synthetic file systems
representing arbitrary data streams.

Link: http://lkml.kernel.org/r/20200205204053.12751-1-l29ah@cock.li
Signed-off-by: Sergey Alirzaev <l29ah@cock.li>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2020-03-27 09:29:56 +00:00
Antoine Tenart
5908220b2b net: introduce the MACSEC netdev feature
This patch introduce a new netdev feature, which will be used by drivers
to state they can perform MACsec transformations in hardware.

The patchset was gathered by Mark, macsec functinality itself
was implemented by Dmitry, Mark and Pavel Belous.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: Mark Starovoytov <mstarovoitov@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 20:17:36 -07:00
Maciej Żenczykowski
c24a77edc9 ipv6: ndisc: add support for 'PREF64' dns64 prefix identifier
This is trivial since we already have support for the entirely
identical (from the kernel's point of view) RDNSS, DNSSL, etc. that
also contain opaque data that needs to be passed down to userspace
for further processing.

As specified in draft-ietf-6man-ra-pref64-09 (while it is still a draft,
it is purely waiting on the RFC Editor for cleanups and publishing):
  PREF64 option contains lifetime and a (up to) 96-bit IPv6 prefix.

The 8-bit identifier of the option type as assigned by the IANA is 38.

Since we lack DNS64/NAT64/CLAT support in kernel at the moment,
thus this option should also be passed on to userland.

See:
  https://tools.ietf.org/html/draft-ietf-6man-ra-pref64-09
  https://www.iana.org/assignments/icmpv6-parameters/icmpv6-parameters.xhtml#icmpv6-parameters-5

Cc: Erik Kline <ek@google.com>
Cc: Jen Linkova <furry@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Michael Haro <mharo@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Acked-By: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 20:05:58 -07:00
Guillaume Nault
e304e21a2b cls_flower: Add extack support for flags key
Pass extack down to fl_set_key_flags() and set message on error.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:52:31 -07:00
Guillaume Nault
bd7d4c1281 cls_flower: Add extack support for src and dst port range options
Pass extack down to fl_set_key_port_range() and set message on error.

Both the min and max ports would qualify as invalid attributes here.
Report the min one as invalid, as it's probably what makes the most
sense from a user point of view.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:52:31 -07:00
Guillaume Nault
442f730e48 cls_flower: Add extack support for mpls options
Pass extack down to fl_set_key_mpls() and set message on error.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:52:31 -07:00
Jacob Keller
b9a17abfde devlink: implement DEVLINK_CMD_REGION_NEW
Implement support for the DEVLINK_CMD_REGION_NEW command for creating
snapshots. This new command parallels the existing
DEVLINK_CMD_REGION_DEL.

In order for DEVLINK_CMD_REGION_NEW to work for a region, the new
".snapshot" operation must be implemented in the region's ops structure.

The desired snapshot id must be provided. This helps avoid confusion on
the purpose of DEVLINK_CMD_REGION_NEW, and keeps the API simpler.

The requested id will be inserted into the xarray tracking the number of
snapshots using each id. If this id is already used by another snapshot
on any region, an error will be returned.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:39:26 -07:00
Jacob Keller
12102436ac devlink: track snapshot id usage count using an xarray
Each snapshot created for a devlink region must have an id. These ids
are supposed to be unique per "event" that caused the snapshot to be
created. Drivers call devlink_region_snapshot_id_get to obtain a new id
to use for a new event trigger. The id values are tracked per devlink,
so that the same id number can be used if a triggering event creates
multiple snapshots on different regions.

There is no mechanism for snapshot ids to ever be reused. Introduce an
xarray to store the count of how many snapshots are using a given id,
replacing the snapshot_id field previously used for picking the next id.

The devlink_region_snapshot_id_get() function will use xa_alloc to
insert an initial value of 1 value at an available slot between 0 and
U32_MAX.

The new __devlink_snapshot_id_increment() and
__devlink_snapshot_id_decrement() functions will be used to track how
many snapshots currently use an id.

Drivers must now call devlink_snapshot_id_put() in order to release
their reference of the snapshot id after adding region snapshots.

By tracking the total number of snapshots using a given id, it is
possible for the decrement() function to erase the id from the xarray
when it is not in use.

With this method, a snapshot id can become reused again once all
snapshots that referred to it have been deleted via
DEVLINK_CMD_REGION_DEL, and the driver has finished adding snapshots.

This work also paves the way to introduce a mechanism for userspace to
request a snapshot.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:39:26 -07:00
Jacob Keller
7ef19d3b1d devlink: report error once U32_MAX snapshot ids have been used
The devlink_snapshot_id_get() function returns a snapshot id. The
snapshot id is a u32, so there is no way to indicate an error code.

A future change is going to possibly add additional cases where this
function could fail. Refactor the function to return the snapshot id in
an argument, so that it can return zero or an error value.

This ensures that snapshot ids cannot be confused with error values, and
aids in the future refactor of snapshot id allocation management.

Because there is no current way to release previously used snapshot ids,
add a simple check ensuring that an error is reported in case the
snapshot_id would over flow.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:39:26 -07:00
Jacob Keller
7000108f27 devlink: extract snapshot id allocation to helper function
A future change is going to implement a new devlink command to request
a snapshot on demand. As part of this, the logic for handling the
snapshot ids will be refactored. To simplify the snapshot id allocation
function, move it to a separate function prefixed by `__`. This helper
function will assume the lock is held.

While no other callers will exist, it simplifies refactoring the logic
because there is no need to complicate the function with gotos to handle
unlocking on failure.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:39:26 -07:00
Jacob Keller
47a39f6105 devlink: use -ENOSPC to indicate no more room for snapshots
The devlink_region_snapshot_create function returns -ENOMEM when the
maximum number of snapshots has been reached. This is confusing because
it is not an issue of being out of memory. Change this to use -ENOSPC
instead.

Reported-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:39:26 -07:00
Jacob Keller
cf80faee79 devlink: add function to take snapshot while locked
A future change is going to add a new devlink command to request
a snapshot on demand. This function will want to call the
devlink_region_snapshot_create function while already holding the
devlink instance lock.

Extract the logic of this function into a static function prefixed by
`__` to indicate that it is an internal helper function. Modify the
original function to be implemented in terms of the new locked
function.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:39:26 -07:00
Jacob Keller
6d82f67e25 devlink: trivial: fix tab in function documentation
The function documentation comment for devlink_region_snapshot_create
included a literal tab character between 'future analyses' that was
difficult to spot as it happened to only display as one space wide.

Fix the comment to use a space here instead of a stray tab appearing in
the middle of a sentence.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:39:26 -07:00
Jacob Keller
a0a09f6bb2 devlink: convert snapshot destructor callback to region op
It does not makes sense that two snapshots for a given region would use
different destructors. Simplify snapshot creation by adding
a .destructor op for regions.

This operation will replace the data_destructor for the snapshot
creation, and makes snapshot creation easier.

Noticed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:39:26 -07:00
Jacob Keller
e893768179 devlink: prepare to support region operations
Modify the devlink region code in preparation for adding new operations
on regions.

Create a devlink_region_ops structure, and move the name pointer from
within the devlink_region structure into the ops structure (similar to
the devlink_health_reporter_ops).

This prepares the regions to enable support of additional operations in
the future such as requesting snapshots, or accessing the region
directly without a snapshot.

In order to re-use the constant strings in the mlx4 driver their
declaration must be changed to 'const char * const' to ensure the
compiler realizes that both the data and the pointer cannot change.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:39:26 -07:00
Petr Machata
d4d9d9c53b sched: act_pedit: Implement stats_update callback
Implement this callback in order to get the offloaded stats added to the
kernel stats.

Reported-by: Alexander Petrovskiy <alexpe@mellanox.com>
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:20:37 -07:00
Petr Machata
837cb17dd6 sched: act_skbedit: Implement stats_update callback
Implement this callback in order to get the offloaded stats added to the
kernel stats.

Reported-by: Alexander Petrovskiy <alexpe@mellanox.com>
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:20:37 -07:00
Linus Torvalds
60268940cd A patch for a rather old regression in fullness handling and two memory
leak fixes, marked for stable.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl59DCwTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi5oGB/943a7gIBV52PD3MGCnI8RWjgHkk3d0
 en2JNI6i7hf7GD7GplMGkc0D8INBJhCZo1mwzX36QXYA3BeXKARkNXvEE+AZ4dX5
 XbUiPE5WuUwxcT9sE9rTiCurx1ToN/XUlA27Vbt9J67U08w5BjJ3utO1LuW7z2ME
 NPx6aw6tdwIEeNJBo4ge8y9vPKevtXqhkCbzSb2kn+tMhoMPuJ3RIj8kWIF7mYWZ
 ofwOFoDnOfQuH+9ZA/mT4jL7ifR0am5QptHSD9kxge2mKlc0pmoABZK6sWNPOslg
 jQaEiefH77K/IxRyAsQNM7iHbUzKpZGbqAHx92MU0redUjUWNdCDGUmF
 =c01Y
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.6-rc8' of git://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:
 "A patch for a rather old regression in fullness handling and two
  memory leak fixes, marked for stable"

* tag 'ceph-for-5.6-rc8' of git://github.com/ceph/ceph-client:
  ceph: fix memory leak in ceph_cleanup_snapid_map()
  libceph: fix alloc_msg_with_page_vector() memory leaks
  ceph: check POOL_FLAG_FULL/NEARFULL in addition to OSDMAP_FULL/NEARFULL
2020-03-26 15:44:41 -07:00
David S. Miller
328f5bb993 We have the following fixes:
* drop data packets if there's no key for them anymore, after
    there had been one, to avoid sending them in clear when
    hostapd removes the key before it removes the station and
    the packets are still queued
  * check port authorization again after dequeue, to avoid
    sending packets if the station is no longer authorized
  * actually remove the authorization flag before the key so
    packets are also dropped properly because of this
  * fix nl80211 control port packet tagging to handle them as
    packets allowed to go out without encryption
  * fix NL80211_ATTR_CHANNEL_WIDTH outgoing netlink attribute
    width (should be 32 bits, not 8)
  * don't WARN in a CSA scenario that happens on some APs
  * fix HE spatial reuse element size calculation
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl58xGkACgkQB8qZga/f
 l8QfVA//XNwrqiA8nqZ0ihStgtYqZgLHIoMmeYtPxf1QHgkuP5GG5AW3ctlJHUHK
 MQd+NqgAV67K2xmQPFwBrNACBF7r5NqDauiEscCJjlOXiaEWVbtJ47u49nFUgZZc
 YJlQNBVK0xhQkBfHlPNg+K5Phuiv+7gEmplOffDmvyrlWsxVxraqxpsEeXcFCNUh
 W8FDFVK0Rgw8fFnJchOB7hnFnj56JotpR+924nQxVy1HNKrOeCntaAi6N1I8PAxH
 GP4tZODflqWQ85apC4F2ev9MvGP7xiDsSf60Pcjho2XcgAnkxodP281XPv/tS9HV
 cDms68sOF4y/9xMucNgUKEqf3xYIOMk2JQ0KrflMtsv5v8miEL9BYmEGnd6zi4zL
 LBSjslhuYVd426Hg5Z23n/xst2tCP0Qs6Z2cYb9sv6vU/SW9VnESm2cL/3sbnAf8
 2y8zlhZUb+mi8w+I5exsS/pNf1AaNfBsz2rE+gENLMrPisqmi3kmwMSAEC+ORPFC
 Ld7rYWGJUyLUgNomJ0tBoRFv/GBFCmxz6IABr0UZ9I8G3y9N3NJ5MAMHGaf6dzx+
 YDGN388WZVBPjebSfShFsmGBbd9FkoQX5Csfwa7BtkmYhsAFWwh2sB8rz6y0Xc0h
 jKoY1cMsjTdof4Pm/OfsvDsuac+1/vXnNFR8W1Z5wDY3W1uj2OY=
 =pnuh
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2020-03-26' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
We have the following fixes:
 * drop data packets if there's no key for them anymore, after
   there had been one, to avoid sending them in clear when
   hostapd removes the key before it removes the station and
   the packets are still queued
 * check port authorization again after dequeue, to avoid
   sending packets if the station is no longer authorized
 * actually remove the authorization flag before the key so
   packets are also dropped properly because of this
 * fix nl80211 control port packet tagging to handle them as
   packets allowed to go out without encryption
 * fix NL80211_ATTR_CHANNEL_WIDTH outgoing netlink attribute
   width (should be 32 bits, not 8)
 * don't WARN in a CSA scenario that happens on some APs
 * fix HE spatial reuse element size calculation
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 12:03:02 -07:00
David S. Miller
14340219b8 mlx5-updates-2020-03-25
1) Cleanups from Dan Carpenter and wenxu.
 
 2) Paul and Roi, Some minor updates and fixes to E-Switch to address
 issues introduced in the previous reg_c0 updates series.
 
 3) Eli Cohen simplifies and improves flow steering matching group searches
 and flow table entries version management.
 
 4) Parav Pandit, improves devlink eswitch mode changes thread safety.
 By making devlink rely on driver for thread safety and introducing mlx5
 eswitch mode change protection.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAl58SW8ACgkQSD+KveBX
 +j4AxQf8DdrFrBD0NFTcAILS4bnTJC0I3xKRPb/2oYtWLVyJ9G5XAZqHC0DAG7xs
 jy8xhIFbeUxgLEdcx0la5vdR1mPlzs4XBHTe99YwzwK/jojrA7YXrlb3kv+RXWVY
 uNVAby68wh4EnO61R51ahIBXLPNbiYpo/wAWKvvBKRkOcYMVTKIFiP157AnJWObY
 fxnt06I0NFaIX8Va4MEqkrmUYrI4jJcqOJC9FwRBLDhFHcFkLh0Gav3vJJ7M4BCB
 ggPJpuZ4pu43qX9TtSOm8V/GlWWN0RB7PdbvliFBEHYG21hf9MfE8bPPKRlB7CO+
 B5+9ULhpvbjX7yRrkZ7fd4zlQ1siew==
 =Flln
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2020-03-25' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2020-03-25

1) Cleanups from Dan Carpenter and wenxu.

2) Paul and Roi, Some minor updates and fixes to E-Switch to address
issues introduced in the previous reg_c0 updates series.

3) Eli Cohen simplifies and improves flow steering matching group searches
and flow table entries version management.

4) Parav Pandit, improves devlink eswitch mode changes thread safety.
By making devlink rely on driver for thread safety and introducing mlx5
eswitch mode change protection.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 11:38:48 -07:00
Hoang Le
8b1e5b0a99 tipc: Add a missing case of TIPC_DIRECT_MSG type
In the commit f73b12812a
("tipc: improve throughput between nodes in netns"), we're missing a check
to handle TIPC_DIRECT_MSG type, it's still using old sending mechanism for
this message type. So, throughput improvement is not significant as
expected.

Besides that, when sending a large message with that type, we're also
handle wrong receiving queue, it should be enqueued in socket receiving
instead of multicast messages.

Fix this by adding the missing case for TIPC_DIRECT_MSG.

Fixes: f73b12812a ("tipc: improve throughput between nodes in netns")
Reported-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 11:21:02 -07:00
Johannes Berg
b95d2ccd2c mac80211: set IEEE80211_TX_CTRL_PORT_CTRL_PROTO for nl80211 TX
When a frame is transmitted via the nl80211 TX rather than as a
normal frame, IEEE80211_TX_CTRL_PORT_CTRL_PROTO wasn't set and
this will lead to wrong decisions (rate control etc.) being made
about the frame; fix this.

Fixes: 9118064914 ("mac80211: Add support for tx_control_port")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200326155333.f183f52b02f0.I4054e2a8c11c2ddcb795a0103c87be3538690243@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-26 15:54:12 +01:00
Johannes Berg
b16798f5b9 mac80211: mark station unauthorized before key removal
If a station is still marked as authorized, mark it as no longer
so before removing its keys. This allows frames transmitted to it
to be rejected, providing additional protection against leaking
plain text data during the disconnection flow.

Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20200326155133.ccb4fb0bb356.If48f0f0504efdcf16b8921f48c6d3bb2cb763c99@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-26 15:52:25 +01:00
Jouni Malinen
ce2e1ca703 mac80211: Check port authorization in the ieee80211_tx_dequeue() case
mac80211 used to check port authorization in the Data frame enqueue case
when going through start_xmit(). However, that authorization status may
change while the frame is waiting in a queue. Add a similar check in the
dequeue case to avoid sending previously accepted frames after
authorization change. This provides additional protection against
potential leaking of frames after a station has been disconnected and
the keys for it are being removed.

Cc: stable@vger.kernel.org
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200326155133.ced84317ea29.I34d4c47cd8cc8a4042b38a76f16a601fbcbfd9b3@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-26 15:52:14 +01:00
Olga Kornievskaia
df513a7711 SUNRPC: fix krb5p mount to provide large enough buffer in rq_rcvsize
Ever since commit 2c94b8eca1 ("SUNRPC: Use au_rslack when computing
reply buffer size"). It changed how "req->rq_rcvsize" is calculated. It
used to use au_cslack value which was nice and large and changed it to
au_rslack value which turns out to be too small.

Since 5.1, v3 mount with sec=krb5p fails against an Ontap server
because client's receive buffer it too small.

For gss krb5p, we need to account for the mic token in the verifier,
and the wrap token in the wrap token.

RFC 4121 defines:
mic token
Octet no   Name        Description
         --------------------------------------------------------------
         0..1     TOK_ID     Identification field.  Tokens emitted by
                             GSS_GetMIC() contain the hex value 04 04
                             expressed in big-endian order in this
                             field.
         2        Flags      Attributes field, as described in section
                             4.2.2.
         3..7     Filler     Contains five octets of hex value FF.
         8..15    SND_SEQ    Sequence number field in clear text,
                             expressed in big-endian order.
         16..last SGN_CKSUM  Checksum of the "to-be-signed" data and
                             octet 0..15, as described in section 4.2.4.

that's 16bytes (GSS_KRB5_TOK_HDR_LEN) + chksum

wrap token
Octet no   Name        Description
         --------------------------------------------------------------
          0..1     TOK_ID    Identification field.  Tokens emitted by
                             GSS_Wrap() contain the hex value 05 04
                             expressed in big-endian order in this
                             field.
          2        Flags     Attributes field, as described in section
                             4.2.2.
          3        Filler    Contains the hex value FF.
          4..5     EC        Contains the "extra count" field, in big-
                             endian order as described in section 4.2.3.
          6..7     RRC       Contains the "right rotation count" in big-
                             endian order, as described in section
                             4.2.5.
          8..15    SND_SEQ   Sequence number field in clear text,
                             expressed in big-endian order.
          16..last Data      Encrypted data for Wrap tokens with
                             confidentiality, or plaintext data followed
                             by the checksum for Wrap tokens without
                             confidentiality, as described in section
                             4.2.4.

Also 16bytes of header (GSS_KRB5_TOK_HDR_LEN), encrypted data, and cksum
(other things like padding)

RFC 3961 defines known cksum sizes:
Checksum type              sumtype        checksum         section or
                                value            size         reference
   ---------------------------------------------------------------------
   CRC32                            1               4           6.1.3
   rsa-md4                          2              16           6.1.2
   rsa-md4-des                      3              24           6.2.5
   des-mac                          4              16           6.2.7
   des-mac-k                        5               8           6.2.8
   rsa-md4-des-k                    6              16           6.2.6
   rsa-md5                          7              16           6.1.1
   rsa-md5-des                      8              24           6.2.4
   rsa-md5-des3                     9              24             ??
   sha1 (unkeyed)                  10              20             ??
   hmac-sha1-des3-kd               12              20            6.3
   hmac-sha1-des3                  13              20             ??
   sha1 (unkeyed)                  14              20             ??
   hmac-sha1-96-aes128             15              20         [KRB5-AES]
   hmac-sha1-96-aes256             16              20         [KRB5-AES]
   [reserved]                  0x8003               ?         [GSS-KRB5]

Linux kernel now mainly supports type 15,16 so max cksum size is 20bytes.
(GSS_KRB5_MAX_CKSUM_LEN)

Re-use already existing define of GSS_KRB5_MAX_SLACK_NEEDED that's used
for encoding the gss_wrap tokens (same tokens are used in reply).

Fixes: 2c94b8eca1 ("SUNRPC: Use au_rslack when computing reply buffer size")
Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-26 10:51:01 -04:00
Ilan Peer
05dcb8bb25 cfg80211: Do not warn on same channel at the end of CSA
When cfg80211_update_assoc_bss_entry() is called, there is a
verification that the BSS channel actually changed. As some APs use
CSA also for bandwidth changes, this would result with a kernel
warning.

Fix this by removing the WARN_ON().

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200326150855.96316ada0e8d.I6710376b1b4257e5f4712fc7ab16e2b638d512aa@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-26 15:50:10 +01:00
Johannes Berg
a0761a3017 mac80211: drop data frames without key on encrypted links
If we know that we have an encrypted link (based on having had
a key configured for TX in the past) then drop all data frames
in the key selection handler if there's no key anymore.

This fixes an issue with mac80211 internal TXQs - there we can
buffer frames for an encrypted link, but then if the key is no
longer there when they're dequeued, the frames are sent without
encryption. This happens if a station is disconnected while the
frames are still on the TXQ.

Detecting that a link should be encrypted based on a first key
having been configured for TX is fine as there are no use cases
for a connection going from with encryption to no encryption.
With extended key IDs, however, there is a case of having a key
configured for only decryption, so we can't just trigger this
behaviour on a key being configured.

Cc: stable@vger.kernel.org
Reported-by: Jouni Malinen <j@w1.fi>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200326150855.6865c7f28a14.I9fb1d911b064262d33e33dfba730cdeef83926ca@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-26 15:49:24 +01:00
Xin Long
308491755f xfrm: add prep for esp beet mode offload
Like __xfrm_transport/mode_tunnel_prep(), this patch is to add
__xfrm_mode_beet_prep() to fix the transport_header for gso
segments, and reset skb mac_len, and pull skb data to the
proto inside esp.

This patch also fixes a panic, reported by ltp:

  # modprobe esp4_offload
  # runltp -f net_stress.ipsec_tcp

  [ 2452.780511] kernel BUG at net/core/skbuff.c:109!
  [ 2452.799851] Call Trace:
  [ 2452.800298]  <IRQ>
  [ 2452.800705]  skb_push.cold.98+0x14/0x20
  [ 2452.801396]  esp_xmit+0x17b/0x270 [esp4_offload]
  [ 2452.802799]  validate_xmit_xfrm+0x22f/0x2e0
  [ 2452.804285]  __dev_queue_xmit+0x589/0x910
  [ 2452.806264]  __neigh_update+0x3d7/0xa50
  [ 2452.806958]  arp_process+0x259/0x810
  [ 2452.807589]  arp_rcv+0x18a/0x1c

It was caused by the skb going to esp_xmit with a wrong transport
header.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-03-26 14:51:07 +01:00
Xin Long
7f9e40eb18 esp6: add gso_segment for esp6 beet mode
Similar to xfrm6_tunnel/transport_gso_segment(), _gso_segment()
is added to do gso_segment for esp6 beet mode. Before calling
inet6_offloads[proto]->callbacks.gso_segment, it needs to do:

  - Get the upper proto from ph header to get its gso_segment
    when xo->proto is IPPROTO_BEETPH.

  - Add SKB_GSO_TCPV6 to gso_type if x->sel.family != AF_INET6
    and the proto == IPPROTO_TCP, so that the current tcp ipv6
    packet can be segmented.

  - Calculate a right value for skb->transport_header and move
    skb->data to the transport header position.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-03-26 14:51:07 +01:00
Xin Long
384a46ea7b esp4: add gso_segment for esp4 beet mode
Similar to xfrm4_tunnel/transport_gso_segment(), _gso_segment()
is added to do gso_segment for esp4 beet mode. Before calling
inet_offloads[proto]->callbacks.gso_segment, it needs to do:

  - Get the upper proto from ph header to get its gso_segment
    when xo->proto is IPPROTO_BEETPH.

  - Add SKB_GSO_TCPV4 to gso_type if x->sel.family == AF_INET6
    and the proto == IPPROTO_TCP, so that the current tcp ipv4
    packet can be segmented.

  - Calculate a right value for skb->transport_header and move
    skb->data to the transport header position.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-03-26 14:51:07 +01:00
Parav Pandit
98fed6eb9b devlink: Rely on driver eswitch thread safety instead of devlink
devlink_nl_cmd_eswitch_set_doit() doesn't hold devlink->lock mutex while
invoking driver callback. This is likely due to eswitch mode setting
involves adding/remove devlink ports, health reporters or
other devlink objects for a devlink device.

So it is driver responsiblity to ensure thread safe eswitch state
transition happening via either sriov legacy enablement or via devlink
eswitch set callback.

Therefore, get() callback should also be invoked without holding
devlink->lock mutex.
Vendor driver can use same internal lock which it uses during eswitch
mode set() callback.
This makes get() and set() implimentation symmetric in devlink core and
in vendor drivers.

Hence, remove holding devlink->lock mutex during eswitch get() callback.

Failing to do so results into below deadlock scenario when mlx5_core
driver is improved to handle eswitch mode set critical section invoked
by devlink and sriov sysfs interface in subsequent patch.

devlink_nl_cmd_eswitch_set_doit()
   mlx5_eswitch_mode_set()
     mutex_lock(esw->mode_lock) <- Lock A
     [...]
     register_devlink_port()
       mutex_lock(&devlink->lock); <- lock B

mutex_lock(&devlink->lock); <- lock B
devlink_nl_cmd_eswitch_get_doit()
   mlx5_eswitch_mode_get()
   mutex_lock(esw->mode_lock) <- Lock A

In subsequent patch, mlx5_core driver uses its internal lock during
get() and set() eswitch callbacks.

Other drivers have been inspected which returns either constant during
get operations or reads the value from already allocated structure.
Hence it is safe to remove the lock in get( ) callback and let vendor
driver handle it.

Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Mark Bloch <markb@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-03-25 23:19:18 -07:00
David S. Miller
9fb16955fb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Overlapping header include additions in macsec.c

A bug fix in 'net' overlapping with the removal of 'version'
string in ena_netdev.c

Overlapping test additions in selftests Makefile

Overlapping PCI ID table adjustments in iwlwifi driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-25 18:58:11 -07:00
Luiz Augusto von Dentz
da49b602f7 Bluetooth: L2CAP: Use DEFER_SETUP to group ECRED connections
This uses the DEFER_SETUP flag to group channels with
L2CAP_CREDIT_BASED_CONNECTION_REQ.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-25 22:16:08 +01:00
Linus Torvalds
1b649e0bca Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix deadlock in bpf_send_signal() from Yonghong Song.

 2) Fix off by one in kTLS offload of mlx5, from Tariq Toukan.

 3) Add missing locking in iwlwifi mvm code, from Avraham Stern.

 4) Fix MSG_WAITALL handling in rxrpc, from David Howells.

 5) Need to hold RTNL mutex in tcindex_partial_destroy_work(), from Cong
    Wang.

 6) Fix producer race condition in AF_PACKET, from Willem de Bruijn.

 7) cls_route removes the wrong filter during change operations, from
    Cong Wang.

 8) Reject unrecognized request flags in ethtool netlink code, from
    Michal Kubecek.

 9) Need to keep MAC in reset until PHY is up in bcmgenet driver, from
    Doug Berger.

10) Don't leak ct zone template in act_ct during replace, from Paul
    Blakey.

11) Fix flushing of offloaded netfilter flowtable flows, also from Paul
    Blakey.

12) Fix throughput drop during tx backpressure in cxgb4, from Rahul
    Lakkireddy.

13) Don't let a non-NULL skb->dev leave the TCP stack, from Eric
    Dumazet.

14) TCP_QUEUE_SEQ socket option has to update tp->copied_seq as well,
    also from Eric Dumazet.

15) Restrict macsec to ethernet devices, from Willem de Bruijn.

16) Fix reference leak in some ethtool *_SET handlers, from Michal
    Kubecek.

17) Fix accidental disabling of MSI for some r8169 chips, from Heiner
    Kallweit.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (138 commits)
  net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} build
  net: ena: Add PCI shutdown handler to allow safe kexec
  selftests/net/forwarding: define libs as TEST_PROGS_EXTENDED
  selftests/net: add missing tests to Makefile
  r8169: re-enable MSI on RTL8168c
  net: phy: mdio-bcm-unimac: Fix clock handling
  cxgb4/ptp: pass the sign of offset delta in FW CMD
  net: dsa: tag_8021q: replace dsa_8021q_remove_header with __skb_vlan_pop
  net: cbs: Fix software cbs to consider packet sending time
  net/mlx5e: Do not recover from a non-fatal syndrome
  net/mlx5e: Fix ICOSQ recovery flow with Striding RQ
  net/mlx5e: Fix missing reset of SW metadata in Striding RQ reset
  net/mlx5e: Enhance ICOSQ WQE info fields
  net/mlx5_core: Set IB capability mask1 to fix ib_srpt connection failure
  selftests: netfilter: add nfqueue test case
  netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress
  netfilter: nft_fwd_netdev: validate family and chain type
  netfilter: nft_set_rbtree: Detect partial overlaps on insertion
  netfilter: nft_set_rbtree: Introduce and use nft_rbtree_interval_start()
  netfilter: nft_set_pipapo: Separate partial and complete overlap cases on insertion
  ...
2020-03-25 13:58:05 -07:00
Pablo Neira Ayuso
2c64605b59 net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} build
net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’:
    net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’
      pkt->skb->tc_redirected = 1;
              ^~
    net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’
      pkt->skb->tc_from_ingress = 1;
              ^~

To avoid a direct dependency with tc actions from netfilter, wrap the
redirect bits around CONFIG_NET_REDIRECT and move helpers to
include/linux/skbuff.h. Turn on this toggle from the ifb driver, the
only existing client of these bits in the tree.

This patch adds skb_set_redirected() that sets on the redirected bit
on the skbuff, it specifies if the packet was redirect from ingress
and resets the timestamp (timestamp reset was originally missing in the
netfilter bugfix).

Fixes: bcfabee1af ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress")
Reported-by: noreply@ellerman.id.au
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-25 12:24:33 -07:00
Eric Dumazet
29f3490ba9 net: use indirect call wrappers for skb_copy_datagram_iter()
TCP recvmsg() calls skb_copy_datagram_iter(), which
calls an indirect function (cb pointing to simple_copy_to_iter())
for every MSS (fragment) present in the skb.

CONFIG_RETPOLINE=y forces a very expensive operation
that we can avoid thanks to indirect call wrappers.

This patch gives a 13% increase of performance on
a single flow, if the bottleneck is the thread reading
the TCP socket.

Fixes: 950fcaecd5 ("datagram: consolidate datagram copy to iter helpers")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-25 11:30:40 -07:00
Alain Michaud
32b50729d9 Bluetooth: don't assume key size is 16 when the command fails
With this change, the encryption key size is not assumed to be 16 if the
read_encryption_key_size command fails for any reason.  This ensures
that if the controller fails the command for any reason that the
encryption key size isn't implicitely set to 16 and instead take a more
concervative posture to assume it is 0.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-25 15:50:35 +01:00
Masahiro Yamada
d198b34f38 .gitignore: add SPDX License Identifier
Add SPDX License Identifier to all .gitignore files.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2020-03-25 11:50:48 +01:00
Johannes Berg
0016d32017 nl80211: fix NL80211_ATTR_CHANNEL_WIDTH attribute type
The new opmode notification used this attribute with a u8, when
it's documented as a u32 and indeed used in userspace as such,
it just happens to work on little-endian systems since userspace
isn't doing any strict size validation, and the u8 goes into the
lower byte. Fix this.

Cc: stable@vger.kernel.org
Fixes: 466b9936bf ("cfg80211: Add support to notify station's opmode change to userspace")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200325090531.be124f0a11c7.Iedbf4e197a85471ebd729b186d5365c0343bf7a8@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-25 09:58:43 +01:00
David S. Miller
6f000f9878 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) A new selftest for nf_queue, from Florian Westphal. This test
   covers two recent fixes: 07f8e4d0fd ("tcp: also NULL skb->dev
   when copy was needed") and b738a185be ("tcp: ensure skb->dev is
   NULL before leaving TCP stack").

2) The fwd action breaks with ifb. For safety in next extensions,
   make sure the fwd action only runs from ingress until it is extended
   to be used from a different hook.

3) The pipapo set type now reports EEXIST in case of subrange overlaps.
   Update the rbtree set to validate range overlaps, so far this
   validation is only done only from userspace. From Stefano Brivio.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-24 17:30:40 -07:00
Vladyslav Tarasiuk
9d648fb560 ethtool: fix incorrect tx-checksumming settings reporting
Currently, ethtool feature mask for checksum command is ORed with
NETIF_F_FCOE_CRC_BIT, which is bit's position number, instead of the
actual feature bit - NETIF_F_FCOE_CRC.

The invalid bitmask here might affect unrelated features when toggling
TX checksumming. For example, TX checksumming is always mistakenly
reported as enabled on the netdevs tested (mlx5, virtio_net).

Fixes: f70bb06563 ("ethtool: update mapping of features to legacy ioctl requests")
Signed-off-by: Vladyslav Tarasiuk <vladyslavt@mellanox.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-24 16:25:11 -07:00
Vladimir Oltean
e80f40cbe4 net: dsa: tag_8021q: replace dsa_8021q_remove_header with __skb_vlan_pop
Not only did this wheel did not need reinventing, but there is also
an issue with it: It doesn't remove the VLAN header in a way that
preserves the L2 payload checksum when that is being provided by the DSA
master hw.  It should recalculate checksum both for the push, before
removing the header, and for the pull afterwards. But the current
implementation is quite dizzying, with pulls followed immediately
afterwards by pushes, the memmove is done before the push, etc.  This
makes a DSA master with RX checksumming offload to print stack traces
with the infamous 'hw csum failure' message.

So remove the dsa_8021q_remove_header function and replace it with
something that actually works with inet checksumming.

Fixes: d461933638 ("net: dsa: tag_8021q: Create helper function for removing VLAN header")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-24 16:19:01 -07:00
Zh-yuan Ye
961d0e5b32 net: cbs: Fix software cbs to consider packet sending time
Currently the software CBS does not consider the packet sending time
when depleting the credits. It caused the throughput to be
Idleslope[kbps] * (Port transmit rate[kbps] / |Sendslope[kbps]|) where
Idleslope * (Port transmit rate / (Idleslope + |Sendslope|)) = Idleslope
is expected. In order to fix the issue above, this patch takes the time
when the packet sending completes into account by moving the anchor time
variable "last" ahead to the send completion time upon transmission and
adding wait when the next dequeue request comes before the send
completion time of the previous packet.

changelog:
V2->V3:
 - remove unnecessary whitespace cleanup
 - add the checks if port_rate is 0 before division

V1->V2:
 - combine variable "send_completed" into "last"
 - add the comment for estimate of the packet sending

Fixes: 585d763af0 ("net/sched: Introduce Credit Based Shaper (CBS) qdisc")
Signed-off-by: Zh-yuan Ye <ye.zh-yuan@socionext.com>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-24 16:14:05 -07:00
Pablo Neira Ayuso
bcfabee1af netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress
Set skb->tc_redirected to 1, otherwise the ifb driver drops the packet.
Set skb->tc_from_ingress to 1 to reinject the packet back to the ingress
path after leaving the ifb egress path.

This patch inconditionally sets on these two skb fields that are
meaningful to the ifb driver. The existing forward action is guaranteed
to run from ingress path.

Fixes: 39e6dea28a ("netfilter: nf_tables: add forward expression to the netdev family")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-24 19:59:39 +01:00
Pablo Neira Ayuso
76a109fac2 netfilter: nft_fwd_netdev: validate family and chain type
Make sure the forward action is only used from ingress.

Fixes: 39e6dea28a ("netfilter: nf_tables: add forward expression to the netdev family")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-24 19:59:38 +01:00
Stefano Brivio
7c84d41416 netfilter: nft_set_rbtree: Detect partial overlaps on insertion
...and return -ENOTEMPTY to the front-end in this case, instead of
proceeding. Currently, nft takes care of checking for these cases
and not sending them to the kernel, but if we drop the set_overlap()
call in nft we can end up in situations like:

 # nft add table t
 # nft add set t s '{ type inet_service ; flags interval ; }'
 # nft add element t s '{ 1 - 5 }'
 # nft add element t s '{ 6 - 10 }'
 # nft add element t s '{ 4 - 7 }'
 # nft list set t s
 table ip t {
 	set s {
 		type inet_service
 		flags interval
 		elements = { 1-3, 4-5, 6-7 }
 	}
 }

This change has the primary purpose of making the behaviour
consistent with nft_set_pipapo, but is also functional to avoid
inconsistent behaviour if userspace sends overlapping elements for
any reason.

v2: When we meet the same key data in the tree, as start element while
    inserting an end element, or as end element while inserting a start
    element, actually check that the existing element is active, before
    resetting the overlap flag (Pablo Neira Ayuso)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-24 19:59:37 +01:00
Stefano Brivio
6f7c9caf01 netfilter: nft_set_rbtree: Introduce and use nft_rbtree_interval_start()
Replace negations of nft_rbtree_interval_end() with a new helper,
nft_rbtree_interval_start(), wherever this helps to visualise the
problem at hand, that is, for all the occurrences except for the
comparison against given flags in __nft_rbtree_get().

This gets especially useful in the next patch.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-24 19:59:30 +01:00
Stefano Brivio
0eb4b5ee33 netfilter: nft_set_pipapo: Separate partial and complete overlap cases on insertion
...and return -ENOTEMPTY to the front-end on collision, -EEXIST if
an identical element already exists. Together with the previous patch,
element collision will now be returned to the user as -EEXIST.

Reported-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-24 19:59:08 +01:00
Pablo Neira Ayuso
8c2d45b2b6 netfilter: nf_tables: Allow set back-ends to report partial overlaps on insertion
Currently, the -EEXIST return code of ->insert() callbacks is ambiguous: it
might indicate that a given element (including intervals) already exists as
such, or that the new element would clash with existing ones.

If identical elements already exist, the front-end is ignoring this without
returning error, in case NLM_F_EXCL is not set. However, if the new element
can't be inserted due an overlap, we should report this to the user.

To this purpose, allow set back-ends to return -ENOTEMPTY on collision with
existing elements, translate that to -EEXIST, and return that to userspace,
no matter if NLM_F_EXCL was set.

Reported-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-24 19:58:57 +01:00
Luiz Augusto von Dentz
b48596d1dc Bluetooth: L2CAP: Add get_peer_pid callback
This adds a callback to read the socket pid.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-24 09:48:00 +01:00
YueHaibing
4c59406ed0 xfrm: policy: Fix doulbe free in xfrm_policy_timer
After xfrm_add_policy add a policy, its ref is 2, then

                             xfrm_policy_timer
                               read_lock
                               xp->walk.dead is 0
                               ....
                               mod_timer()
xfrm_policy_kill
  policy->walk.dead = 1
  ....
  del_timer(&policy->timer)
    xfrm_pol_put //ref is 1
  xfrm_pol_put  //ref is 0
    xfrm_policy_destroy
      call_rcu
                                 xfrm_pol_hold //ref is 1
                               read_unlock
                               xfrm_pol_put //ref is 0
                                 xfrm_policy_destroy
                                  call_rcu

xfrm_policy_destroy is called twice, which may leads to
double free.

Call Trace:
RIP: 0010:refcount_warn_saturate+0x161/0x210
...
 xfrm_policy_timer+0x522/0x600
 call_timer_fn+0x1b3/0x5e0
 ? __xfrm_decode_session+0x2990/0x2990
 ? msleep+0xb0/0xb0
 ? _raw_spin_unlock_irq+0x24/0x40
 ? __xfrm_decode_session+0x2990/0x2990
 ? __xfrm_decode_session+0x2990/0x2990
 run_timer_softirq+0x5c5/0x10e0

Fix this by use write_lock_bh in xfrm_policy_kill.

Fixes: ea2dea9dac ("xfrm: remove policy lock when accessing policy->walk.dead")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Timo Teräs <timo.teras@iki.fi>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-03-24 06:56:54 +01:00
David Laight
af13b3c338 Remove DST_HOST
Previous changes to the IP routing code have removed all the
tests for the DS_HOST route flag.
Remove the flags and all the code that sets it.

Signed-off-by: David Laight <david.laight@aculab.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 21:57:44 -07:00
Michal Kubecek
2f599ec422 ethtool: fix reference leak in some *_SET handlers
Andrew noticed that some handlers for *_SET commands leak a netdev
reference if required ethtool_ops callbacks do not exist. A simple
reproducer would be e.g.

  ip link add veth1 type veth peer name veth2
  ethtool -s veth1 wol g
  ip link del veth1

Make sure dev_put() is called when ethtool_ops check fails.

v2: add Fixes tags

Fixes: a53f3d41e4 ("ethtool: set link settings with LINKINFO_SET request")
Fixes: bfbcfe2032 ("ethtool: set link modes related data with LINKMODES_SET request")
Fixes: e54d04e3af ("ethtool: set message mask with DEBUG_SET request")
Fixes: 8d425b19b3 ("ethtool: set wake-on-lan settings with WOL_SET request")
Reported-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 21:50:14 -07:00
Florian Fainelli
52015366e3 net: dsa: Implement flow dissection for tag_brcm.c
Provide a flow_dissect callback which returns the network offset and
where to find the skb protocol, given the tags structure a common
function works for both tagging formats that are supported.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 21:48:59 -07:00
Florian Fainelli
0e62f543be net: dsa: Fix duplicate frames flooded by learning
When both the switch and the bridge are learning about new addresses,
switch ports attached to the bridge would see duplicate ARP frames
because both entities would attempt to send them.

Fixes: 5037d532b8 ("net: dsa: add Broadcom tag RX/TX handler")
Reported-by: Maxime Bizon <mbizon@freebox.fr>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 21:44:45 -07:00
Ido Schimmel
107f167894 devlink: Only pass packet trap group identifier in trap structure
Packet trap groups are now explicitly registered by drivers and not
implicitly registered when the packet traps are registered. Therefore,
there is no need to encode entire group structure the trap is associated
with inside the trap structure.

Instead, only pass the group identifier. Refer to it as initial group
identifier, as future patches will allow user space to move traps
between groups.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 21:40:40 -07:00
Ido Schimmel
a09b37fc8b devlink: Stop reference counting packet trap groups
Now that drivers explicitly register their supported packet trap groups
there is no for devlink to create them on-demand and destroy them when
their reference count reaches zero.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 21:40:40 -07:00
Ido Schimmel
95ad9555b5 devlink: Add API to register packet trap groups
Currently, packet trap groups are implicitly registered by drivers upon
packet trap registration. When the traps are registered, each is
associated with a group and the group is created by devlink, if it does
not exist already.

This makes it difficult for drivers to pass additional attributes for
the groups.

Therefore, as a preparation for future patches that require passing
additional group attributes, add an API to explicitly register /
unregister these groups.

Next patches will convert existing drivers to use this API.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 21:40:40 -07:00
Qian Cai
dddeb30bfc ipv4: fix a RCU-list lock in inet_dump_fib()
There is a place,

inet_dump_fib()
  fib_table_dump
    fn_trie_dump_leaf()
      hlist_for_each_entry_rcu()

without rcu_read_lock() will trigger a warning,

 WARNING: suspicious RCU usage
 -----------------------------
 net/ipv4/fib_trie.c:2216 RCU-list traversed in non-reader section!!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 1 lock held by ip/1923:
  #0: ffffffff8ce76e40 (rtnl_mutex){+.+.}, at: netlink_dump+0xd6/0x840

 Call Trace:
  dump_stack+0xa1/0xea
  lockdep_rcu_suspicious+0x103/0x10d
  fn_trie_dump_leaf+0x581/0x590
  fib_table_dump+0x15f/0x220
  inet_dump_fib+0x4ad/0x5d0
  netlink_dump+0x350/0x840
  __netlink_dump_start+0x315/0x3e0
  rtnetlink_rcv_msg+0x4d1/0x720
  netlink_rcv_skb+0xf0/0x220
  rtnetlink_rcv+0x15/0x20
  netlink_unicast+0x306/0x460
  netlink_sendmsg+0x44b/0x770
  __sys_sendto+0x259/0x270
  __x64_sys_sendto+0x80/0xa0
  do_syscall_64+0x69/0xf4
  entry_SYSCALL_64_after_hwframe+0x49/0xb3

Fixes: 18a8021a7b ("net/ipv4: Plumb support for filtering route dumps")
Signed-off-by: Qian Cai <cai@lca.pw>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 20:59:40 -07:00
Jakub Kicinski
0dfb2d82af net: sched: rename more stats_types
Commit 53eca1f347 ("net: rename flow_action_hw_stats_types* ->
flow_action_hw_stats*") renamed just the flow action types and
helpers. For consistency rename variables, enums, struct members
and UAPI too (note that this UAPI was not in any official release,
yet).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 20:54:23 -07:00
Davide Caratti
c3c123d16c net: mptcp: don't hang in mptcp_sendmsg() after TCP fallback
it's still possible for packetdrill to hang in mptcp_sendmsg(), when the
MPTCP socket falls back to regular TCP (e.g. after receiving unsupported
flags/version during the three-way handshake). Adjust MPTCP socket state
earlier, to ensure correct functionality of mptcp_sendmsg() even in case
of TCP fallback.

Fixes: 767d3ded5f ("net: mptcp: don't hang before sending 'MP capable with data'")
Fixes: 1954b86016 ("mptcp: Check connection state before attempting send")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 20:53:25 -07:00
Eric Dumazet
6cd6cbf593 tcp: repair: fix TCP_QUEUE_SEQ implementation
When application uses TCP_QUEUE_SEQ socket option to
change tp->rcv_next, we must also update tp->copied_seq.

Otherwise, stuff relying on tcp_inq() being precise can
eventually be confused.

For example, tcp_zerocopy_receive() might crash because
it does not expect tcp_recv_skb() to return NULL.

We could add tests in various places to fix the issue,
or simply make sure tcp_inq() wont return a random value,
and leave fast path as it is.

Note that this fixes ioctl(fd, SIOCINQ, &val) at the same
time.

Fixes: ee9952831c ("tcp: Initial repair mode")
Fixes: 05255b823a ("tcp: add TCP_ZEROCOPY_RECEIVE support for zerocopy receive")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 13:02:01 -07:00
Yadu Kishore
1454c9fa91 net: Make skb_segment not to compute checksum if network controller supports checksumming
Problem:
TCP checksum in the output path is not being offloaded during GSO
in the following case:
The network driver does not support scatter-gather but supports
checksum offload with NETIF_F_HW_CSUM.

Cause:
skb_segment calls skb_copy_and_csum_bits if the network driver
does not announce NETIF_F_SG. It does not check if the driver
supports NETIF_F_HW_CSUM.
So for devices which might want to offload checksum but do not support SG
there is currently no way to do so if GSO is enabled.

Solution:
In skb_segment check if the network controller does checksum and if so
call skb_copy_bits instead of skb_copy_and_csum_bits.

Testing:
Without the patch, ran iperf TCP traffic with NETIF_F_HW_CSUM enabled
in the network driver. Observed the TCP checksum offload is not happening
since the skbs received by the driver in the output path have
skb->ip_summed set to CHECKSUM_NONE.

With the patch ran iperf TCP traffic and observed that TCP checksum
is being offloaded with skb->ip_summed set to CHECKSUM_PARTIAL.
Also tested with the patch by disabling NETIF_F_HW_CSUM in the driver
to cover the newly introduced if-else code path in skb_segment.

Link: https://lore.kernel.org/netdev/CA+FuTSeYGYr3Umij+Mezk9CUcaxYwqEe5sPSuXF8jPE2yMFJAw@mail.gmail.com
Signed-off-by: Yadu Kishore <kyk.segfault@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 12:59:53 -07:00
Martin KaFai Lau
ab14fd4ee8 bpf: Add bpf_sk_storage support to bpf_tcp_ca
This patch adds bpf_sk_storage_get() and bpf_sk_storage_delete()
helper to the bpf_tcp_ca's struct_ops.  That would allow
bpf-tcp-cc to:
1) share sk private data with other bpf progs.
2) use bpf_sk_storage as a private storage for a bpf-tcp-cc
   if the existing icsk_ca_priv is not big enough.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200320152101.2169498-1-kafai@fb.com
2020-03-23 20:51:55 +01:00
Abhishek Pandit-Subedi
2d186fcd6d Bluetooth: Fix incorrect branch in connection complete
When handling auto-connected devices, we should execute the rest of the
connection complete when it was previously discovered and it is an ACL
connection.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-23 18:48:50 +01:00
Abhishek Pandit-Subedi
8731840a34 Bluetooth: Restore running state if suspend fails
If Bluetooth fails to enter the suspended state correctly, restore the
state to running (re-enabling scans). PM_POST_SUSPEND is only sent to
notifiers that successfully return from PM_PREPARE_SUSPEND notification
so we should recover gracefully if it fails.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-23 18:48:50 +01:00
Ilya Dryomov
e886274031 libceph: fix alloc_msg_with_page_vector() memory leaks
Make it so that CEPH_MSG_DATA_PAGES data item can own pages,
fixing a bunch of memory leaks for a page vector allocated in
alloc_msg_with_page_vector().  Currently, only watch-notify
messages trigger this allocation, and normally the page vector
is freed either in handle_watch_notify() or by the caller of
ceph_osdc_notify().  But if the message is freed before that
(e.g. if the session faults while reading in the message or
if the notify is stale), we leak the page vector.

This was supposed to be fixed by switching to a message-owned
pagelist, but that never happened.

Fixes: 1907920324 ("libceph: support for sending notifies")
Reported-by: Roman Penyaev <rpenyaev@suse.de>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Roman Penyaev <rpenyaev@suse.de>
2020-03-23 13:07:08 +01:00
Ilya Dryomov
7614209736 ceph: check POOL_FLAG_FULL/NEARFULL in addition to OSDMAP_FULL/NEARFULL
CEPH_OSDMAP_FULL/NEARFULL aren't set since mimic, so we need to consult
per-pool flags as well.  Unfortunately the backwards compatibility here
is lacking:

- the change that deprecated OSDMAP_FULL/NEARFULL went into mimic, but
  was guarded by require_osd_release >= RELEASE_LUMINOUS
- it was subsequently backported to luminous in v12.2.2, but that makes
  no difference to clients that only check OSDMAP_FULL/NEARFULL because
  require_osd_release is not client-facing -- it is for OSDs

Since all kernels are affected, the best we can do here is just start
checking both map flags and pool flags and send that to stable.

These checks are best effort, so take osdc->lock and look up pool flags
just once.  Remove the FIXME, since filesystem quotas are checked above
and RADOS quotas are reflected in POOL_FLAG_FULL: when the pool reaches
its quota, both POOL_FLAG_FULL and POOL_FLAG_FULL_QUOTA are set.

Cc: stable@vger.kernel.org
Reported-by: Yanhu Cao <gmayyyha@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Acked-by: Sage Weil <sage@redhat.com>
2020-03-23 13:07:08 +01:00
YueHaibing
09984483db mptcp: Remove set but not used variable 'can_ack'
Fixes gcc '-Wunused-but-set-variable' warning:

net/mptcp/options.c: In function 'mptcp_established_options_dss':
net/mptcp/options.c:338:7: warning:
 variable 'can_ack' set but not used [-Wunused-but-set-variable]

commit dc093db5cc ("mptcp: drop unneeded checks")
leave behind this unused, remove it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-21 20:20:36 -07:00
Jakub Sitnicki
d5bee7374b net/tls: Annotate access to sk_prot with READ_ONCE/WRITE_ONCE
sockmap performs lockless writes to sk->sk_prot on the following paths:

tcp_bpf_{recvmsg|sendmsg} / sock_map_unref
  sk_psock_put
    sk_psock_drop
      sk_psock_restore_proto
        WRITE_ONCE(sk->sk_prot, proto)

To prevent load/store tearing [1], and to make tooling aware of intentional
shared access [2], we need to annotate other sites that access sk_prot with
READ_ONCE/WRITE_ONCE macros.

Change done with Coccinelle with following semantic patch:

@@
expression E;
identifier I;
struct sock *sk;
identifier sk_prot =~ "^sk_prot$";
@@
(
 E =
-sk->sk_prot
+READ_ONCE(sk->sk_prot)
|
-sk->sk_prot = E
+WRITE_ONCE(sk->sk_prot, E)
|
-sk->sk_prot
+READ_ONCE(sk->sk_prot)
 ->I
)

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-21 20:08:17 -07:00
Jakub Sitnicki
5bb4c45d46 net/tls: Read sk_prot once when building tls proto ops
Apart from being a "tremendous" win when it comes to generated machine
code (see bloat-o-meter output for x86-64 below) this mainly prepares
ground for annotating access to sk_prot with READ_ONCE, so that we don't
pepper the code with access annotations and needlessly repeat loads.

add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-46 (-46)
Function                                     old     new   delta
tls_init                                     851     805     -46
Total: Before=21063, After=21017, chg -0.22%

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-21 20:08:17 -07:00
Jakub Sitnicki
f13fe3e60c net/tls: Constify base proto ops used for building tls proto
The helper that builds kTLS proto ops doesn't need to and should not modify
the base proto ops. Annotate the parameter as read-only.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-21 20:08:17 -07:00
Taehee Yoo
3a303cfdd2 hsr: fix general protection fault in hsr_addr_is_self()
The port->hsr is used in the hsr_handle_frame(), which is a
callback of rx_handler.
hsr master and slaves are initialized in hsr_add_port().
This function initializes several pointers, which includes port->hsr after
registering rx_handler.
So, in the rx_handler routine, un-initialized pointer would be used.
In order to fix this, pointers should be initialized before
registering rx_handler.

Test commands:
    ip netns del left
    ip netns del right
    modprobe -rv veth
    modprobe -rv hsr
    killall ping
    modprobe hsr
    ip netns add left
    ip netns add right
    ip link add veth0 type veth peer name veth1
    ip link add veth2 type veth peer name veth3
    ip link add veth4 type veth peer name veth5
    ip link set veth1 netns left
    ip link set veth3 netns right
    ip link set veth4 netns left
    ip link set veth5 netns right
    ip link set veth0 up
    ip link set veth2 up
    ip link set veth0 address fc:00:00:00:00:01
    ip link set veth2 address fc:00:00:00:00:02
    ip netns exec left ip link set veth1 up
    ip netns exec left ip link set veth4 up
    ip netns exec right ip link set veth3 up
    ip netns exec right ip link set veth5 up
    ip link add hsr0 type hsr slave1 veth0 slave2 veth2
    ip a a 192.168.100.1/24 dev hsr0
    ip link set hsr0 up
    ip netns exec left ip link add hsr1 type hsr slave1 veth1 slave2 veth4
    ip netns exec left ip a a 192.168.100.2/24 dev hsr1
    ip netns exec left ip link set hsr1 up
    ip netns exec left ip n a 192.168.100.1 dev hsr1 lladdr \
	    fc:00:00:00:00:01 nud permanent
    ip netns exec left ip n r 192.168.100.1 dev hsr1 lladdr \
	    fc:00:00:00:00:01 nud permanent
    for i in {1..100}
    do
        ip netns exec left ping 192.168.100.1 &
    done
    ip netns exec left hping3 192.168.100.1 -2 --flood &
    ip netns exec right ip link add hsr2 type hsr slave1 veth3 slave2 veth5
    ip netns exec right ip a a 192.168.100.3/24 dev hsr2
    ip netns exec right ip link set hsr2 up
    ip netns exec right ip n a 192.168.100.1 dev hsr2 lladdr \
	    fc:00:00:00:00:02 nud permanent
    ip netns exec right ip n r 192.168.100.1 dev hsr2 lladdr \
	    fc:00:00:00:00:02 nud permanent
    for i in {1..100}
    do
        ip netns exec right ping 192.168.100.1 &
    done
    ip netns exec right hping3 192.168.100.1 -2 --flood &
    while :
    do
        ip link add hsr0 type hsr slave1 veth0 slave2 veth2
	ip a a 192.168.100.1/24 dev hsr0
	ip link set hsr0 up
	ip link del hsr0
    done

Splat looks like:
[  120.954938][    C0] general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1]I
[  120.957761][    C0] KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
[  120.959064][    C0] CPU: 0 PID: 1511 Comm: hping3 Not tainted 5.6.0-rc5+ #460
[  120.960054][    C0] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  120.962261][    C0] RIP: 0010:hsr_addr_is_self+0x65/0x2a0 [hsr]
[  120.963149][    C0] Code: 44 24 18 70 73 2f c0 48 c1 eb 03 48 8d 04 13 c7 00 f1 f1 f1 f1 c7 40 04 00 f2 f2 f2 4
[  120.966277][    C0] RSP: 0018:ffff8880d9c09af0 EFLAGS: 00010206
[  120.967293][    C0] RAX: 0000000000000006 RBX: 1ffff1101b38135f RCX: 0000000000000000
[  120.968516][    C0] RDX: dffffc0000000000 RSI: ffff8880d17cb208 RDI: 0000000000000000
[  120.969718][    C0] RBP: 0000000000000030 R08: ffffed101b3c0e3c R09: 0000000000000001
[  120.972203][    C0] R10: 0000000000000001 R11: ffffed101b3c0e3b R12: 0000000000000000
[  120.973379][    C0] R13: ffff8880aaf80100 R14: ffff8880aaf800f2 R15: ffff8880aaf80040
[  120.974410][    C0] FS:  00007f58e693f740(0000) GS:ffff8880d9c00000(0000) knlGS:0000000000000000
[  120.979794][    C0] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  120.980773][    C0] CR2: 00007ffcb8b38f29 CR3: 00000000afe8e001 CR4: 00000000000606f0
[  120.981945][    C0] Call Trace:
[  120.982411][    C0]  <IRQ>
[  120.982848][    C0]  ? hsr_add_node+0x8c0/0x8c0 [hsr]
[  120.983522][    C0]  ? rcu_read_lock_held+0x90/0xa0
[  120.984159][    C0]  ? rcu_read_lock_sched_held+0xc0/0xc0
[  120.984944][    C0]  hsr_handle_frame+0x1db/0x4e0 [hsr]
[  120.985597][    C0]  ? hsr_nl_nodedown+0x2b0/0x2b0 [hsr]
[  120.986289][    C0]  __netif_receive_skb_core+0x6bf/0x3170
[  120.992513][    C0]  ? check_chain_key+0x236/0x5d0
[  120.993223][    C0]  ? do_xdp_generic+0x1460/0x1460
[  120.993875][    C0]  ? register_lock_class+0x14d0/0x14d0
[  120.994609][    C0]  ? __netif_receive_skb_one_core+0x8d/0x160
[  120.995377][    C0]  __netif_receive_skb_one_core+0x8d/0x160
[  120.996204][    C0]  ? __netif_receive_skb_core+0x3170/0x3170
[ ... ]

Reported-by: syzbot+fcf5dd39282ceb27108d@syzkaller.appspotmail.com
Fixes: c5a7591172 ("net/hsr: Use list_head (and rcu) instead of array for slave devices.")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-21 19:49:34 -07:00
Linus Torvalds
1ab7ea1f83 io_uring-5.6-20200320
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl51dbQQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpiV3EADJHB2r2hTTEym5u1PbrEEVkjvdL6InU8lD
 lFM7m2g6yZUncwm+aSZynHqAFY6Rd5Jk+gmYMuioi3ZxC2rs7jG1AOTpaeJYmhle
 lzkjqSLtl+gdPMA9ydivk1UwILFjtZKG1JNc++tnCn3q7+eCkgnWAlq5b7idG2eF
 BS0AEZP6Yz1zStTHLbHSB0StY8ovMIw0VaVQvguHLL9EBpbHmrs0cq3tipWkAyPR
 2YwnXbxsJySukkwmBKxEWrGUYDze56jqJIqdFsOE0+WtGV+nk7OScPseXAaP4/+G
 Vl23VNfryuZcsBUwI9tY1SzCFEXIwdXVGpCAYwQ/kU5WfvFpYaei+fXVNnL4kjR0
 PfpA6XnMsZ3DzqgepmUd92sAA56ZtBxuGjqcSYlg/JwjvUHdpaZDkE2WLqkAMeUN
 8A7cUw+R6XWQ2/y6ob7QvKiT/ZDR8GrYUl3EdGE3LhB1ZsvLXJDZpWipwQBzuk9R
 vJJOkGst38rjsWnb+nfeLh3AsgjF14wo+2vQL4mKs24xKTIvadHsFAZjKLXZ93Wf
 Vn58FaPOYIkjBidYLWb3dlO1ZR8S0803gohLkLV6adH8bCNCWxGTOR51DZLomAsb
 nAUCEAJaZrOqaQAuJAFNNpS8+/da3AIF4HVd2EdZ1yFXU15y0+zIxtROjKzg+OxO
 M3jC/Aet1Q==
 =IMcu
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-5.6-20200320' of git://git.kernel.dk/linux-block

Pull io_uring fixes from Jens Axboe:
 "Two different fixes in here:

   - Fix for a potential NULL pointer deref for links with async or
     drain marked (Pavel)

   - Fix for not properly checking RLIMIT_NOFILE for async punted
     operations.

     This affects openat/openat2, which were added this cycle, and
     accept4. I did a full audit of other cases where we might check
     current->signal->rlim[] and found only RLIMIT_FSIZE for buffered
     writes and fallocate. That one is fixed and queued for 5.7 and
     marked stable"

* tag 'io_uring-5.6-20200320' of git://git.kernel.dk/linux-block:
  io_uring: make sure accept honor rlimit nofile
  io_uring: make sure openat/openat2 honor rlimit nofile
  io_uring: NULL-deref for IOSQE_{ASYNC,DRAIN}
2020-03-21 11:54:47 -07:00
Florian Westphal
07f8e4d0fd tcp: also NULL skb->dev when copy was needed
In rare cases retransmit logic will make a full skb copy, which will not
trigger the zeroing added in recent change
b738a185be ("tcp: ensure skb->dev is NULL before leaving TCP stack").

Cc: Eric Dumazet <edumazet@google.com>
Fixes: 75c119afe1 ("tcp: implement rb-tree based retransmit queue")
Fixes: 28f8bfd1ac ("netfilter: Support iif matches in POSTROUTING")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-20 19:36:38 -07:00
David S. Miller
702151da6d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Refetch IP header pointer after pskb_may_pull() in flowtable,
   from Haishuang Yan.

2) Fix memleak in flowtable offload in nf_flow_table_free(),
   from Paul Blakey.

3) Set control.addr_type mask in flowtable offload, from Edward Cree.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-20 19:34:18 -07:00
David S. Miller
0d7043f355 Another set of changes:
* HE ranging (fine timing measurement) API support
  * hwsim gets virtio support, for use with wmediumd,
    to be able to simulate with multiple machines
  * eapol-over-nl80211 improvements to exclude preauth
  * IBSS reset support, to recover connections from
    userspace
  * and various others.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl50yM8ACgkQB8qZga/f
 l8Qe0w//Rt5vsqlkiuZtg1SHQcMHq5Y6jJ92Xfg6Yw8H3H1g40r6UWDeSsF7Dl9H
 YuC4VvfNH/FhnNz01FDVxjgjnUyjpAolxM6sYj3hW3ZIBahSBeovAdjTP59y1Fhi
 zLMP/zYLsqtlKO5qr0fK7tUmENiF/IxyALRcxqsmQ1ul5003pIpFRVPoQVDTjHZn
 JCzDTboQLmtyYiDYjejaeauo4Awjj4lDiyk3Fu+kXSv8weY4dl+w8/AM5zsiVNDs
 4HRDpJUBIwfwWuygMNwl0EQVhnpYaNQFQ+IEe9B29qiAEb95MPHhSudU1qaT+4d7
 CykikhH39/n4Wv1PBg4PrCfWInxAUiPz6K027lDOY5VtJFFbLg8GqQkn3Vy/WnHm
 CJuVaLdJ73k7AypvWdZUgSe9SaloymVd987IIMixmz0t0cqIE3kPYot//kXFjy6m
 rqvWJ3a7SIopPdkD4xXDwzASUmctCUedQ0QVnWYRQCUW5rnkZ96SHjblUA9IykYS
 Ia44BwB8ggZsax1KrqyGV9hgj8mh9srLmEjtRPhxVD3ncE1ewiQaBRwv56J0EKNu
 yRnV10dNMCtUyT3j9dRpHa8zuNpS/A6ShXe4aGgoXAz+ShjNOBj2GxXtolsDJzrd
 zIqZH8gcWaeV+8yWhaGx8d82gvY8PhRvJl+AMax7Ds2avOZ/pa0=
 =t3QF
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-net-next-2020-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Another set of changes:
 * HE ranging (fine timing measurement) API support
 * hwsim gets virtio support, for use with wmediumd,
   to be able to simulate with multiple machines
 * eapol-over-nl80211 improvements to exclude preauth
 * IBSS reset support, to recover connections from
   userspace
 * and various others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-20 08:57:38 -07:00
Nikolay Aleksandrov
c443758b21 net: bridge: vlan options: move the tunnel command to the nested attribute
Now that we have a nested tunnel info attribute we can add a separate
one for the tunnel command and require it explicitly from user-space. It
must be one of RTM_SETLINK/DELLINK. Only RTM_SETLINK requires a valid
tunnel id, DELLINK just removes it if it was set before. This allows us
to have all tunnel attributes and control in one place, thus removing
the need for an outside vlan info flag.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-20 08:52:20 -07:00
Nikolay Aleksandrov
fa388f29a9 net: bridge: vlan options: nest the tunnel id into a tunnel info attribute
While discussing the new API, Roopa mentioned that we'll be adding more
tunnel attributes and options in the future, so it's better to make it a
nested attribute, since this is still in net-next we can easily change it
and nest the tunnel id attribute under BRIDGE_VLANDB_ENTRY_TUNNEL_INFO.

The new format is:
 [BRIDGE_VLANDB_ENTRY]
     [BRIDGE_VLANDB_ENTRY_TUNNEL_INFO]
         [BRIDGE_VLANDB_TINFO_ID]

Any new tunnel attributes can be nested under
BRIDGE_VLANDB_ENTRY_TUNNEL_INFO.

Suggested-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-20 08:52:20 -07:00
YueHaibing
c0fd336ea4 bpf, tcp: Make tcp_bpf_recvmsg static
After commit f747632b60 ("bpf: sockmap: Move generic sockmap
hooks from BPF TCP"), tcp_bpf_recvmsg() is not used out of
tcp_bpf.c, so make it static and remove it from tcp.h. Also move
it to BPF_STREAM_PARSER #ifdef to fix unused function warnings.

Fixes: f747632b60 ("bpf: sockmap: Move generic sockmap hooks from BPF TCP")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200320023426.60684-3-yuehaibing@huawei.com
2020-03-20 15:56:55 +01:00
YueHaibing
a26527981a bpf, tcp: Fix unused function warnings
If BPF_STREAM_PARSER is not set, gcc warns:

  net/ipv4/tcp_bpf.c:483:12: warning: 'tcp_bpf_sendpage' defined but not used [-Wunused-function]
  net/ipv4/tcp_bpf.c:395:12: warning: 'tcp_bpf_sendmsg' defined but not used [-Wunused-function]
  net/ipv4/tcp_bpf.c:13:13: warning: 'tcp_bpf_stream_read' defined but not used [-Wunused-function]

Moves the unused functions into the #ifdef CONFIG_BPF_STREAM_PARSER.

Fixes: f747632b60 ("bpf: sockmap: Move generic sockmap hooks from BPF TCP")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200320023426.60684-2-yuehaibing@huawei.com
2020-03-20 15:56:25 +01:00
Jens Axboe
09952e3e78 io_uring: make sure accept honor rlimit nofile
Just like commit 4022e7af86, this fixes the fact that
IORING_OP_ACCEPT ends up using get_unused_fd_flags(), which checks
current->signal->rlim[] for limits.

Add an extra argument to __sys_accept4_file() that allows us to pass
in the proper nofile limit, and grab it at request prep time.

Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-20 08:48:36 -06:00
Yan-Hsuan Chuang
8fa180bb4a mac80211: driver can remain on channel if not using chan_ctx
Some of the drivers are not using channel context, but let the
stack to control/switch channels instead. For such cases, driver
can still remain on channel because the mac80211 stack actually
supports it.

The stack will check if the driver is using chan_ctx and has
ops->remain_on_channel been hooked. Otherwise it will start its
ROC work to remain on channel. So, even if the driver is not
using chan_ctx, the driver is still capable of doing remain on
channel.

Signed-off-by: Yan-Hsuan Chuang <yhchuang@realtek.com>
Link: https://lore.kernel.org/r/20200312074337.16198-1-yhchuang@realtek.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:21 +01:00
Johannes Berg
306b79ea6e nl80211: clarify code in nl80211_del_station()
The long if chain of interface types is hard to read,
especially now with the additional condition after it.
Use a switch statement to clarify this code.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200320113834.2c51b9e8e341.I3fa5dc3f7d3cb1dbbd77191d764586f7da993f3f@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:20 +01:00
Veerendranath Jakkam
7fc82af856 cfg80211: Configure PMK lifetime and reauth threshold for PMKSA entries
Drivers that trigger roaming need to know the lifetime of the configured
PMKSA for deciding whether to trigger the full or PMKSA cache based
authentication. The configured PMKSA is invalid after the PMK lifetime
has expired and must not be used after that and the STA needs to
disassociate if the PMK expires. Hence the STA is expected to refresh
the PMK with a full authentication before this happens (e.g., when
reassociating to a new BSS the next time or by performing EAPOL
reauthentication depending on the AKM) to avoid unnecessary
disconnection.

The PMK reauthentication threshold is the percentage of the PMK lifetime
value and indicates to the driver to trigger a full authentication roam
(without PMKSA caching) after the reauthentication threshold time, but
before the PMK timer has expired. Authentication methods like SAE need
to be able to generate a new PMKSA entry without having to force a
disconnection after this threshold timeout. If no roaming occurs between
the reauthentication threshold time and PMK lifetime expiration,
disassociation is still forced.

The new attributes for providing these values correspond to the dot11
MIB variables dot11RSNAConfigPMKLifetime and
dot11RSNAConfigPMKReauthThreshold.

This type of functionality is already available in cases where user
space component is in control of roaming. This commit extends that same
capability into cases where parts or all of this functionality is
offloaded to the driver.

Signed-off-by: Veerendranath Jakkam <vjakkam@codeaurora.org>
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200312235903.18462-1-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:20 +01:00
Seevalamuthu Mariappan
b255b72bc0 mac80211: Read rx_stats with perCPU pointers
Use perCPU pointers to get rx_stats in sta_set_sinfo
when RSS is enabled

Signed-off-by: Seevalamuthu Mariappan <seevalam@codeaurora.org>
Link: https://lore.kernel.org/r/1584526555-25960-1-git-send-email-seevalam@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:20 +01:00
Nicolas Cavallari
a916062a09 mac80211: Allow deleting stations in ibss mode to reset their state
Set the NL80211_EXT_FEATURE_DEL_IBSS_STA if the interface support IBSS
mode, so that stations can be reset from user space.

mac80211 already deletes stations by itself, so mac80211 drivers must
already support this.

This has been successfully tested with ath9k.

Signed-off-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr>
Link: https://lore.kernel.org/r/20200305135754.12094-2-cavallar@lri.fr
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:20 +01:00
Nicolas Cavallari
edafcf4259 cfg80211: Add support for userspace to reset stations in IBSS mode
Sometimes, userspace is able to detect that a peer silently lost its
state (like, if the peer reboots). wpa_supplicant does this for IBSS-RSN
by registering for auth/deauth frames, but when it detects this, it is
only able to remove the encryption keys of the peer and close its port.

However, the kernel also hold other state about the station, such as BA
sessions, probe response parameters and the like.  They also need to be
resetted correctly.

This patch adds the NL80211_EXT_FEATURE_DEL_IBSS_STA feature flag
indicating the driver accepts deleting stations in IBSS mode, which
should send a deauth and reset the state of the station, just like in
mesh point mode.

Signed-off-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr>
Link: https://lore.kernel.org/r/20200305135754.12094-1-cavallar@lri.fr
[preserve -EINVAL return]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:20 +01:00
Johannes Berg
660d81dae8 mac80211: consider WLAN_EID_EXT_HE_OPERATION for parsing CRC
We use the parsing CRC for checking if the beacon changed, and
if the WLAN_EID_EXT_HE_OPERATION extended element changes we
need to track it so we can react to that. Include it in the CRC
calculation.

Link: https://lore.kernel.org/r/20200131111300.891737-22-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:20 +01:00
Shaul Triebitz
03efb863bb mac80211: HE: set missing bss_conf fields in AP mode
In AP mode, set htc_trig_based_pkt_ext and frame_time_rts_th
for driver use.

Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-19-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:19 +01:00
Shaul Triebitz
7e8d6f12bb nl80211: pass HE operation element to the driver
Pass the AP's HE operation element to the driver.

Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-18-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:19 +01:00
Avraham Stern
efb5520d0e nl80211/cfg80211: add support for non EDCA based ranging measurement
Add support for requesting that the ranging measurement will use
the trigger-based / non trigger-based flow instead of the EDCA based
flow.

Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-2-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:19 +01:00
Johannes Berg
95247705c4 mac80211: don't leave skb->next/prev pointing to stack
In beacon protection, don't leave skb->next/prev pointing to the
on-stack list, even if that's actually harmless since we don't use
them again afterwards.

While at it, check that the SKB on the list is still the same, as
that's required here. If not, the encryption (protection) code is
buggy.

Fixes: 0a3a84360b ("mac80211: Beacon protection using the new BIGTK (AP)")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200320102021.1be7823fc05e.Ia89fb79a0469d32137c9a04315a1d2dfc7b7d6f5@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:19 +01:00
Markus Theil
7f3f96cedd mac80211: handle no-preauth flag for control port
This patch adds support for disabling pre-auth rx over the nl80211 control
port for mac80211.

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20200312091055.54257-3-markus.theil@tu-ilmenau.de
[fix indentation slightly, squash feature enablement]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:19 +01:00
Markus Theil
5631d96aa3 nl80211: add no pre-auth attribute and ext. feature flag for ctrl. port
If the nl80211 control port is used before this patch, pre-auth frames
(0x88c7) are send to userspace uncoditionally. While this enables userspace
to only use nl80211 on the station side, it is not always useful for APs.
Furthermore, pre-auth frames are ordinary data frames and not related to
the control port. Therefore it should for example be possible for pre-auth
frames to be bridged onto a wired network on AP side without touching
userspace.

For backwards compatibility to code already using pre-auth over nl80211,
this patch adds a feature flag to disable this behavior, while it remains
enabled by default. An additional ext. feature flag is added to detect this
from userspace.

Thanks to Jouni for pointing out, that pre-auth frames should be handled as
ordinary data frames.

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20200312091055.54257-2-markus.theil@tu-ilmenau.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:19 +01:00
Eric Dumazet
b738a185be tcp: ensure skb->dev is NULL before leaving TCP stack
skb->rbnode is sharing three skb fields : next, prev, dev

When a packet is sent, TCP keeps the original skb (master)
in a rtx queue, which was converted to rbtree a while back.

__tcp_transmit_skb() is responsible to clone the master skb,
and add the TCP header to the clone before sending it
to network layer.

skb_clone() already clears skb->next and skb->prev, but copies
the master oskb->dev into the clone.

We need to clear skb->dev, otherwise lower layers could interpret
the value as a pointer to a netdev.

This old bug surfaced recently when commit 28f8bfd1ac
("netfilter: Support iif matches in POSTROUTING") was merged.

Before this netfilter commit, skb->dev value was ignored and
changed before reaching dev_queue_xmit()

Fixes: 75c119afe1 ("tcp: implement rb-tree based retransmit queue")
Fixes: 28f8bfd1ac ("netfilter: Support iif matches in POSTROUTING")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Martin Zaharinov <micron10@gmail.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 21:35:55 -07:00
David S. Miller
43861da75e Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2020-03-19

Here's the main bluetooth-next pull request for the 5.7 kernel.

 - Added wideband speech support to mgmt and the ability for HCI drivers
   to declare support for it.
 - Added initial support for L2CAP Enhanced Credit Based Mode
 - Fixed suspend handling for several use cases
 - Fixed Extended Advertising related issues
 - Added support for Realtek 8822CE device
 - Added DT bindings for QTI chip WCN3991
 - Cleanups to replace zero-length arrays with flexible-array members
 - Several other smaller cleanups & fixes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 21:33:27 -07:00
Petr Machata
2ce124109c net: tc_skbedit: Make the skbedit priority offloadable
The skbedit action "priority" is used for adjusting SKB priority. Allow
drivers to offload the action by introducing two new skbedit getters and a
new flow action, and initializing appropriately in tc_setup_flow_action().

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 21:09:19 -07:00
David S. Miller
3ac9eb4210 RxRPC fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl5zWDsACgkQ+7dXa6fL
 C2upyg/+KFOmCLFEAgwRnBn4zDDcdDT9du25Duv2d/XfAo2Zx+Nbwm7jjKR/mrRZ
 mRbcvb8qj92O4dzMCwcqDGpKT3xJmCZhxJQORBm55Bjme7tJDqXuQVYp1fZVy3Ka
 XJS0jr4n5HTorW8iGSIPJmE76XpIPq0ANhPnLbq8wZELyw87K7+J5ZdHcnUh+myd
 uKs8sIQ8PQZg6JBBj5wPRgrAkOFUTTINiUqy37ADIY1oZyzW1rUlAeAxVXV7Dnx7
 G1HvlVaDw72G1XG4pn0pNBCdGJuNF0dG2zRbdjS+kGCmf6MB6x8e22JjWW9r+r9m
 iJd4B2R/3V/kUn4i3B+jfOWD5DKzCW4lDixh9D2LzM16GUinYQTkrH9e8jMBBJGW
 7p7X9Vl3o0Nt6NDVLmTKuyomvvtT/jMYiDtKjPuvxlPCGduXB8HvNRFxsKIEVRHi
 4RcdTqUSOsyUnOvTfDTfyBu1srKFqTC3HzAunntV88UfGtWdhXRCWMejHdNK3uI9
 BC4Ym6jkmFnbQzytW/6noprvVlDfgAuyplcyhnnJ5fVNm4YQ7lZLZPgf5TS+gchI
 fMwDfRz3hOLDZ5WjCx6QLT1NHaowQLTrzTq0X3uj2ZrcnRORURvk8GfamzZoS9a5
 omyQgBfm+1YpF2VwCyU42DytdmFDUCDofKondOXh8QciwhXqaRs=
 =bvsn
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-fixes-20200319' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc, afs: Interruptibility fixes

Here are a number of fixes for AF_RXRPC and AFS that make AFS system calls
less interruptible and so less likely to leave the filesystem in an
uncertain state.  There's also a miscellaneous patch to make tracing
consistent.

 (1) Firstly, abstract out the Tx space calculation in sendmsg.  Much the
     same code is replicated in a number of places that subsequent patches
     are going to alter, including adding another copy.

 (2) Fix Tx interruptibility by allowing a kernel service, such as AFS, to
     request that a call be interruptible only when waiting for a call slot
     to become available (ie. the call has not taken place yet) or that a
     call be not interruptible at all (e.g. when we want to do writeback
     and don't want a signal interrupting a VM-induced writeback).

 (3) Increase the minimum delay on MSG_WAITALL for userspace sendmsg() when
     waiting for Tx buffer space as a 2*RTT delay is really small over 10G
     ethernet and a 1 jiffy timeout might be essentially 0 if at the end of
     the jiffy period.

 (4) Fix some tracing output in AFS to make it consistent with rxrpc.

 (5) Make sure aborted asynchronous AFS operations are tidied up properly
     so we don't end up with stuck rxrpc calls.

 (6) Make AFS client calls uninterruptible in the Rx phase.  If we don't
     wait for the reply to be fully gathered, we can't update the local VFS
     state and we end up in an indeterminate state with respect to the
     server.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 20:28:34 -07:00
Nikolay Aleksandrov
56d099761a net: bridge: vlan: include stats in dumps if requested
This patch adds support for vlan stats to be included when dumping vlan
information. We have to dump them only when explicitly requested (thus the
flag below) because that disables the vlan range compression and will make
the dump significantly larger. In order to request the stats to be
included we add a new dump attribute called BRIDGE_VLANDB_DUMP_FLAGS which
can affect dumps with the following first flag:
  - BRIDGE_VLANDB_DUMPF_STATS
The stats are intentionally nested and put into separate attributes to make
it easier for extending later since we plan to add per-vlan mcast stats,
drop stats and possibly STP stats. This is the last missing piece from the
new vlan API which makes the dumped vlan information complete.

A dump request which should include stats looks like:
 [BRIDGE_VLANDB_DUMP_FLAGS] |= BRIDGE_VLANDB_DUMPF_STATS

A vlandb entry attribute with stats looks like:
 [BRIDGE_VLANDB_ENTRY] = {
     [BRIDGE_VLANDB_ENTRY_STATS] = {
         [BRIDGE_VLANDB_STATS_RX_BYTES]
         [BRIDGE_VLANDB_STATS_RX_PACKETS]
         ...
     }
 }

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 20:21:47 -07:00
Paolo Abeni
0be534f5c0 mptcp: rename fourth ack field
The name is misleading, it actually tracks the 'fully established'
status.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 20:19:34 -07:00
Edward Cree
15ff197237 netfilter: flowtable: populate addr_type mask
nf_flow_rule_match() sets control.addr_type in key, so needs to also set
 the corresponding mask.  An exact match is wanted, so mask is all ones.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19 21:20:04 +01:00
wenxu
dc264f1f7e netfilter: flowtable: fix NULL pointer dereference in tunnel offload support
The tc ct action does not cache the route in the flowtable entry.

Fixes: 88bf6e4114 ("netfilter: flowtable: add tunnel encap/decap action offload support")
Fixes: cfab6dbd0e ("netfilter: flowtable: add tunnel match offload support")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19 21:06:17 +01:00
Paul Blakey
c921ffe853 netfilter: flowtable: Fix flushing of offloaded flows on free
Freeing a flowtable with offloaded flows, the flow are deleted from
hardware but are not deleted from the flow table, leaking them,
and leaving their offload bit on.

Add a second pass of the disabled gc to delete the these flows from
the flow table before freeing it.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19 21:05:30 +01:00
Haishuang Yan
41e9ec5a54 netfilter: flowtable: reload ip{v6}h in nf_flow_tuple_ip{v6}
Since pskb_may_pull may change skb->data, so we need to reload ip{v6}h at
the right place.

Fixes: a908fdec3d ("netfilter: nf_flow_table: move ipv6 offload hook code to nf_flow_table")
Fixes: 7d20868717 ("netfilter: nf_flow_table: move ipv4 offload hook code to nf_flow_table")
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19 21:05:05 +01:00
Haishuang Yan
61abaf02d2 netfilter: flowtable: reload ip{v6}h in nf_flow_nat_ip{v6}
Since nf_flow_snat_port and nf_flow_snat_ip{v6} call pskb_may_pull()
which may change skb->data, so we need to reload ip{v6}h at the right
place.

Fixes: a908fdec3d ("netfilter: nf_flow_table: move ipv6 offload hook code to nf_flow_table")
Fixes: 7d20868717 ("netfilter: nf_flow_table: move ipv4 offload hook code to nf_flow_table")
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19 21:04:50 +01:00
Pablo Neira Ayuso
475beb9c8d netfilter: nf_tables: add nft_set_elem_expr_destroy() and use it
This patch adds nft_set_elem_expr_destroy() to destroy stateful
expressions in set elements.

This patch also updates the commit path to call this function to invoke
expr->ops->destroy_clone when required.

This is implicitly fixing up a module reference counter leak and
a memory leak in expressions that allocated internal state, e.g.
nft_counter.

Fixes: 4094445229 ("netfilter: nf_tables: add elements with stateful expressions")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19 11:37:32 +01:00
Pablo Neira Ayuso
772f4e82b3 netfilter: nf_tables: fix double-free on set expression from the error path
After copying the expression to the set element extension, release the
expression and reset the pointer to avoid a double-free from the error
path.

Fixes: 4094445229 ("netfilter: nf_tables: add elements with stateful expressions")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19 11:37:31 +01:00
Pablo Neira Ayuso
65038428b2 netfilter: nf_tables: allow to specify stateful expression in set definition
This patch allows users to specify the stateful expression for the
elements in this set via NFTA_SET_EXPR. This new feature allows you to
turn on counters for all of the elements in this set.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19 11:37:31 +01:00
Pablo Neira Ayuso
0c2a85edd1 netfilter: nf_tables: pass context to nft_set_destroy()
The patch that adds support for stateful expressions in set definitions
require this.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19 11:37:31 +01:00
Pablo Neira Ayuso
c604cc691c netfilter: nf_tables: move nft_expr_clone() to nf_tables_api.c
Move the nft_expr_clone() helper function to the core.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19 11:37:31 +01:00
Petr Machata
2c4b58dc75 net: sched: Fix hw_stats_type setting in pedit loop
In the commit referenced below, hw_stats_type of an entry is set for every
entry that corresponds to a pedit action. However, the assignment is only
done after the entry pointer is bumped, and therefore could overwrite
memory outside of the entries array.

The reason for this positioning may have been that the current entry's
hw_stats_type is already set above, before the action-type dispatch.
However, if there are no more actions, the assignment is wrong. And if
there are, the next round of the for_each_action loop will make the
assignment before the action-type dispatch anyway.

Therefore fix this issue by simply reordering the two lines.

Fixes: 74522e7baa ("net: sched: set the hw_stats_type in pedit loop")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-18 16:52:04 -07:00
Paul Blakey
dd2af10402 net/sched: act_ct: Fix leak of ct zone template on replace
Currently, on replace, the previous action instance params
is swapped with a newly allocated params. The old params is
only freed (via kfree_rcu), without releasing the allocated
ct zone template related to it.

Call tcf_ct_params_free (via call_rcu) for the old params,
so it will release it.

Fixes: b57dc7c13e ("net/sched: Introduce action ct")
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-18 16:37:06 -07:00
Daniel Borkmann
357b6cc583 netfilter: revert introduction of egress hook
This reverts the following commits:

  8537f78647 ("netfilter: Introduce egress hook")
  5418d3881e ("netfilter: Generalize ingress hook")
  b030f194ae ("netfilter: Rename ingress hook include file")

>From the discussion in [0], the author's main motivation to add a hook
in fast path is for an out of tree kernel module, which is a red flag
to begin with. Other mentioned potential use cases like NAT{64,46}
is on future extensions w/o concrete code in the tree yet. Revert as
suggested [1] given the weak justification to add more hooks to critical
fast-path.

  [0] https://lore.kernel.org/netdev/cover.1583927267.git.lukas@wunner.de/
  [1] https://lore.kernel.org/netdev/20200318.011152.72770718915606186.davem@davemloft.net/

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Miller <davem@davemloft.net>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Nacked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-18 16:35:48 -07:00
Dmitry Grinberg
ba7c1b47c1 Bluetooth: Do not cancel advertising when starting a scan
BlueZ cancels adv when starting a scan, but does not cancel a scan when
starting to adv. Neither is required, so this brings both to a
consistent state (of not affecting each other). Some very rare (I've
never seen one) BT 4.0 chips will fail to do both at once. Even this is
ok since the command that will fail will be the second one, and thus the
common sense logic of first-come-first-served is preserved for BLE
requests.

Signed-off-by: Dmitry Grinberg <dmitrygr@google.com>
Signed-off-by: Manish Mandlik <mmandlik@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-18 12:25:03 +01:00
David S. Miller
a58741ef1e Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Use nf_flow_offload_tuple() to fetch flow stats, from Paul Blakey.

2) Add new xt_IDLETIMER hard mode, from Manoj Basapathi.
   Follow up patch to clean up this new mode, from Dan Carpenter.

3) Add support for geneve tunnel options, from Xin Long.

4) Make sets built-in and remove modular infrastructure for sets,
   from Florian Westphal.

5) Remove unused TEMPLATE_NULLS_VAL, from Li RongQing.

6) Statify nft_pipapo_get, from Chen Wandun.

7) Use C99 flexible-array member, from Gustavo A. R. Silva.

8) More descriptive variable names for bitwise, from Jeremy Sowden.

9) Four patches to add tunnel device hardware offload to the flowtable
   infrastructure, from wenxu.

10) pipapo set supports for 8-bit grouping, from Stefano Brivio.

11) pipapo can switch between nibble and byte grouping, also from
    Stefano.

12) Add AVX2 vectorized version of pipapo, from Stefano Brivio.

13) Update pipapo to be use it for single ranges, from Stefano.

14) Add stateful expression support to elements via control plane,
    eg. counter per element.

15) Re-visit sysctls in unprivileged namespaces, from Florian Westphal.

15) Add new egress hook, from Lukas Wunner.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17 23:51:31 -07:00
Mauro Carvalho Chehab
2de9780f75 net: core: dev.c: fix a documentation warning
There's a markup for link with is "foo_". On this kernel-doc
comment, we don't want this, but instead, place a literal
reference. So, escape the literal with ``foo``, in order to
avoid this warning:

	./net/core/dev.c:5195: WARNING: Unknown target name: "page_is".

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17 23:39:29 -07:00
Paolo Abeni
7f20d5fc70 mptcp: move msk state update to subflow_syn_recv_sock()
After commit 58b0991962 ("mptcp: create msk early"), the
msk socket is already available at subflow_syn_recv_sock()
time. Let's move there the state update, to mirror more
closely the first subflow state.

The above will also help multiple subflow supports.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17 22:52:24 -07:00
Nikolay Aleksandrov
569da08228 net: bridge: vlan options: add support for tunnel mapping set/del
This patch adds support for manipulating vlan/tunnel mappings. The
tunnel ids are globally unique and are one per-vlan. There were two
trickier issues - first in order to support vlan ranges we have to
compute the current tunnel id in the following way:
 - base tunnel id (attr) + current vlan id - starting vlan id
This is in line how the old API does vlan/tunnel mapping with ranges. We
already have the vlan range present, so it's redundant to add another
attribute for the tunnel range end. It's simply base tunnel id + vlan
range. And second to support removing mappings we need an out-of-band way
to tell the option manipulating function because there are no
special/reserved tunnel id values, so we use a vlan flag to denote the
operation is tunnel mapping removal.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17 22:47:12 -07:00
Nikolay Aleksandrov
188c67dd19 net: bridge: vlan options: add support for tunnel id dumping
Add a new option - BRIDGE_VLANDB_ENTRY_TUNNEL_ID which is used to dump
the tunnel id mapping. Since they're unique per vlan they can enter a
vlan range if they're consecutive, thus we can calculate the tunnel id
range map simply as: vlan range end id - vlan range start id. The
starting point is the tunnel id in BRIDGE_VLANDB_ENTRY_TUNNEL_ID. This
is similar to how the tunnel entries can be created in a range via the
old API (a vlan range maps to a tunnel range).

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17 22:47:12 -07:00
Nikolay Aleksandrov
53e96632ab net: bridge: vlan tunnel: constify bridge and port arguments
The vlan tunnel code changes vlan options, it shouldn't touch port or
bridge options so we can constify the port argument. This would later help
us to re-use these functions from the vlan options code.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17 22:47:12 -07:00
Nikolay Aleksandrov
99f7c5e096 net: bridge: vlan options: rename br_vlan_opts_eq to br_vlan_opts_eq_range
It is more appropriate name as it shows the intent of why we need to
check the options' state. It also allows us to give meaning to the two
arguments of the function: the first is the current vlan (v_curr) being
checked if it could enter the range ending in the second one (range_end).

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17 22:47:12 -07:00
Eric Dumazet
583396f4ca net_sched: sch_fq: enable use of hrtimer slack
Add a new attribute to control the fq qdisc hrtimer slack.

Default is set to 10 usec.

When/if packets are throttled, fq set up an hrtimer that can
lead to one interrupt per packet in the throttled queue.

By using a timer slack, we allow better use of timer interrupts,
by giving them a chance to call multiple timer callbacks
at each hardware interrupt.

Also, giving a slack allows FQ to dequeue batches of packets
instead of a single one, thus increasing xmit_more efficiency.

This has no negative effect on the rate a TCP flow can sustain,
since each TCP flow maintains its own precise vtime (tp->tcp_wstamp_ns)

v2: added strict netlink checking (as feedback from Jakub Kicinski)

Tested:
 1000 concurrent flows all using paced packets.
 1,000,000 packets sent per second.

Before the patch :

$ vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 60726784  23628 3485992    0    0   138     1  977  535  0 12 87  0  0
 0  0      0 60714700  23628 3485628    0    0     0     0 1568827 26462  0 22 78  0  0
 1  0      0 60716012  23628 3485656    0    0     0     0 1570034 26216  0 22 78  0  0
 0  0      0 60722420  23628 3485492    0    0     0     0 1567230 26424  0 22 78  0  0
 0  0      0 60727484  23628 3485556    0    0     0     0 1568220 26200  0 22 78  0  0
 2  0      0 60718900  23628 3485380    0    0     0    40 1564721 26630  0 22 78  0  0
 2  0      0 60718096  23628 3485332    0    0     0     0 1562593 26432  0 22 78  0  0
 0  0      0 60719608  23628 3485064    0    0     0     0 1563806 26238  0 22 78  0  0
 1  0      0 60722876  23628 3485236    0    0     0   130 1565874 26566  0 22 78  0  0
 1  0      0 60722752  23628 3484908    0    0     0     0 1567646 26247  0 22 78  0  0

After the patch, slack of 10 usec, we can see a reduction of interrupts
per second, and a small decrease of reported cpu usage.

$ vmstat 2 10
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 60722564  23628 3484728    0    0   133     1  696  545  0 13 87  0  0
 1  0      0 60722568  23628 3484824    0    0     0     0 977278 25469  0 20 80  0  0
 0  0      0 60716396  23628 3484764    0    0     0     0 979997 25326  0 20 80  0  0
 0  0      0 60713844  23628 3484960    0    0     0     0 981394 25249  0 20 80  0  0
 2  0      0 60720468  23628 3484916    0    0     0     0 982860 25062  0 20 80  0  0
 1  0      0 60721236  23628 3484856    0    0     0     0 982867 25100  0 20 80  0  0
 1  0      0 60722400  23628 3484456    0    0     0     8 982698 25303  0 20 80  0  0
 0  0      0 60715396  23628 3484428    0    0     0     0 981777 25176  0 20 80  0  0
 0  0      0 60716520  23628 3486544    0    0     0    36 978965 27857  0 21 79  0  0
 0  0      0 60719592  23628 3486516    0    0     0    22 977318 25106  0 20 80  0  0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17 21:16:35 -07:00
Eric Dumazet
b88948fbc7 net_sched: do not reprogram a timer about to expire
qdisc_watchdog_schedule_range_ns() can use the newly added slack
and avoid rearming the hrtimer a bit earlier than the current
value. This patch has no effect if delta_ns parameter
is zero.

Note that this means the max slack is potentially doubled.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17 21:16:35 -07:00
Eric Dumazet
efe074c2cc net_sched: add qdisc_watchdog_schedule_range_ns()
Some packet schedulers might want to add a slack
when programming hrtimers. This can reduce number
of interrupts and increase batch sizes and thus
give good xmit_more savings.

This commit adds qdisc_watchdog_schedule_range_ns()
helper, with an extra delta_ns parameter.

Legacy qdisc_watchdog_schedule_n() becomes an inline
passing a zero slack.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17 21:16:34 -07:00
Jakub Kicinski
53eca1f347 net: rename flow_action_hw_stats_types* -> flow_action_hw_stats*
flow_action_hw_stats_types_check() helper takes one of the
FLOW_ACTION_HW_STATS_*_BIT values as input. If we align
the arguments to the opening bracket of the helper there
is no way to call this helper and stay under 80 characters.

Remove the "types" part from the new flow_action helpers
and enum values.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17 21:12:39 -07:00
Jakub Kicinski
9000edb71a net: ethtool: require drivers to set supported_coalesce_params
Now that all in-tree drivers have been updated we can
make the supported_coalesce_params mandatory.

To save debugging time in case some driver was missed
(or is out of tree) add a warning when netdev is registered
with set_coalesce but without supported_coalesce_params.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17 20:56:58 -07:00
Lukas Wunner
8537f78647 netfilter: Introduce egress hook
Commit e687ad60af ("netfilter: add netfilter ingress hook after
handle_ing() under unique static key") introduced the ability to
classify packets on ingress.

Allow the same on egress.  Position the hook immediately before a packet
is handed to tc and then sent out on an interface, thereby mirroring the
ingress order.  This order allows marking packets in the netfilter
egress hook and subsequently using the mark in tc.  Another benefit of
this order is consistency with a lot of existing documentation which
says that egress tc is performed after netfilter hooks.

Egress hooks already exist for the most common protocols, such as
NF_INET_LOCAL_OUT or NF_ARP_OUT, and those are to be preferred because
they are executed earlier during packet processing.  However for more
exotic protocols, there is currently no provision to apply netfilter on
egress.  A common workaround is to enslave the interface to a bridge and
use ebtables, or to resort to tc.  But when the ingress hook was
introduced, consensus was that users should be given the choice to use
netfilter or tc, whichever tool suits their needs best:
https://lore.kernel.org/netdev/20150430153317.GA3230@salvia/
This hook is also useful for NAT46/NAT64, tunneling and filtering of
locally generated af_packet traffic such as dhclient.

There have also been occasional user requests for a netfilter egress
hook in the past, e.g.:
https://www.spinics.net/lists/netfilter/msg50038.html

Performance measurements with pktgen surprisingly show a speedup rather
than a slowdown with this commit:

* Without this commit:
  Result: OK: 34240933(c34238375+d2558) usec, 100000000 (60byte,0frags)
  2920481pps 1401Mb/sec (1401830880bps) errors: 0

* With this commit:
  Result: OK: 33997299(c33994193+d3106) usec, 100000000 (60byte,0frags)
  2941410pps 1411Mb/sec (1411876800bps) errors: 0

* Without this commit + tc egress:
  Result: OK: 39022386(c39019547+d2839) usec, 100000000 (60byte,0frags)
  2562631pps 1230Mb/sec (1230062880bps) errors: 0

* With this commit + tc egress:
  Result: OK: 37604447(c37601877+d2570) usec, 100000000 (60byte,0frags)
  2659259pps 1276Mb/sec (1276444320bps) errors: 0

* With this commit + nft egress:
  Result: OK: 41436689(c41434088+d2600) usec, 100000000 (60byte,0frags)
  2413320pps 1158Mb/sec (1158393600bps) errors: 0

Tested on a bare-metal Core i7-3615QM, each measurement was performed
three times to verify that the numbers are stable.

Commands to perform a measurement:
modprobe pktgen
echo "add_device lo@3" > /proc/net/pktgen/kpktgend_3
samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i 'lo@3' -n 100000000

Commands for testing tc egress:
tc qdisc add dev lo clsact
tc filter add dev lo egress protocol ip prio 1 u32 match ip dst 4.3.2.1/32

Commands for testing nft egress:
nft add table netdev t
nft add chain netdev t co \{ type filter hook egress device lo priority 0 \; \}
nft add rule netdev t co ip daddr 4.3.2.1/32 drop

All testing was performed on the loopback interface to avoid distorting
measurements by the packet handling in the low-level Ethernet driver.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-18 01:20:15 +01:00
Lukas Wunner
5418d3881e netfilter: Generalize ingress hook
Prepare for addition of a netfilter egress hook by generalizing the
ingress hook introduced by commit e687ad60af ("netfilter: add
netfilter ingress hook after handle_ing() under unique static key").

In particular, rename and refactor the ingress hook's static inlines
such that they can be reused for an egress hook.

No functional change intended.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-18 01:20:09 +01:00
Lukas Wunner
b030f194ae netfilter: Rename ingress hook include file
Prepare for addition of a netfilter egress hook by renaming
<linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>.

The egress hook also necessitates a refactoring of the include file,
but that is done in a separate commit to ease reviewing.

No functional change intended.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-18 01:20:04 +01:00
Martin KaFai Lau
8e7ae2518f bpf: Sanitize the bpf_struct_ops tcp-cc name
The bpf_struct_ops tcp-cc name should be sanitized in order to
avoid problematic chars (e.g. whitespaces).

This patch reuses the bpf_obj_name_cpy() for accepting the same set
of characters in order to keep a consistent bpf programming experience.
A "size" param is added.  Also, the strlen is returned on success so
that the caller (like the bpf_tcp_ca here) can error out on empty name.
The existing callers of the bpf_obj_name_cpy() only need to change the
testing statement to "if (err < 0)".  For all these existing callers,
the err will be overwritten later, so no extra change is needed
for the new strlen return value.

v3:
  - reverse xmas tree style
v2:
  - Save the orig_src to avoid "end - size" (Andrii)

Fixes: 0baf26b0fc ("bpf: tcp: Support tcp_congestion_ops in bpf")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200314010209.1131542-1-kafai@fb.com
2020-03-17 20:40:19 +01:00
Pengcheng Yang
fa4cb9eba3 tcp: fix stretch ACK bugs in Yeah
Change Yeah to properly handle stretch ACKs in additive
increase mode by passing in the count of ACKed packets
to tcp_cong_avoid_ai().

In addition, we re-implemented the scalable path using
tcp_cong_avoid_ai() and removed the pkts_acked variable.

Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16 18:26:55 -07:00
Pengcheng Yang
ca04f5d4bb tcp: fix stretch ACK bugs in Veno
Change Veno to properly handle stretch ACKs in additive
increase mode by passing in the count of ACKed packets
to tcp_cong_avoid_ai().

Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16 18:26:55 -07:00
Pengcheng Yang
d861b5c753 tcp: stretch ACK fixes in Veno prep
No code logic has been changed in this patch.

Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16 18:26:55 -07:00
Pengcheng Yang
5415e3c37a tcp: fix stretch ACK bugs in Scalable
Change Scalable to properly handle stretch ACKs in additive
increase mode by passing in the count of ACKed packets to
tcp_cong_avoid_ai().

In addition, because we are now precisely accounting for
stretch ACKs, including delayed ACKs, we can now change
TCP_SCALABLE_AI_CNT to 100.

Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16 18:26:54 -07:00
Pengcheng Yang
be0d935ebf tcp: fix stretch ACK bugs in BIC
Changes BIC to properly handle stretch ACKs in additive
increase mode by passing in the count of ACKed packets
to tcp_cong_avoid_ai().

Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16 18:26:54 -07:00
Petr Machata
32ca98feab net: ip_gre: Accept IFLA_INFO_DATA-less configuration
The fix referenced below causes a crash when an ERSPAN tunnel is created
without passing IFLA_INFO_DATA. Fix by validating passed-in data in the
same way as ipgre does.

Fixes: e1f8f78ffe ("net: ip_gre: Separate ERSPAN newlink / changelink callbacks")
Reported-by: syzbot+1b4ebf4dae4e510dd219@syzkaller.appspotmail.com
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16 17:19:56 -07:00
Madhuparna Bhowmik
1963507e62 net: kcm: kcmproc.c: Fix RCU list suspicious usage warning
This path fixes the suspicious RCU usage warning reported by
kernel test robot.

net/kcm/kcmproc.c:#RCU-list_traversed_in_non-reader_section

There is no need to use list_for_each_entry_rcu() in
kcm_stats_seq_show() as the list is always traversed under
knet->mutex held.

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16 17:14:02 -07:00
Trond Myklebust
78a947f50a sunrpc: Add tracing for cache events
Add basic tracing for debugging the sunrpc cache events.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:34 -04:00
Trond Myklebust
277f27e2f2 SUNRPC/cache: Allow garbage collection of invalid cache entries
If the cache entry never gets initialised, we want the garbage
collector to be able to evict it. Otherwise if the upcall daemon
fails to initialise the entry, we end up never expiring it.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
[ cel: resolved a merge conflict ]
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:34 -04:00
Trond Myklebust
65286b883c nfsd: export upcalls must not return ESTALE when mountd is down
If the rpc.mountd daemon goes down, then that should not cause all
exports to start failing with ESTALE errors. Let's explicitly
distinguish between the cache upcall cases that need to time out,
and those that do not.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Chuck Lever
da1661b93b SUNRPC: Teach server to use xprt_sock_sendmsg for socket sends
xprt_sock_sendmsg uses the more efficient iov_iter-enabled kernel
socket API, and is a pre-requisite for server send-side support for
TLS.

Note that svc_process no longer needs to reserve a word for the
stream record marker, since the TCP transport now provides the
record marker automatically in a separate buffer.

The dprintk() in svc_send_common is also removed. It didn't seem
crucial for field troubleshooting. If more is needed there, a trace
point could be added in xprt_sock_sendmsg().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Chuck Lever
9e55eef4ab SUNRPC: Refactor xs_sendpages()
Re-locate xs_sendpages() so that it can be shared with server code.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Chuck Lever
0dabe948f2 svcrdma: Avoid DMA mapping small RPC Replies
On some platforms, DMA mapping part of a page is more costly than
copying bytes. Indeed, not involving the I/O MMU can help the
RPC/RDMA transport scale better for tiny I/Os across more RDMA
devices. This is because interaction with the I/O MMU is eliminated
for each of these small I/Os. Without the explicit unmapping, the
NIC no longer needs to do a costly internal TLB shoot down for
buffers that are just a handful of bytes.

Since pull-up is now a more a frequent operation, I've introduced a
trace point in the pull-up path. It can be used for debugging or
user-space tools that count pull-up frequency.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Chuck Lever
aee4b74a3f svcrdma: Fix double sync of transport header buffer
Performance optimization: Avoid syncing the transport buffer twice
when Reply buffer pull-up is necessary.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Chuck Lever
6fd5034db4 svcrdma: Refactor chunk list encoders
Same idea as the receive-side changes I did a while back: use
xdr_stream helpers rather than open-coding the XDR chunk list
encoders. This builds the Reply transport header from beginning to
end without backtracking.

As additional clean-ups, fill in documenting comments for the XDR
encoders and sprinkle some trace points in the new encoding
functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:33 -04:00
Chuck Lever
5c266df527 SUNRPC: Add encoders for list item discriminators
Clean up. These are taken from the client-side RPC/RDMA transport
to a more global header file so they can be used elsewhere.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:32 -04:00
Chuck Lever
a406c563e8 svcrdma: Rename svcrdma_encode trace points in send routines
These trace points are misnamed:

	trace_svcrdma_encode_wseg
	trace_svcrdma_encode_write
	trace_svcrdma_encode_reply
	trace_svcrdma_encode_rseg
	trace_svcrdma_encode_read
	trace_svcrdma_encode_pzr

Because they actually trace posting on the Send Queue. Let's rename
them so that I can add trace points in the chunk list encoders that
actually do trace chunk list encoding events.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:32 -04:00
Chuck Lever
db9602e404 svcrdma: Update synopsis of svc_rdma_send_reply_msg()
Preparing for subsequent patches, no behavior change expected.

Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto()
path. This enables passing more information about Requester-
provided Write and Reply chunks into those lower-level functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:32 -04:00
Chuck Lever
4554755ed8 svcrdma: Update synopsis of svc_rdma_map_reply_msg()
Preparing for subsequent patches, no behavior change expected.

Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto()
path. This enables passing more information about Requester-
provided Write and Reply chunks into those lower-level functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:32 -04:00
Chuck Lever
6fa5785e78 svcrdma: Update synopsis of svc_rdma_send_reply_chunk()
Preparing for subsequent patches, no behavior change expected.

Pass the RPC Call's svc_rdma_recv_ctxt deeper into the sendto()
path. This enables passing more information about Requester-
provided Write and Reply chunks into the lower-level send
functions.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:32 -04:00
Chuck Lever
2fe8c44633 svcrdma: De-duplicate code that locates Write and Reply chunks
Cache the locations of the Requester-provided Write list and Reply
chunk so that the Send path doesn't need to parse the Call header
again.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:32 -04:00
Chuck Lever
e604aad2ca svcrdma: Use struct xdr_stream to decode ingress transport headers
The logic that checks incoming network headers has to be scrupulous.

De-duplicate: replace open-coded buffer overflow checks with the use
of xdr_stream helpers that are used most everywhere else XDR
decoding is done.

One minor change to the sanity checks: instead of checking the
length of individual segments, cap the length of the whole chunk
to be sure it can fit in the set of pages available in rq_pages.
This should be a better test of whether the server can handle the
chunks in each request.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:32 -04:00
Chuck Lever
2426ddfdf1 svcrdma: Remove svcrdma_cm_event() trace point
Clean up. This trace point is no longer needed because the RDMA/core
CMA code has an equivalent trace point that was added by commit
ed999f820a ("RDMA/cma: Add trace points in RDMA Connection
Manager").

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:32 -04:00
Chuck Lever
b20dfc3fcd svcrdma: Create a generic tracing class for displaying xdr_buf layout
This class can be used to create trace points in either the RPC
client or RPC server paths. It simply displays the length of each
part of an xdr_buf, which is useful to determine that the transport
and XDR codecs are operating correctly.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:32 -04:00
Chuck Lever
28155524ea SUNRPC: Clean up: Replace dprintk and BUG_ON call sites in svcauth_gss.c
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:31 -04:00
Chuck Lever
96f194b715 SUNRPC: Add xdr_pad_size() helper
Introduce a helper function to compute the XDR pad size of a
variable-length XDR object.

Clean up: Replace open-coded calculation of XDR pad sizes.
I'm sure I haven't found every instance of this calculation.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:31 -04:00
Chuck Lever
758a3bf945 svcrdma: Fix double svc_rdma_send_ctxt_put() in an error path
This error path is almost never executed. Found by code inspection.

Fixes: 99722fe4d5 ("svcrdma: Persistently allocate and DMA-map Send buffers")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:31 -04:00
Chuck Lever
412055398b nfsd: Fix NFSv4 READ on RDMA when using readv
svcrdma expects that the payload falls precisely into the xdr_buf
page vector. This does not seem to be the case for
nfsd4_encode_readv().

This code is called only when fops->splice_read is missing or when
RQ_SPLICE_OK is clear, so it's not a noticeable problem in many
common cases.

Add new transport method: ->xpo_read_payload so that when a READ
payload does not fit exactly in rq_res's page vector, the XDR
encoder can inform the RPC transport exactly where that payload is,
without the payload's XDR pad.

That way, when a Write chunk is present, the transport knows what
byte range in the Reply message is supposed to be matched with the
chunk.

Note that the Linux NFS server implementation of NFS/RDMA can
currently handle only one Write chunk per RPC-over-RDMA message.
This simplifies the implementation of this fix.

Fixes: b042098063 ("nfsd4: allow exotic read compounds")
Buglink: https://bugzilla.kernel.org/show_bug.cgi?id=198053
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:31 -04:00
Gustavo A. R. Silva
c0fb23f867 svcrdma: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:31 -04:00
Amol Grover
51cae673d0 sunrpc: Pass lockdep expression to RCU lists
detail->hash_table[] is traversed using hlist_for_each_entry_rcu
outside an RCU read-side critical section but under the protection
of detail->hash_lock.

Hence, add corresponding lockdep expression to silence false-positive
warnings, and harden RCU lists.

Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-03-16 12:04:31 -04:00
Chuck Lever
d162372af3 SUNRPC: Trim stack utilization in the wrap and unwrap paths
By preventing compiler inlining of the integrity and privacy
helpers, stack utilization for the common case (authentication only)
goes way down.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-16 10:18:45 -04:00
Chuck Lever
8d6bda7f23 SUNRPC: Remove xdr_buf_read_mic()
Clean up: this function is no longer used.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-16 10:18:45 -04:00
Chuck Lever
4047aa909c sunrpc: Fix gss_unwrap_resp_integ() again
xdr_buf_read_mic() tries to find unused contiguous space in a
received xdr_buf in order to linearize the checksum for the call
to gss_verify_mic. However, the corner cases in this code are
numerous and we seem to keep missing them. I've just hit yet
another buffer overrun related to it.

This overrun is at the end of xdr_buf_read_mic():

1284         if (buf->tail[0].iov_len != 0)
1285                 mic->data = buf->tail[0].iov_base + buf->tail[0].iov_len;
1286         else
1287                 mic->data = buf->head[0].iov_base + buf->head[0].iov_len;
1288         __read_bytes_from_xdr_buf(&subbuf, mic->data, mic->len);
1289         return 0;

This logic assumes the transport has set the length of the tail
based on the size of the received message. base + len is then
supposed to be off the end of the message but still within the
actual buffer.

In fact, the length of the tail is set by the upper layer when the
Call is encoded so that the end of the tail is actually the end of
the allocated buffer itself. This causes the logic above to set
mic->data to point past the end of the receive buffer.

The "mic->data = head" arm of this if statement is no less fragile.

As near as I can tell, this has been a problem forever. I'm not sure
that minimizing au_rslack recently changed this pathology much.

So instead, let's use a more straightforward approach: kmalloc a
separate buffer to linearize the checksum. This is similar to
how gss_validate() currently works.

Coming back to this code, I had some trouble understanding what
was going on. So I've cleaned up the variable naming and added
a few comments that point back to the XDR definition in RFC 2203
to help guide future spelunkers, including myself.

As an added clean up, the functionality that was in
xdr_buf_read_mic() is folded directly into gss_unwrap_resp_integ(),
as that is its only caller.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Reviewed-by: Benjamin Coddington <bcodding@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-16 10:18:44 -04:00
Colin Ian King
68e9a2463d SUNRPC: remove redundant assignments to variable status
The variable status is being initialized with a value that is never
read and it is being updated later with a new value.  The initialization
is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-16 10:10:36 -04:00
Trond Myklebust
263fb9c21e SUNRPC: Don't take a reference to the cred on synchronous tasks
If the RPC call is synchronous, assume the cred is already pinned
by the caller.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-16 08:34:29 -04:00
Trond Myklebust
7eac52648a SUNRPC: Add a flag to avoid reference counts on credentials
Add a flag to signal to the RPC layer that the credential is already
pinned for the duration of the RPC call.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-03-16 08:34:28 -04:00
Torsten Hilbrich
2a9de3af21 vti6: Fix memory leak of skb if input policy check fails
The vti6_rcv function performs some tests on the retrieved tunnel
including checking the IP protocol, the XFRM input policy, the
source and destination address.

In all but one places the skb is released in the error case. When
the input policy check fails the network packet is leaked.

Using the same goto-label discard in this case to fix this problem.

Fixes: ed1efb2aef ("ipv6: Add support for IPsec virtual tunnel interfaces")
Signed-off-by: Torsten Hilbrich <torsten.hilbrich@secunet.com>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-03-16 11:13:48 +01:00
Jiri Pirko
74522e7baa net: sched: set the hw_stats_type in pedit loop
For a single pedit action, multiple offload entries may be used. Set the
hw_stats_type to all of them.

Fixes: 44f8658017 ("sched: act: allow user to specify type of HW stats for a filter")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16 02:13:43 -07:00
Michal Kubecek
2363d73a2f ethtool: reject unrecognized request flags
As pointed out by Jakub Kicinski, we ethtool netlink code should respond
with an error if request head has flags set which are not recognized by
kernel, either as a mistake or because it expects functionality introduced
in later kernel versions.

To avoid unnecessary roundtrips, use extack cookie to provide the
information about supported request flags.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16 02:04:24 -07:00
Michal Kubecek
fe2a31d790 netlink: allow extack cookie also for error messages
Commit ba0dc5f6e0 ("netlink: allow sending extended ACK with cookie on
success") introduced a cookie which can be sent to userspace as part of
extended ack message in the form of NLMSGERR_ATTR_COOKIE attribute.
Currently the cookie is ignored if error code is non-zero but there is
no technical reason for such limitation and it can be useful to provide
machine parseable information as part of an error message.

Include NLMSGERR_ATTR_COOKIE whenever the cookie has been set,
regardless of error code.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16 02:04:24 -07:00
Cong Wang
ef299cc3fa net_sched: cls_route: remove the right filter from hashtable
route4_change() allocates a new filter and copies values from
the old one. After the new filter is inserted into the hash
table, the old filter should be removed and freed, as the final
step of the update.

However, the current code mistakenly removes the new one. This
looks apparently wrong to me, and it causes double "free" and
use-after-free too, as reported by syzbot.

Reported-and-tested-by: syzbot+f9b32aaacd60305d9687@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+2f8c233f131943d6056d@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+9c2df9fd5e9445b74e01@syzkaller.appspotmail.com
Fixes: 1109c00547 ("net: sched: RCU cls_route")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16 01:59:32 -07:00
Taehee Yoo
09e91dbea0 hsr: set .netnsok flag
The hsr module has been supporting the list and status command.
(HSR_C_GET_NODE_LIST and HSR_C_GET_NODE_STATUS)
These commands send node information to the user-space via generic netlink.
But, in the non-init_net namespace, these commands are not allowed
because .netnsok flag is false.
So, there is no way to get node information in the non-init_net namespace.

Fixes: f421436a59 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16 01:46:09 -07:00
Taehee Yoo
ca19c70f52 hsr: add restart routine into hsr_get_node_list()
The hsr_get_node_list() is to send node addresses to the userspace.
If there are so many nodes, it could fail because of buffer size.
In order to avoid this failure, the restart routine is added.

Fixes: f421436a59 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16 01:46:09 -07:00
Taehee Yoo
173756b868 hsr: use rcu_read_lock() in hsr_get_node_{list/status}()
hsr_get_node_{list/status}() are not under rtnl_lock() because
they are callback functions of generic netlink.
But they use __dev_get_by_index() without rtnl_lock().
So, it would use unsafe data.
In order to fix it, rcu_read_lock() and dev_get_by_index_rcu()
are used instead of __dev_get_by_index().

Fixes: f421436a59 ("net/hsr: Add support for the High-availability Seamless Redundancy protocol (HSRv0)")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16 01:46:09 -07:00
Russell King
87615c96e7 net: dsa: warn if phylink_mac_link_state returns error
Issue a warning to the kernel log if phylink_mac_link_state() returns
an error. This should not occur, but let's make it visible.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15 17:11:12 -07:00
Florian Westphal
d0febd81ae netfilter: conntrack: re-visit sysctls in unprivileged namespaces
since commit b884fa4617 ("netfilter: conntrack: unify sysctl handling")
conntrack no longer exposes most of its sysctls (e.g. tcp timeouts
settings) to network namespaces that are not owned by the initial user
namespace.

This patch exposes all sysctls even if the namespace is unpriviliged.

compared to a 4.19 kernel, the newly visible and writeable sysctls are:
  net.netfilter.nf_conntrack_acct
  net.netfilter.nf_conntrack_timestamp
  .. to allow to enable accouting and timestamp extensions.

  net.netfilter.nf_conntrack_events
  .. to turn off conntrack event notifications.

  net.netfilter.nf_conntrack_checksum
  .. to disable checksum validation.

  net.netfilter.nf_conntrack_log_invalid
  .. to enable logging of packets deemed invalid by conntrack.

newly visible sysctls that are only exported as read-only:

  net.netfilter.nf_conntrack_count
  .. current number of conntrack entries living in this netns.

  net.netfilter.nf_conntrack_max
  .. global upperlimit (maximum size of the table).

  net.netfilter.nf_conntrack_buckets
  .. size of the conntrack table (hash buckets).

  net.netfilter.nf_conntrack_expect_max
  .. maximum number of permitted expectations in this netns.

  net.netfilter.nf_conntrack_helper
  .. conntrack helper auto assignment.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:51 +01:00
Pablo Neira Ayuso
339706bc21 netfilter: nft_lookup: update element stateful expression
If the set element comes with an stateful expression, update it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:50 +01:00
Pablo Neira Ayuso
76adfafeca netfilter: nf_tables: add nft_set_elem_update_expr() helper function
This helper function runs the eval path of the stateful expression
of an existing set element.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:49 +01:00
Pablo Neira Ayuso
4094445229 netfilter: nf_tables: add elements with stateful expressions
Update nft_add_set_elem() to handle the NFTA_SET_ELEM_EXPR netlink
attribute. This patch allows users to to add elements with stateful
expressions.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:49 +01:00
Pablo Neira Ayuso
795a6d6b42 netfilter: nf_tables: statify nft_expr_init()
Not exposed anymore to modules, statify this function.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:48 +01:00
Pablo Neira Ayuso
a7fc936804 netfilter: nf_tables: add nft_set_elem_expr_alloc()
Add helper function to create stateful expression.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:47 +01:00
Stefano Brivio
eb16933aa5 nft_set_pipapo: Prepare for single ranged field usage
A few adjustments in nft_pipapo_init() are needed to allow usage of
this set back-end for a single, ranged field.

Provide a convenient NFT_PIPAPO_MIN_FIELDS definition that currently
makes sure that the rbtree back-end is selected instead, for sets
with a single field.

This finally allows a fair comparison with rbtree sets, by defining
NFT_PIPAPO_MIN_FIELDS as 0 and skipping rbtree back-end initialisation:

 ---------------.--------------------------.-------------------------.
 AMD Epyc 7402  |      baselines, Mpps     |   Mpps, % over rbtree   |
  1 thread      |__________________________|_________________________|
  3.35GHz       |        |        |        |            |            |
  768KiB L1D$   | netdev |  hash  | rbtree |            |   pipapo   |
 ---------------|  hook  |   no   | single |   pipapo   |single field|
 type   entries |  drop  | ranges | field  |single field|    AVX2    |
 ---------------|--------|--------|--------|------------|------------|
 net,port       |        |        |        |            |            |
          1000  |   19.0 |   10.4 |    3.8 | 6.0   +58% | 9.6  +153% |
 ---------------|--------|--------|--------|------------|------------|
 port,net       |        |        |        |            |            |
           100  |   18.8 |   10.3 |    5.8 | 9.1   +57% |11.6  +100% |
 ---------------|--------|--------|--------|------------|------------|
 net6,port      |        |        |        |            |            |
          1000  |   16.4 |    7.6 |    1.8 | 2.8   +55% | 6.5  +261% |
 ---------------|--------|--------|--------|------------|------------|
 port,proto     |        |        |        |     [1]    |    [1]     |
         30000  |   19.6 |   11.6 |    3.9 | 0.9   -77% | 2.7   -31% |
 ---------------|--------|--------|--------|------------|------------|
 port,proto     |        |        |        |            |            |
         10000  |   19.6 |   11.6 |    4.4 | 2.1   -52% | 5.6   +27% |
 ---------------|--------|--------|--------|------------|------------|
 port,proto     |        |        |        |            |            |
 4 threads 10000|   77.9 |   45.1 |   17.4 | 8.3   -52% |22.4   +29% |
 ---------------|--------|--------|--------|------------|------------|
 net6,port,mac  |        |        |        |            |            |
            10  |   16.5 |    5.4 |    4.3 | 4.5    +5% | 8.2   +91% |
 ---------------|--------|--------|--------|------------|------------|
 net6,port,mac, |        |        |        |            |            |
 proto    1000  |   16.5 |    5.7 |    1.9 | 2.8   +47% | 6.6  +247% |
 ---------------|--------|--------|--------|------------|------------|
 net,mac        |        |        |        |            |            |
          1000  |   19.0 |    8.4 |    3.9 | 6.0   +54% | 9.9  +154% |
 ---------------'--------'--------'--------'------------'------------'
 [1] Causes switch of lookup table buckets for 'port' to 4-bit groups

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:46 +01:00
Stefano Brivio
7400b06396 nft_set_pipapo: Introduce AVX2-based lookup implementation
If the AVX2 set is available, we can exploit the repetitive
characteristic of this algorithm to provide a fast, vectorised
version by using 256-bit wide AVX2 operations for bucket loads and
bitwise intersections.

In most cases, this implementation consistently outperforms rbtree
set instances despite the fact they are configured to use a given,
single, ranged data type out of the ones used for performance
measurements by the nft_concat_range.sh kselftest.

That script, injecting packets directly on the ingoing device path
with pktgen, reports, averaged over five runs on a single AMD Epyc
7402 thread (3.35GHz, 768 KiB L1D$, 12 MiB L2$), the figures below.
CONFIG_RETPOLINE was not set here.

Note that this is not a fair comparison over hash and rbtree set
types: non-ranged entries (used to have a reference for hash types)
would be matched faster than this, and matching on a single field
only (which is the case for rbtree) is also significantly faster.

However, it's not possible at the moment to choose this set type
for non-ranged entries, and the current implementation also needs
a few minor adjustments in order to match on less than two fields.

 ---------------.-----------------------------------.------------.
 AMD Epyc 7402  |          baselines, Mpps          | this patch |
  1 thread      |___________________________________|____________|
  3.35GHz       |        |        |        |        |            |
  768KiB L1D$   | netdev |  hash  | rbtree |        |            |
 ---------------|  hook  |   no   | single |        |   pipapo   |
 type   entries |  drop  | ranges | field  | pipapo |    AVX2    |
 ---------------|--------|--------|--------|--------|------------|
 net,port       |        |        |        |        |            |
          1000  |   19.0 |   10.4 |    3.8 |    4.0 | 7.5   +87% |
 ---------------|--------|--------|--------|--------|------------|
 port,net       |        |        |        |        |            |
           100  |   18.8 |   10.3 |    5.8 |    6.3 | 8.1   +29% |
 ---------------|--------|--------|--------|--------|------------|
 net6,port      |        |        |        |        |            |
          1000  |   16.4 |    7.6 |    1.8 |    2.1 | 4.8  +128% |
 ---------------|--------|--------|--------|--------|------------|
 port,proto     |        |        |        |        |            |
         30000  |   19.6 |   11.6 |    3.9 |    0.5 | 2.6  +420% |
 ---------------|--------|--------|--------|--------|------------|
 net6,port,mac  |        |        |        |        |            |
            10  |   16.5 |    5.4 |    4.3 |    3.4 | 4.7   +38% |
 ---------------|--------|--------|--------|--------|------------|
 net6,port,mac, |        |        |        |        |            |
 proto    1000  |   16.5 |    5.7 |    1.9 |    1.4 | 3.6   +26% |
 ---------------|--------|--------|--------|--------|------------|
 net,mac        |        |        |        |        |            |
          1000  |   19.0 |    8.4 |    3.9 |    2.5 | 6.4  +156% |
 ---------------'--------'--------'--------'--------'------------'

A similar strategy could be easily reused to implement specialised
versions for other SIMD sets, and I plan to post at least a NEON
version at a later time.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:45 +01:00
Stefano Brivio
8683f4b995 nft_set_pipapo: Prepare for vectorised implementation: helpers
Move most macros and helpers to a header file, so that they can be
conveniently used by related implementations.

No functional changes are intended here.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:44 +01:00
Stefano Brivio
bf3e583923 nft_set_pipapo: Prepare for vectorised implementation: alignment
SIMD vector extension sets require stricter alignment than native
instruction sets to operate efficiently (AVX, NEON) or for some
instructions to work at all (AltiVec).

Provide facilities to define arbitrary alignment for lookup tables
and scratch maps. By defining byte alignment with NFT_PIPAPO_ALIGN,
lt_aligned and scratch_aligned pointers become available.

Additional headroom is allocated, and pointers to the possibly
unaligned, originally allocated areas are kept so that they can
be freed.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:43 +01:00
Stefano Brivio
4051f43116 nft_set_pipapo: Add support for 8-bit lookup groups and dynamic switch
While grouping matching bits in groups of four saves memory compared
to the more natural choice of 8-bit words (lookup table size is one
eighth), it comes at a performance cost, as the number of lookup
comparisons is doubled, and those also needs bitshifts and masking.

Introduce support for 8-bit lookup groups, together with a mapping
mechanism to dynamically switch, based on defined per-table size
thresholds and hysteresis, between 8-bit and 4-bit groups, as tables
grow and shrink. Empty sets start with 8-bit groups, and per-field
tables are converted to 4-bit groups if they get too big.

An alternative approach would have been to swap per-set lookup
operation functions as needed, but this doesn't allow for different
group sizes in the same set, which looks desirable if some fields
need significantly more matching data compared to others due to
heavier impact of ranges (e.g. a big number of subnets with
relatively simple port specifications).

Allowing different group sizes for the same lookup functions implies
the need for further conditional clauses, whose cost, however,
appears to be negligible in tests.

The matching rate figures below were obtained for x86_64 running
the nft_concat_range.sh "performance" cases, averaged over five
runs, on a single thread of an AMD Epyc 7402 CPU, and for aarch64
on a single thread of a BCM2711 (Raspberry Pi 4 Model B 4GB),
clocked at a stable 2147MHz frequency:

---------------.-----------------------------------.------------.
AMD Epyc 7402  |          baselines, Mpps          | this patch |
 1 thread      |___________________________________|____________|
 3.35GHz       |        |        |        |        |            |
 768KiB L1D$   | netdev |  hash  | rbtree |        |            |
---------------|  hook  |   no   | single | pipapo |   pipapo   |
type   entries |  drop  | ranges | field  | 4 bits | bit switch |
---------------|--------|--------|--------|--------|------------|
net,port       |        |        |        |        |            |
         1000  |   19.0 |   10.4 |    3.8 |    2.8 | 4.0   +43% |
---------------|--------|--------|--------|--------|------------|
port,net       |        |        |        |        |            |
          100  |   18.8 |   10.3 |    5.8 |    5.5 | 6.3   +14% |
---------------|--------|--------|--------|--------|------------|
net6,port      |        |        |        |        |            |
         1000  |   16.4 |    7.6 |    1.8 |    1.3 | 2.1   +61% |
---------------|--------|--------|--------|--------|------------|
port,proto     |        |        |        |        |     [1]    |
        30000  |   19.6 |   11.6 |    3.9 |    0.3 | 0.5   +66% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac  |        |        |        |        |            |
           10  |   16.5 |    5.4 |    4.3 |    2.6 | 3.4   +31% |
---------------|--------|--------|--------|--------|------------|
net6,port,mac, |        |        |        |        |            |
proto    1000  |   16.5 |    5.7 |    1.9 |    1.0 | 1.4   +40% |
---------------|--------|--------|--------|--------|------------|
net,mac        |        |        |        |        |            |
         1000  |   19.0 |    8.4 |    3.9 |    1.7 | 2.5   +47% |
---------------'--------'--------'--------'--------'------------'
[1] Causes switch of lookup table buckets for 'port', not 'proto',
    to 4-bit groups

 ---------------.-----------------------------------.------------.
 BCM2711        |          baselines, Mpps          | this patch |
  1 thread      |___________________________________|____________|
  2147MHz       |        |        |        |        |            |
  32KiB L1D$    | netdev |  hash  | rbtree |        |            |
 ---------------|  hook  |   no   | single | pipapo |   pipapo   |
 type   entries |  drop  | ranges | field  | 4 bits | bit switch |
 ---------------|--------|--------|--------|--------|------------|
 net,port       |        |        |        |        |            |
          1000  |   1.63 |   1.37 |   0.87 |   0.61 | 0.70  +17% |
 ---------------|--------|--------|--------|--------|------------|
 port,net       |        |        |        |        |            |
           100  |   1.64 |   1.36 |   1.02 |   0.78 | 0.81   +4% |
 ---------------|--------|--------|--------|--------|------------|
 net6,port      |        |        |        |        |            |
          1000  |   1.56 |   1.27 |   0.65 |   0.34 | 0.50  +47% |
 ---------------|--------|--------|--------|--------|------------|
 port,proto [2] |        |        |        |        |            |
         10000  |   1.68 |   1.43 |   0.84 |   0.30 | 0.40  +13% |
 ---------------|--------|--------|--------|--------|------------|
 net6,port,mac  |        |        |        |        |            |
            10  |   1.56 |   1.14 |   1.02 |   0.62 | 0.66   +6% |
 ---------------|--------|--------|--------|--------|------------|
 net6,port,mac, |        |        |        |        |            |
 proto    1000  |   1.56 |   1.12 |   0.64 |   0.27 | 0.40  +48% |
 ---------------|--------|--------|--------|--------|------------|
 net,mac        |        |        |        |        |            |
          1000  |   1.63 |   1.26 |   0.87 |   0.41 | 0.53  +29% |
 ---------------'--------'--------'--------'--------'------------'
[2] Using 10000 entries instead of 30000 as it would take way too
    long for the test script to generate all of them

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:43 +01:00
Stefano Brivio
e807b13cb3 nft_set_pipapo: Generalise group size for buckets
Get rid of all hardcoded assumptions that buckets in lookup tables
correspond to four-bit groups, and replace them with appropriate
calculations based on a variable group size, now stored in struct
field.

The group size could now be in principle any divisor of eight. Note,
though, that lookup and get functions need an implementation
intimately depending on the group size, and the only supported size
there, currently, is four bits, which is also the initial and only
used size at the moment.

While at it, drop 'groups' from struct nft_pipapo: it was never used.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:42 +01:00
wenxu
88bf6e4114 netfilter: flowtable: add tunnel encap/decap action offload support
This patch add tunnel encap decap action offload in the flowtable
offload.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:23 +01:00
wenxu
cfab6dbd0e netfilter: flowtable: add tunnel match offload support
This patch support both ipv4 and ipv6 tunnel_id, tunnel_src and
tunnel_dst match for flowtable offload

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:26:17 +01:00
wenxu
b5140a36da netfilter: flowtable: add indr block setup support
Add etfilter flowtable support indr-block setup. It makes flowtable offload
vlan and tunnel device.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:22:50 +01:00
wenxu
4679877921 netfilter: flowtable: add nf_flow_table_block_offload_init()
Add nf_flow_table_block_offload_init prepare for the indr block
offload patch

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:22:32 +01:00
Dan Carpenter
f628c27d85 netfilter: xt_IDLETIMER: clean up some indenting
These lines were indented wrong so Smatch complained.
net/netfilter/xt_IDLETIMER.c:81 idletimer_tg_show() warn: inconsistent indenting

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:20:17 +01:00
Jeremy Sowden
049dee95f8 netfilter: bitwise: use more descriptive variable-names.
Name the mask and xor data variables, "mask" and "xor," instead of "d1"
and "d2."

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:20:16 +01:00
Gustavo A. R. Silva
6daf141401 netfilter: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

Lastly, fix checkpatch.pl warning
WARNING: __aligned(size) is preferred over __attribute__((aligned(size)))
in net/bridge/netfilter/ebtables.c

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:20:16 +01:00
Chen Wandun
eb9d7af3b7 netfilter: nft_set_pipapo: make the symbol 'nft_pipapo_get' static
Fix the following sparse warning:

net/netfilter/nft_set_pipapo.c:739:6: warning: symbol 'nft_pipapo_get' was not declared. Should it be static?

Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Acked-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:20:16 +01:00
Li RongQing
9325f070f7 netfilter: cleanup unused macro
TEMPLATE_NULLS_VAL is not used after commit 0838aa7fcf
("netfilter: fix netns dependencies with conntrack templates")

PFX is not used after commit 8bee4bad03 ("netfilter: xt
extensions: use pr_<level>")

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:20:16 +01:00
Florian Westphal
24d19826fc netfilter: nf_tables: make all set structs const
They do not need to be writeable anymore.

v2: remove left-over __read_mostly annotation in set_pipapo.c (Stefano)

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:20:16 +01:00
Florian Westphal
e32a4dc651 netfilter: nf_tables: make sets built-in
Placing nftables set support in an extra module is pointless:

1. nf_tables needs dynamic registeration interface for sake of one module
2. nft heavily relies on sets, e.g. even simple rule like
   "nft ... tcp dport { 80, 443 }" will not work with _SETS=n.

IOW, either nftables isn't used or both nf_tables and nf_tables_set
modules are needed anyway.

With extra module:
 307K net/netfilter/nf_tables.ko
  79K net/netfilter/nf_tables_set.ko

   text  data  bss     dec filename
 146416  3072  545  150033 nf_tables.ko
  35496  1817    0   37313 nf_tables_set.ko

This patch:
 373K net/netfilter/nf_tables.ko

 178563  4049  545  183157 nf_tables.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:20:16 +01:00
Xin Long
925d844696 netfilter: nft_tunnel: add support for geneve opts
Like vxlan and erspan opts, geneve opts should also be supported in
nft_tunnel. The difference is geneve RFC (draft-ietf-nvo3-geneve-14)
allows a geneve packet to carry multiple geneve opts. So with this
patch, nftables/libnftnl would do:

  # nft add table ip filter
  # nft add chain ip filter input { type filter hook input priority 0 \; }
  # nft add tunnel filter geneve_02 { type geneve\; id 2\; \
    ip saddr 192.168.1.1\; ip daddr 192.168.1.2\; \
    sport 9000\; dport 9001\; dscp 1234\; ttl 64\; flags 1\; \
    opts \"1:1:34567890,2:2:12121212,3:3:1212121234567890\"\; }
  # nft list tunnels table filter
    table ip filter {
    	tunnel geneve_02 {
    		id 2
    		ip saddr 192.168.1.1
    		ip daddr 192.168.1.2
    		sport 9000
    		dport 9001
    		tos 18
    		ttl 64
    		flags 1
    		geneve opts 1:1:34567890,2:2:12121212,3:3:1212121234567890
    	}
    }

v1->v2:
  - no changes, just post it separately.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:20:16 +01:00
Manoj Basapathi
68983a354a netfilter: xtables: Add snapshot of hardidletimer target
This is a snapshot of hardidletimer netfilter target.

This patch implements a hardidletimer Xtables target that can be
used to identify when interfaces have been idle for a certain period
of time.

Timers are identified by labels and are created when a rule is set
with a new label. The rules also take a timeout value (in seconds) as
an option. If more than one rule uses the same timer label, the timer
will be restarted whenever any of the rules get a hit.

One entry for each timer is created in sysfs. This attribute contains
the timer remaining for the timer to expire. The attributes are
located under the xt_idletimer class:

/sys/class/xt_idletimer/timers/<label>

When the timer expires, the target module sends a sysfs notification
to the userspace, which can then decide what to do (eg. disconnect to
save power)

Compared to IDLETIMER, HARDIDLETIMER can send notifications when
CPU is in suspend too, to notify the timer expiry.

v1->v2: Moved all functionality into IDLETIMER module to avoid
code duplication per comment from Florian.

Signed-off-by: Manoj Basapathi <manojbm@codeaurora.org>
Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:20:16 +01:00
Paul Blakey
c3c831b0a2 netfilter: flowtable: Use nf_flow_offload_tuple for stats as well
This patch doesn't change any functionality.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:20:15 +01:00
Willem de Bruijn
61fad6816f net/packet: tpacket_rcv: avoid a producer race condition
PACKET_RX_RING can cause multiple writers to access the same slot if a
fast writer wraps the ring while a slow writer is still copying. This
is particularly likely with few, large, slots (e.g., GSO packets).

Synchronize kernel thread ownership of rx ring slots with a bitmap.

Writers acquire a slot race-free by testing tp_status TP_STATUS_KERNEL
while holding the sk receive queue lock. They release this lock before
copying and set tp_status to TP_STATUS_USER to release to userspace
when done. During copying, another writer may take the lock, also see
TP_STATUS_KERNEL, and start writing to the same slot.

Introduce a new rx_owner_map bitmap with a bit per slot. To acquire a
slot, test and set with the lock held. To release race-free, update
tp_status and owner bit as a transaction, so take the lock again.

This is the one of a variety of discussed options (see Link below):

* instead of a shadow ring, embed the data in the slot itself, such as
in tp_padding. But any test for this field may match a value left by
userspace, causing deadlock.

* avoid the lock on release. This leaves a small race if releasing the
shadow slot before setting TP_STATUS_USER. The below reproducer showed
that this race is not academic. If releasing the slot after tp_status,
the race is more subtle. See the first link for details.

* add a new tp_status TP_KERNEL_OWNED to avoid the transactional store
of two fields. But, legacy applications may interpret all non-zero
tp_status as owned by the user. As libpcap does. So this is possible
only opt-in by newer processes. It can be added as an optional mode.

* embed the struct at the tail of pg_vec to avoid extra allocation.
The implementation proved no less complex than a separate field.

The additional locking cost on release adds contention, no different
than scaling on multicore or multiqueue h/w. In practice, below
reproducer nor small packet tcpdump showed a noticeable change in
perf report in cycles spent in spinlock. Where contention is
problematic, packet sockets support mitigation through PACKET_FANOUT.
And we can consider adding opt-in state TP_KERNEL_OWNED.

Easy to reproduce by running multiple netperf or similar TCP_STREAM
flows concurrently with `tcpdump -B 129 -n greater 60000`.

Based on an earlier patchset by Jon Rosen. See links below.

I believe this issue goes back to the introduction of tpacket_rcv,
which predates git history.

Link: https://www.mail-archive.com/netdev@vger.kernel.org/msg237222.html
Suggested-by: Jon Rosen <jrosen@cisco.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jon Rosen <jrosen@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15 00:25:25 -07:00
Paolo Abeni
dc093db5cc mptcp: drop unneeded checks
After the previous patch subflow->conn is always != NULL and
is never changed. We can drop a bunch of now unneeded checks.

v1 -> v2:
 - rebased on top of commit 2398e3991b ("mptcp: always
   include dack if possible.")

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15 00:19:03 -07:00
Paolo Abeni
58b0991962 mptcp: create msk early
This change moves the mptcp socket allocation from mptcp_accept() to
subflow_syn_recv_sock(), so that subflow->conn is now always set
for the non fallback scenario.

It allows cleaning up a bit mptcp_accept() reducing the additional
locking and will allow fourther cleanup in the next patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15 00:19:03 -07:00
Petr Machata
e1f8f78ffe net: ip_gre: Separate ERSPAN newlink / changelink callbacks
ERSPAN shares most of the code path with GRE and gretap code. While that
helps keep the code compact, it is also error prone. Currently a broken
userspace can turn a gretap tunnel into a de facto ERSPAN one by passing
IFLA_GRE_ERSPAN_VER. There has been a similar issue in ip6gretap in the
past.

To prevent these problems in future, split the newlink and changelink code
paths. Split the ERSPAN code out of ipgre_netlink_parms() into a new
function erspan_netlink_parms(). Extract a piece of common logic from
ipgre_newlink() and ipgre_changelink() into ipgre_newlink_encap_setup().
Add erspan_newlink() and erspan_changelink().

Fixes: 84e54fe0a5 ("gre: introduce native tunnel support for ERSPAN")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15 00:14:08 -07:00
Hoang Le
746a1eda68 tipc: add NULL pointer check to prevent kernel oops
Calling:
tipc_node_link_down()->
   - tipc_node_write_unlock()->tipc_mon_peer_down()
   - tipc_mon_peer_down()
  just after disabling bearer could be caused kernel oops.

Fix this by adding a sanity check to make sure valid memory
access.

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15 00:07:00 -07:00
Hoang Le
e228c5c088 tipc: simplify trivial boolean return
Checking and returning 'true' boolean is useless as it will be
returning at end of function

Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-15 00:07:00 -07:00
Petr Machata
0a7fad2376 net: sched: RED: Introduce an ECN nodrop mode
When the RED Qdisc is currently configured to enable ECN, the RED algorithm
is used to decide whether a certain SKB should be marked. If that SKB is
not ECN-capable, it is early-dropped.

It is also possible to keep all traffic in the queue, and just mark the
ECN-capable subset of it, as appropriate under the RED algorithm. Some
switches support this mode, and some installations make use of it.

To that end, add a new RED flag, TC_RED_NODROP. When the Qdisc is
configured with this flag, non-ECT traffic is enqueued instead of being
early-dropped.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14 21:03:46 -07:00
Petr Machata
14bc175d9c net: sched: Allow extending set of supported RED flags
The qdiscs RED, GRED, SFQ and CHOKE use different subsets of the same pool
of global RED flags. These are passed in tc_red_qopt.flags. However none of
these qdiscs validate the flag field, and just copy it over wholesale to
internal structures, and later dump it back. (An exception is GRED, which
does validate for VQs -- however not for the main setup.)

A broken userspace can therefore configure a qdisc with arbitrary
unsupported flags, and later expect to see the flags on qdisc dump. The
current ABI therefore allows storage of several bits of custom data to
qdisc instances of the types mentioned above. How many bits, depends on
which flags are meaningful for the qdisc in question. E.g. SFQ recognizes
flags ECN and HARDDROP, and the rest is not interpreted.

If SFQ ever needs to support ADAPTATIVE, it needs another way of doing it,
and at the same time it needs to retain the possibility to store 6 bits of
uninterpreted data. Likewise RED, which adds a new flag later in this
patchset.

To that end, this patch adds a new function, red_get_flags(), to split the
passed flags of RED-like qdiscs to flags and user bits, and
red_validate_flags() to validate the resulting configuration. It further
adds a new attribute, TCA_RED_FLAGS, to pass arbitrary flags.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14 21:03:46 -07:00
Bruno Meneguele
13d0f7b814 net/bpfilter: fix dprintf usage for /dev/kmsg
The bpfilter UMH code was recently changed to log its informative messages to
/dev/kmsg, however this interface doesn't support SEEK_CUR yet, used by
dprintf(). As result dprintf() returns -EINVAL and doesn't log anything.

However there already had some discussions about supporting SEEK_CUR into
/dev/kmsg interface in the past it wasn't concluded. Since the only user of
that from userspace perspective inside the kernel is the bpfilter UMH
(userspace) module it's better to correct it here instead waiting a conclusion
on the interface.

Fixes: 36c4357c63 ("net: bpfilter: print umh messages to /dev/kmsg")
Signed-off-by: Bruno Meneguele <bmeneg@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14 20:58:10 -07:00
Cong Wang
0d1c3530e1 net_sched: keep alloc_hash updated after hash allocation
In commit 599be01ee5 ("net_sched: fix an OOB access in cls_tcindex")
I moved cp->hash calculation before the first
tcindex_alloc_perfect_hash(), but cp->alloc_hash is left untouched.
This difference could lead to another out of bound access.

cp->alloc_hash should always be the size allocated, we should
update it after this tcindex_alloc_perfect_hash().

Reported-and-tested-by: syzbot+dcc34d54d68ef7d2d53d@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+c72da7b9ed57cde6fca2@syzkaller.appspotmail.com
Fixes: 599be01ee5 ("net_sched: fix an OOB access in cls_tcindex")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14 20:42:29 -07:00
Cong Wang
b1be2e8cd2 net_sched: hold rtnl lock in tcindex_partial_destroy_work()
syzbot reported a use-after-free in tcindex_dump(). This is due to
the lack of RTNL in the deferred rcu work. We queue this work with
RTNL in tcindex_change(), later, tcindex_dump() is called:

        fh = tp->ops->get(tp, t->tcm_handle);
	...
        err = tp->ops->change(..., &fh, ...);
        tfilter_notify(..., fh, ...);

but there is nothing to serialize the pending
tcindex_partial_destroy_work() with tcindex_dump().

Fix this by simply holding RTNL in tcindex_partial_destroy_work(),
so that it won't be called until RTNL is released after
tc_new_tfilter() is completed.

Reported-and-tested-by: syzbot+653090db2562495901dc@syzkaller.appspotmail.com
Fixes: 3d210534cc ("net_sched: fix a race condition in tcindex_destroy()")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14 20:41:17 -07:00
YueHaibing
965995b7d7 Bluetooth: L2CAP: remove set but not used variable 'credits'
net/bluetooth/l2cap_core.c: In function l2cap_ecred_conn_req:
net/bluetooth/l2cap_core.c:5848:6: warning: variable credits set but not used [-Wunused-but-set-variable]

commit 15f02b9105 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode")
involved this unused variable, remove it.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-14 19:49:28 +01:00
David S. Miller
44ef976ab3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-03-13

The following pull-request contains BPF updates for your *net-next* tree.

We've added 86 non-merge commits during the last 12 day(s) which contain
a total of 107 files changed, 5771 insertions(+), 1700 deletions(-).

The main changes are:

1) Add modify_return attach type which allows to attach to a function via
   BPF trampoline and is run after the fentry and before the fexit programs
   and can pass a return code to the original caller, from KP Singh.

2) Generalize BPF's kallsyms handling and add BPF trampoline and dispatcher
   objects to be visible in /proc/kallsyms so they can be annotated in
   stack traces, from Jiri Olsa.

3) Extend BPF sockmap to allow for UDP next to existing TCP support in order
   in order to enable this for BPF based socket dispatch, from Lorenz Bauer.

4) Introduce a new bpftool 'prog profile' command which attaches to existing
   BPF programs via fentry and fexit hooks and reads out hardware counters
   during that period, from Song Liu. Example usage:

   bpftool prog profile id 337 duration 3 cycles instructions llc_misses

        4228 run_cnt
     3403698 cycles                                              (84.08%)
     3525294 instructions   #  1.04 insn per cycle               (84.05%)
          13 llc_misses     #  3.69 LLC misses per million isns  (83.50%)

5) Batch of improvements to libbpf, bpftool and BPF selftests. Also addition
   of a new bpf_link abstraction to keep in particular BPF tracing programs
   attached even when the applicaion owning them exits, from Andrii Nakryiko.

6) New bpf_get_current_pid_tgid() helper for tracing to perform PID filtering
   and which returns the PID as seen by the init namespace, from Carlos Neira.

7) Refactor of RISC-V JIT code to move out common pieces and addition of a
   new RV32G BPF JIT compiler, from Luke Nelson.

8) Add gso_size context member to __sk_buff in order to be able to know whether
   a given skb is GSO or not, from Willem de Bruijn.

9) Add a new bpf_xdp_output() helper which reuses XDP's existing perf RB output
   implementation but can be called from tracepoint programs, from Eelco Chaudron.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-13 20:52:03 -07:00
David Howells
7d7587db0d afs: Fix client call Rx-phase signal handling
Fix the handling of signals in client rxrpc calls made by the afs
filesystem.  Ignore signals completely, leaving call abandonment or
connection loss to be detected by timeouts inside AF_RXRPC.

Allowing a filesystem call to be interrupted after the entire request has
been transmitted and an abort sent means that the server may or may not
have done the action - and we don't know.  It may even be worse than that
for older servers.

Fixes: bc5e3a546d ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-13 23:04:35 +00:00
David Howells
498b577660 rxrpc: Fix sendmsg(MSG_WAITALL) handling
Fix the handling of sendmsg() with MSG_WAITALL for userspace to round the
timeout for when a signal occurs up to at least two jiffies as a 1 jiffy
timeout may end up being effectively 0 if jiffies wraps at the wrong time.

Fixes: bc5e3a546d ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-13 23:04:34 +00:00
David Howells
e138aa7d32 rxrpc: Fix call interruptibility handling
Fix the interruptibility of kernel-initiated client calls so that they're
either only interruptible when they're waiting for a call slot to come
available or they're not interruptible at all.  Either way, they're not
interruptible during transmission.

This should help prevent StoreData calls from being interrupted when
writeback is in progress.  It doesn't, however, handle interruption during
the receive phase.

Userspace-initiated calls are still interruptable.  After the signal has
been handled, sendmsg() will return the amount of data copied out of the
buffer and userspace can perform another sendmsg() call to continue
transmission.

Fixes: bc5e3a546d ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-13 23:04:30 +00:00
David Howells
158fe66653 rxrpc: Abstract out the calculation of whether there's Tx space
Abstract out the calculation of there being sufficient Tx buffer space.
This is reproduced several times in the rxrpc sendmsg code.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-13 23:04:28 +00:00
Björn Töpel
6a64037d4b bpf: Add bpf_trampoline_ name prefix for DECLARE_BPF_DISPATCHER
Adding bpf_trampoline_ name prefix for DECLARE_BPF_DISPATCHER,
so all the dispatchers have the common name prefix.

And also a small '_' cleanup for bpf_dispatcher_nopfunc function
name.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Jiri Olsa <jolsa@kernel.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200312195610.346362-3-jolsa@kernel.org
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2020-03-13 12:49:51 -07:00
Colin Ian King
5ec82c49a2 ethtool: fix spelling mistake "exceeeds" -> "exceeds"
There are a couple of spelling mistakes in NL_SET_ERR_MSG_ATTR messages.
Fix these.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-13 11:23:33 -07:00
David S. Miller
242a6df688 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-03-12

The following pull-request contains BPF updates for your *net* tree.

We've added 12 non-merge commits during the last 8 day(s) which contain
a total of 12 files changed, 161 insertions(+), 15 deletions(-).

The main changes are:

1) Andrii fixed two bugs in cgroup-bpf.

2) John fixed sockmap.

3) Luke fixed x32 jit.

4) Martin fixed two issues in struct_ops.

5) Yonghong fixed bpf_send_signal.

6) Yoshiki fixed BTF enum.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-13 11:13:45 -07:00
David S. Miller
1d34357931 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor overlapping changes, nothing serious.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 22:34:48 -07:00
Eelco Chaudron
d831ee84bf bpf: Add bpf_xdp_output() helper
Introduce new helper that reuses existing xdp perf_event output
implementation, but can be called from raw_tracepoint programs
that receive 'struct xdp_buff *' as a tracepoint argument.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158348514556.2239.11050972434793741444.stgit@xdp-tutorial
2020-03-12 17:47:38 -07:00
Joe Perches
a8eceea84a inet: Use fallthrough;
Convert the various uses of fallthrough comments to fallthrough;

Done via script
Link: https://lore.kernel.org/lkml/b56602fcf79f849e733e7b521bb0e17895d390fa.1582230379.git.joe@perches.com/

And by hand:

net/ipv6/ip6_fib.c has a fallthrough comment outside of an #ifdef block
that causes gcc to emit a warning if converted in-place.

So move the new fallthrough; inside the containing #ifdef/#endif too.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:55:00 -07:00
Michal Kubecek
546379b9a0 ethtool: add CHANNELS_NTF notification
Send ETHTOOL_MSG_CHANNELS_NTF notification whenever channel counts of
a network device are modified using ETHTOOL_MSG_CHANNELS_SET netlink
message or ETHTOOL_SCHANNELS ioctl request.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:32:33 -07:00
Michal Kubecek
e19c591eaf ethtool: set device channel counts with CHANNELS_SET request
Implement CHANNELS_SET netlink request to set channel counts of a network
device. These are traditionally set with ETHTOOL_SCHANNELS ioctl request.

Like the ioctl implementation, the generic ethtool code checks if supplied
values do not exceed driver defined limits; if they do, first offending
attribute is reported using extack. Checks preventing removing channels
used for RX indirection table or zerocopy AF_XDP socket are also
implemented.

Move ethtool_get_max_rxfh_channel() helper into common.c so that it can be
used by both ioctl and netlink code.

v2:
  - fix netdev reference leak in error path (found by Jakub Kicinsky)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:32:33 -07:00
Michal Kubecek
0c84979c95 ethtool: provide channel counts with CHANNELS_GET request
Implement CHANNELS_GET request to get channel counts of a network device.
These are traditionally available via ETHTOOL_GCHANNELS ioctl request.

Omit attributes for channel types which are not supported by driver or
device (zero reported for maximum).

v2: (all suggested by Jakub Kicinski)
  - minor cleanup in channels_prepare_data()
  - more descriptive channels_reply_size()
  - omit attributes with zero max count

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:32:33 -07:00
Michal Kubecek
bc9d1c995e ethtool: add RINGS_NTF notification
Send ETHTOOL_MSG_RINGS_NTF notification whenever ring sizes of a network
device are modified using ETHTOOL_MSG_RINGS_SET netlink message or
ETHTOOL_SRINGPARAM ioctl request.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:32:33 -07:00
Michal Kubecek
2fc2929e80 ethtool: set device ring sizes with RINGS_SET request
Implement RINGS_SET netlink request to set ring sizes of a network device.
These are traditionally set with ETHTOOL_SRINGPARAM ioctl request.

Like the ioctl implementation, the generic ethtool code checks if supplied
values do not exceed driver defined limits; if they do, first offending
attribute is reported using extack.

v2:
  - fix netdev reference leak in error path (found by Jakub Kicinsky)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:32:33 -07:00
Michal Kubecek
e4a1717b67 ethtool: provide ring sizes with RINGS_GET request
Implement RINGS_GET request to get ring sizes of a network device. These
are traditionally available via ETHTOOL_GRINGPARAM ioctl request.

Omit attributes for ring types which are not supported by driver or device
(zero reported for maximum).

v2: (all suggested by Jakub Kicinski)
  - minor cleanup in rings_prepare_data()
  - more descriptive rings_reply_size()
  - omit attributes with zero max size

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:32:33 -07:00
Michal Kubecek
111dcba3c6 ethtool: add PRIVFLAGS_NTF notification
Send ETHTOOL_MSG_PRIVFLAGS_NTF notification whenever private flags of
a network device are modified using ETHTOOL_MSG_PRIVFLAGS_SET netlink
message or ETHTOOL_SPFLAGS ioctl request.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:32:33 -07:00
Michal Kubecek
f265d79959 ethtool: set device private flags with PRIVFLAGS_SET request
Implement PRIVFLAGS_SET netlink request to set private flags of a network
device. These are traditionally set with ETHTOOL_SPFLAGS ioctl request.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:32:33 -07:00
Michal Kubecek
e16c3386fc ethtool: provide private flags with PRIVFLAGS_GET request
Implement PRIVFLAGS_GET request to get private flags for a network device.
These are traditionally available via ETHTOOL_GPFLAGS ioctl request.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:32:33 -07:00
Michal Kubecek
9c6451ef48 ethtool: add FEATURES_NTF notification
Send ETHTOOL_MSG_FEATURES_NTF notification whenever network device features
are modified using ETHTOOL_MSG_FEATURES_SET netlink message, ethtool ioctl
request or any other way resulting in call to netdev_update_features() or
netdev_change_features()

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:32:33 -07:00
Michal Kubecek
0980bfcd69 ethtool: set netdev features with FEATURES_SET request
Implement FEATURES_SET netlink request to set network device features.
These are traditionally set using ETHTOOL_SFEATURES ioctl request.

Actual change is subject to netdev_change_features() sanity checks so that
it can differ from what was requested. Unlike with most other SET requests,
in addition to error code and optional extack, kernel provides an optional
reply message (ETHTOOL_MSG_FEATURES_SET_REPLY) in the same format but with
different semantics: information about difference between user request and
actual result and difference between old and new state of dev->features.
This reply message can be suppressed by setting ETHTOOL_FLAG_OMIT_REPLY
flag in request header.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:32:32 -07:00
Michal Kubecek
88db6d1e4f ethtool: add ethnl_parse_bitset() helper
Unlike other SET type commands, modifying netdev features is required to
provide a reply telling userspace what was actually changed, compared to
what was requested. For that purpose, the "modified" flag provided by
ethnl_update_bitset() is not sufficient, we need full information which
bits were requested to change.

Therefore provide ethnl_parse_bitset() returning effective value and mask
bitmaps equivalent to the contents of a bitset nested attribute.

v2: use non-atomic __set_bit() (suggested by David Miller)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:32:32 -07:00
Michal Kubecek
0524399d46 ethtool: provide netdev features with FEATURES_GET request
Implement FEATURES_GET request to get network device features. These are
traditionally available via ETHTOOL_GFEATURES ioctl request.

v2:
  - style cleanup suggested by Jakub Kicinski

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:32:32 -07:00
Michal Kubecek
f70bb06563 ethtool: update mapping of features to legacy ioctl requests
Legacy ioctl request like ETHTOOL_GTXCSUM are still used by ethtool utility
to get values of legacy flags (which rather work as feature groups). These
are calculated from values of actual features and request to set them is
implemented as an attempt to set all features mapping to them but there are
two inconsistencies:

- tx-checksum-fcoe-crc is shown under tx-checksumming but NETIF_F_FCOE_CRC
  is not included in ETHTOOL_GTXCSUM/ETHTOOL_STXCSUM
- tx-scatter-gather-fraglist is shown under scatter-gather but
  NETIF_F_FRAGLIST is not included in ETHTOOL_GSG/ETHTOOL_SSG

As the mapping in ethtool output is more correct from logical point of
view, fix ethtool_get_feature_mask() to match it.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:32:32 -07:00
Michal Kubecek
98130546da ethtool: rename ethnl_parse_header() to ethnl_parse_header_dev_get()
Andrew Lunn pointed out that even if it's documented that
ethnl_parse_header() takes reference to network device if it fills it
into the target structure, its name doesn't make it apparent so that
corresponding dev_put() looks like mismatched.

Rename the function ethnl_parse_header_dev_get() to indicate that it
takes a reference.

Suggested-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:32:32 -07:00
Paul Blakey
ee1c45e875 flow_offload: Add flow_match_ct to get rule ct match
Add relevant getter for ct info dissector.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:00:39 -07:00
Paul Blakey
edd5861e59 net/sched: act_ct: Enable hardware offload of flow table entires
Pass the zone's flow table instance on the flow action to the drivers.
Thus, allowing drivers to register FT add/del/stats callbacks.

Finally, enable hardware offload on the flow table instance.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:00:39 -07:00
Paul Blakey
8b3646d6e0 net/sched: act_ct: Support refreshing the flow table entries
If driver deleted an FT entry, a FT failed to offload, or registered to the
flow table after flows were already added, we still get packets in
software.

For those packets, while restoring the ct state from the flow table
entry, refresh it's hardware offload.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:00:39 -07:00
Paul Blakey
30b0cf90c6 net/sched: act_ct: Support restoring conntrack info on skbs
Provide an API to restore the ct state pointer.

This may be used by drivers to restore the ct state if they
miss in tc chain after they already did the hardware connection
tracking action (ct_metadata action).

For example, consider the following rule on chain 0 that is in_hw,
however chain 1 is not_in_hw:

$ tc filter add dev ... chain 0 ... \
  flower ... action ct pipe action goto chain 1

Packets of a flow offloaded (via nf flow table offload) by the driver
hit this rule in hardware, will be marked with the ct metadata action
(mark, label, zone) that does the equivalent of the software ct action,
and when the packet jumps to hardware chain 1, there would be a miss.

CT was already processed in hardware. Therefore, the driver's miss
handling should restore the ct state on the skb, using the provided API,
and continue the packet processing in chain 1.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:00:38 -07:00
Paul Blakey
9c26ba9b1f net/sched: act_ct: Instantiate flow table entry actions
NF flow table API associate 5-tuple rule with an action list by calling
the flow table type action() CB to fill the rule's actions.

In action CB of act_ct, populate the ct offload entry actions with a new
ct_metadata action. Initialize the ct_metadata with the ct mark, label and
zone information. If ct nat was performed, then also append the relevant
packet mangle actions (e.g. ipv4/ipv6/tcp/udp header rewrites).

Drivers that offload the ft entries may match on the 5-tuple and perform
the action list.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:00:38 -07:00
Paul Blakey
978703f425 netfilter: flowtable: Add API for registering to flow table events
Let drivers to add their cb allowing them to receive flow offload events
of type TC_SETUP_CLSFLOWER (REPLACE/DEL/STATS) for flows managed by the
flow table.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:00:38 -07:00
David S. Miller
bf3347c4d1 Merge branch 'ct-offload' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux 2020-03-12 12:34:23 -07:00
Kuniyuki Iwashima
335759211a tcp: Forbid to bind more than one sockets haveing SO_REUSEADDR and SO_REUSEPORT per EUID.
If there is no TCP_LISTEN socket on a ephemeral port, we can bind multiple
sockets having SO_REUSEADDR to the same port. Then if all sockets bound to
the port have also SO_REUSEPORT enabled and have the same EUID, all of them
can be listened. This is not safe.

Let's say, an application has root privilege and binds sockets to an
ephemeral port with both of SO_REUSEADDR and SO_REUSEPORT. When none of
sockets is not listened yet, a malicious user can use sudo, exhaust
ephemeral ports, and bind sockets to the same ephemeral port, so he or she
can call listen and steal the port.

To prevent this issue, we must not bind more than one sockets that have the
same EUID and both of SO_REUSEADDR and SO_REUSEPORT.

On the other hand, if the sockets have different EUIDs, the issue above does
not occur. After sockets with different EUIDs are bound to the same port and
one of them is listened, no more socket can be listened. This is because the
condition below is evaluated true and listen() for the second socket fails.

			} else if (!reuseport_ok ||
				   !reuseport || !sk2->sk_reuseport ||
				   rcu_access_pointer(sk->sk_reuseport_cb) ||
				   (sk2->sk_state != TCP_TIME_WAIT &&
				    !uid_eq(uid, sock_i_uid(sk2)))) {
				if (inet_rcv_saddr_equal(sk, sk2, true))
					break;
			}

Therefore, on the same port, we cannot do listen() for multiple sockets with
different EUIDs and any other listen syscalls fail, so the problem does not
happen. In this case, we can still call connect() for other sockets that
cannot be listened, so we have to succeed to call bind() in order to fully
utilize 4-tuples.

Summarizing the above, we should be able to bind only one socket having
SO_REUSEADDR and SO_REUSEPORT per EUID.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 12:08:09 -07:00
Kuniyuki Iwashima
4b01a96742 tcp: bind(0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted.
Commit aacd9289af ("tcp: bind() use stronger
condition for bind_conflict") introduced a restriction to forbid to bind
SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
assign ports dispersedly so that we can connect to the same remote host.

The change results in accelerating port depletion so that we fail to bind
sockets to the same local port even if we want to connect to the different
remote hosts.

You can reproduce this issue by following instructions below.

  1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
  2. set SO_REUSEADDR to two sockets.
  3. bind two sockets to (localhost, 0) and the latter fails.

Therefore, when ephemeral ports are exhausted, bind(0) should fallback to
the legacy behaviour to enable the SO_REUSEADDR option and make it possible
to connect to different remote (addr, port) tuples.

This patch allows us to bind SO_REUSEADDR enabled sockets to the same
(addr, port) only when net.ipv4.ip_autobind_reuse is set 1 and all
ephemeral ports are exhausted. This also allows connect() and listen() to
share ports in the following way and may break some applications. So the
ip_autobind_reuse is 0 by default and disables the feature.

  1. setsockopt(sk1, SO_REUSEADDR)
  2. setsockopt(sk2, SO_REUSEADDR)
  3. bind(sk1, saddr, 0)
  4. bind(sk2, saddr, 0)
  5. connect(sk1, daddr)
  6. listen(sk2)

If it is set 1, we can fully utilize the 4-tuples, but we should use
IP_BIND_ADDRESS_NO_PORT for bind()+connect() as possible.

The notable thing is that if all sockets bound to the same port have
both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
ephemeral port and also do listen().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 12:08:09 -07:00
Kuniyuki Iwashima
16f6c2518f tcp: Remove unnecessary conditions in inet_csk_bind_conflict().
When we get an ephemeral port, the relax is false, so the SO_REUSEADDR
conditions may be evaluated twice. We do not need to check the conditions
again.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 12:08:09 -07:00
Vinicius Costa Gomes
b09fe70ef5 taprio: Fix sending packets without dequeueing them
There was a bug that was causing packets to be sent to the driver
without first calling dequeue() on the "child" qdisc. And the KASAN
report below shows that sending a packet without calling dequeue()
leads to bad results.

The problem is that when checking the last qdisc "child" we do not set
the returned skb to NULL, which can cause it to be sent to the driver,
and so after the skb is sent, it may be freed, and in some situations a
reference to it may still be in the child qdisc, because it was never
dequeued.

The crash log looks like this:

[   19.937538] ==================================================================
[   19.938300] BUG: KASAN: use-after-free in taprio_dequeue_soft+0x620/0x780
[   19.938968] Read of size 4 at addr ffff8881128628cc by task swapper/1/0
[   19.939612]
[   19.939772] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc3+ #97
[   19.940397] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qe4
[   19.941523] Call Trace:
[   19.941774]  <IRQ>
[   19.941985]  dump_stack+0x97/0xe0
[   19.942323]  print_address_description.constprop.0+0x3b/0x60
[   19.942884]  ? taprio_dequeue_soft+0x620/0x780
[   19.943325]  ? taprio_dequeue_soft+0x620/0x780
[   19.943767]  __kasan_report.cold+0x1a/0x32
[   19.944173]  ? taprio_dequeue_soft+0x620/0x780
[   19.944612]  kasan_report+0xe/0x20
[   19.944954]  taprio_dequeue_soft+0x620/0x780
[   19.945380]  __qdisc_run+0x164/0x18d0
[   19.945749]  net_tx_action+0x2c4/0x730
[   19.946124]  __do_softirq+0x268/0x7bc
[   19.946491]  irq_exit+0x17d/0x1b0
[   19.946824]  smp_apic_timer_interrupt+0xeb/0x380
[   19.947280]  apic_timer_interrupt+0xf/0x20
[   19.947687]  </IRQ>
[   19.947912] RIP: 0010:default_idle+0x2d/0x2d0
[   19.948345] Code: 00 00 41 56 41 55 65 44 8b 2d 3f 8d 7c 7c 41 54 55 53 0f 1f 44 00 00 e8 b1 b2 c5 fd e9 07 00 3
[   19.950166] RSP: 0018:ffff88811a3efda0 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
[   19.950909] RAX: 0000000080000000 RBX: ffff88811a3a9600 RCX: ffffffff8385327e
[   19.951608] RDX: 1ffff110234752c0 RSI: 0000000000000000 RDI: ffffffff8385262f
[   19.952309] RBP: ffffed10234752c0 R08: 0000000000000001 R09: ffffed10234752c1
[   19.953009] R10: ffffed10234752c0 R11: ffff88811a3a9607 R12: 0000000000000001
[   19.953709] R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
[   19.954408]  ? default_idle_call+0x2e/0x70
[   19.954816]  ? default_idle+0x1f/0x2d0
[   19.955192]  default_idle_call+0x5e/0x70
[   19.955584]  do_idle+0x3d4/0x500
[   19.955909]  ? arch_cpu_idle_exit+0x40/0x40
[   19.956325]  ? _raw_spin_unlock_irqrestore+0x23/0x30
[   19.956829]  ? trace_hardirqs_on+0x30/0x160
[   19.957242]  cpu_startup_entry+0x19/0x20
[   19.957633]  start_secondary+0x2a6/0x380
[   19.958026]  ? set_cpu_sibling_map+0x18b0/0x18b0
[   19.958486]  secondary_startup_64+0xa4/0xb0
[   19.958921]
[   19.959078] Allocated by task 33:
[   19.959412]  save_stack+0x1b/0x80
[   19.959747]  __kasan_kmalloc.constprop.0+0xc2/0xd0
[   19.960222]  kmem_cache_alloc+0xe4/0x230
[   19.960617]  __alloc_skb+0x91/0x510
[   19.960967]  ndisc_alloc_skb+0x133/0x330
[   19.961358]  ndisc_send_ns+0x134/0x810
[   19.961735]  addrconf_dad_work+0xad5/0xf80
[   19.962144]  process_one_work+0x78e/0x13a0
[   19.962551]  worker_thread+0x8f/0xfa0
[   19.962919]  kthread+0x2ba/0x3b0
[   19.963242]  ret_from_fork+0x3a/0x50
[   19.963596]
[   19.963753] Freed by task 33:
[   19.964055]  save_stack+0x1b/0x80
[   19.964386]  __kasan_slab_free+0x12f/0x180
[   19.964830]  kmem_cache_free+0x80/0x290
[   19.965231]  ip6_mc_input+0x38a/0x4d0
[   19.965617]  ipv6_rcv+0x1a4/0x1d0
[   19.965948]  __netif_receive_skb_one_core+0xf2/0x180
[   19.966437]  netif_receive_skb+0x8c/0x3c0
[   19.966846]  br_handle_frame_finish+0x779/0x1310
[   19.967302]  br_handle_frame+0x42a/0x830
[   19.967694]  __netif_receive_skb_core+0xf0e/0x2a90
[   19.968167]  __netif_receive_skb_one_core+0x96/0x180
[   19.968658]  process_backlog+0x198/0x650
[   19.969047]  net_rx_action+0x2fa/0xaa0
[   19.969420]  __do_softirq+0x268/0x7bc
[   19.969785]
[   19.969940] The buggy address belongs to the object at ffff888112862840
[   19.969940]  which belongs to the cache skbuff_head_cache of size 224
[   19.971202] The buggy address is located 140 bytes inside of
[   19.971202]  224-byte region [ffff888112862840, ffff888112862920)
[   19.972344] The buggy address belongs to the page:
[   19.972820] page:ffffea00044a1800 refcount:1 mapcount:0 mapping:ffff88811a2bd1c0 index:0xffff8881128625c0 compo0
[   19.973930] flags: 0x8000000000010200(slab|head)
[   19.974388] raw: 8000000000010200 ffff88811a2ed650 ffff88811a2ed650 ffff88811a2bd1c0
[   19.975151] raw: ffff8881128625c0 0000000000190013 00000001ffffffff 0000000000000000
[   19.975915] page dumped because: kasan: bad access detected
[   19.976461] page_owner tracks the page as allocated
[   19.976946] page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NO)
[   19.978332]  prep_new_page+0x24b/0x330
[   19.978707]  get_page_from_freelist+0x2057/0x2c90
[   19.979170]  __alloc_pages_nodemask+0x218/0x590
[   19.979619]  new_slab+0x9d/0x300
[   19.979948]  ___slab_alloc.constprop.0+0x2f9/0x6f0
[   19.980421]  __slab_alloc.constprop.0+0x30/0x60
[   19.980870]  kmem_cache_alloc+0x201/0x230
[   19.981269]  __alloc_skb+0x91/0x510
[   19.981620]  alloc_skb_with_frags+0x78/0x4a0
[   19.982043]  sock_alloc_send_pskb+0x5eb/0x750
[   19.982476]  unix_stream_sendmsg+0x399/0x7f0
[   19.982904]  sock_sendmsg+0xe2/0x110
[   19.983262]  ____sys_sendmsg+0x4de/0x6d0
[   19.983660]  ___sys_sendmsg+0xe4/0x160
[   19.984032]  __sys_sendmsg+0xab/0x130
[   19.984396]  do_syscall_64+0xe7/0xae0
[   19.984761] page last free stack trace:
[   19.985142]  __free_pages_ok+0x432/0xbc0
[   19.985533]  qlist_free_all+0x56/0xc0
[   19.985907]  quarantine_reduce+0x149/0x170
[   19.986315]  __kasan_kmalloc.constprop.0+0x9e/0xd0
[   19.986791]  kmem_cache_alloc+0xe4/0x230
[   19.987182]  prepare_creds+0x24/0x440
[   19.987548]  do_faccessat+0x80/0x590
[   19.987906]  do_syscall_64+0xe7/0xae0
[   19.988276]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   19.988775]
[   19.988930] Memory state around the buggy address:
[   19.989402]  ffff888112862780: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   19.990111]  ffff888112862800: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[   19.990822] >ffff888112862880: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[   19.991529]                                               ^
[   19.992081]  ffff888112862900: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
[   19.992796]  ffff888112862980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc

Fixes: 5a781ccbd1 ("tc: Add support for configuring the taprio scheduler")
Reported-by: Michael Schmidt <michael.schmidt@eti.uni-siegen.de>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Acked-by: Andre Guedes <andre.guedes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 11:25:08 -07:00
Julian Wiedmann
7c4046b1c5 Revert "net: sched: make newly activated qdiscs visible"
This reverts commit 4cda75275f
from net-next.

Brown bag time.

Michal noticed that this change doesn't work at all when
netif_set_real_num_tx_queues() gets called prior to an initial
dev_activate(), as for instance igb does.

Doing so dies with:

[   40.579142] BUG: kernel NULL pointer dereference, address: 0000000000000400
[   40.586922] #PF: supervisor read access in kernel mode
[   40.592668] #PF: error_code(0x0000) - not-present page
[   40.598405] PGD 0 P4D 0
[   40.601234] Oops: 0000 [#1] PREEMPT SMP PTI
[   40.605909] CPU: 18 PID: 1681 Comm: wickedd Tainted: G            E     5.6.0-rc3-ethnl.50-default #1
[   40.616205] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
[   40.627377] RIP: 0010:qdisc_hash_add.part.22+0x2e/0x90
[   40.633115] Code: 00 55 53 89 f5 48 89 fb e8 2f 9b fb ff 85 c0 74 44 48 8b 43 40 48 8b 08 69 43 38 47 86 c8 61 c1 e8 1c 48 83 e8 80 48 8d 14 c1 <48> 8b 04 c1 48 8d 4b 28 48 89 53 30 48 89 43 28 48 85 c0 48 89 0a
[   40.654080] RSP: 0018:ffffb879864934d8 EFLAGS: 00010203
[   40.659914] RAX: 0000000000000080 RBX: ffffffffb8328d80 RCX: 0000000000000000
[   40.667882] RDX: 0000000000000400 RSI: 0000000000000000 RDI: ffffffffb831faa0
[   40.675849] RBP: 0000000000000000 R08: ffffa0752c8b9088 R09: ffffa0752c8b9208
[   40.683816] R10: 0000000000000006 R11: 0000000000000000 R12: ffffa0752d734000
[   40.691783] R13: 0000000000000008 R14: 0000000000000000 R15: ffffa07113c18000
[   40.699750] FS:  00007f94548e5880(0000) GS:ffffa0752e980000(0000) knlGS:0000000000000000
[   40.708782] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.715189] CR2: 0000000000000400 CR3: 000000082b6ae006 CR4: 00000000001606e0
[   40.723156] Call Trace:
[   40.725888]  dev_qdisc_set_real_num_tx_queues+0x61/0x90
[   40.731725]  netif_set_real_num_tx_queues+0x94/0x1d0
[   40.737286]  __igb_open+0x19a/0x5d0 [igb]
[   40.741767]  __dev_open+0xbb/0x150
[   40.745567]  __dev_change_flags+0x157/0x1a0
[   40.750240]  dev_change_flags+0x23/0x60

[...]

Fixes: 4cda75275f ("net: sched: make newly activated qdiscs visible")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
CC: Michal Kubecek <mkubecek@suse.cz>
CC: Eric Dumazet <edumazet@google.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: Cong Wang <xiyou.wangcong@gmail.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 11:19:24 -07:00
Dan Carpenter
b354e6c10e Bluetooth: L2CAP: Fix a condition in l2cap_sock_recvmsg()
Smatch complains about the indenting:

    net/bluetooth/l2cap_sock.c:1027 l2cap_sock_recvmsg()
    warn: inconsistent indenting

It looks like this is supposed to be an "else if" condition.

Fixes: 15f02b9105 ("Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-12 15:12:00 +01:00
Howard Chung
96298f6401 Bluetooth: L2CAP: handle l2cap config request during open state
According to Core Spec Version 5.2 | Vol 3, Part A 6.1.5,
the incoming L2CAP_ConfigReq should be handled during
OPEN state.

The section below shows the btmon trace when running
L2CAP/COS/CFD/BV-12-C before and after this change.

=== Before ===
...
> ACL Data RX: Handle 256 flags 0x02 dlen 12                #22
      L2CAP: Connection Request (0x02) ident 2 len 4
        PSM: 1 (0x0001)
        Source CID: 65
< ACL Data TX: Handle 256 flags 0x00 dlen 16                #23
      L2CAP: Connection Response (0x03) ident 2 len 8
        Destination CID: 64
        Source CID: 65
        Result: Connection successful (0x0000)
        Status: No further information available (0x0000)
< ACL Data TX: Handle 256 flags 0x00 dlen 12                #24
      L2CAP: Configure Request (0x04) ident 2 len 4
        Destination CID: 65
        Flags: 0x0000
> HCI Event: Number of Completed Packets (0x13) plen 5      #25
        Num handles: 1
        Handle: 256
        Count: 1
> HCI Event: Number of Completed Packets (0x13) plen 5      #26
        Num handles: 1
        Handle: 256
        Count: 1
> ACL Data RX: Handle 256 flags 0x02 dlen 16                #27
      L2CAP: Configure Request (0x04) ident 3 len 8
        Destination CID: 64
        Flags: 0x0000
        Option: Unknown (0x10) [hint]
        01 00                                            ..
< ACL Data TX: Handle 256 flags 0x00 dlen 18                #28
      L2CAP: Configure Response (0x05) ident 3 len 10
        Source CID: 65
        Flags: 0x0000
        Result: Success (0x0000)
        Option: Maximum Transmission Unit (0x01) [mandatory]
          MTU: 672
> HCI Event: Number of Completed Packets (0x13) plen 5      #29
        Num handles: 1
        Handle: 256
        Count: 1
> ACL Data RX: Handle 256 flags 0x02 dlen 14                #30
      L2CAP: Configure Response (0x05) ident 2 len 6
        Source CID: 64
        Flags: 0x0000
        Result: Success (0x0000)
> ACL Data RX: Handle 256 flags 0x02 dlen 20                #31
      L2CAP: Configure Request (0x04) ident 3 len 12
        Destination CID: 64
        Flags: 0x0000
        Option: Unknown (0x10) [hint]
        01 00 91 02 11 11                                ......
< ACL Data TX: Handle 256 flags 0x00 dlen 14                #32
      L2CAP: Command Reject (0x01) ident 3 len 6
        Reason: Invalid CID in request (0x0002)
        Destination CID: 64
        Source CID: 65
> HCI Event: Number of Completed Packets (0x13) plen 5      #33
        Num handles: 1
        Handle: 256
        Count: 1
...
=== After ===
...
> ACL Data RX: Handle 256 flags 0x02 dlen 12               #22
      L2CAP: Connection Request (0x02) ident 2 len 4
        PSM: 1 (0x0001)
        Source CID: 65
< ACL Data TX: Handle 256 flags 0x00 dlen 16               #23
      L2CAP: Connection Response (0x03) ident 2 len 8
        Destination CID: 64
        Source CID: 65
        Result: Connection successful (0x0000)
        Status: No further information available (0x0000)
< ACL Data TX: Handle 256 flags 0x00 dlen 12               #24
      L2CAP: Configure Request (0x04) ident 2 len 4
        Destination CID: 65
        Flags: 0x0000
> HCI Event: Number of Completed Packets (0x13) plen 5     #25
        Num handles: 1
        Handle: 256
        Count: 1
> HCI Event: Number of Completed Packets (0x13) plen 5     #26
        Num handles: 1
        Handle: 256
        Count: 1
> ACL Data RX: Handle 256 flags 0x02 dlen 16               #27
      L2CAP: Configure Request (0x04) ident 3 len 8
        Destination CID: 64
        Flags: 0x0000
        Option: Unknown (0x10) [hint]
        01 00                                            ..
< ACL Data TX: Handle 256 flags 0x00 dlen 18               #28
      L2CAP: Configure Response (0x05) ident 3 len 10
        Source CID: 65
        Flags: 0x0000
        Result: Success (0x0000)
        Option: Maximum Transmission Unit (0x01) [mandatory]
          MTU: 672
> HCI Event: Number of Completed Packets (0x13) plen 5     #29
        Num handles: 1
        Handle: 256
        Count: 1
> ACL Data RX: Handle 256 flags 0x02 dlen 14               #30
      L2CAP: Configure Response (0x05) ident 2 len 6
        Source CID: 64
        Flags: 0x0000
        Result: Success (0x0000)
> ACL Data RX: Handle 256 flags 0x02 dlen 20               #31
      L2CAP: Configure Request (0x04) ident 3 len 12
        Destination CID: 64
        Flags: 0x0000
        Option: Unknown (0x10) [hint]
        01 00 91 02 11 11                                .....
< ACL Data TX: Handle 256 flags 0x00 dlen 18               #32
      L2CAP: Configure Response (0x05) ident 3 len 10
        Source CID: 65
        Flags: 0x0000
        Result: Success (0x0000)
        Option: Maximum Transmission Unit (0x01) [mandatory]
          MTU: 672
< ACL Data TX: Handle 256 flags 0x00 dlen 12               #33
      L2CAP: Configure Request (0x04) ident 3 len 4
        Destination CID: 65
        Flags: 0x0000
> HCI Event: Number of Completed Packets (0x13) plen 5     #34
        Num handles: 1
        Handle: 256
        Count: 1
> HCI Event: Number of Completed Packets (0x13) plen 5     #35
        Num handles: 1
        Handle: 256
        Count: 1
...

Signed-off-by: Howard Chung <howardchung@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-12 08:22:28 +01:00
Joseph Hwang
b8d290525e Bluetooth: clean up connection in hci_cs_disconnect
In bluetooth core specification 4.2,
Vol 2, Part E, 7.8.9 LE Set Advertise Enable Command, it says

    The Controller shall continue advertising until ...
    or until a connection is created or ...
    In these cases, advertising is then disabled.

Hence, advertising would be disabled before a connection is
established. In current kernel implementation, advertising would
be re-enabled when all connections are terminated.

The correct disconnection flow looks like

  < HCI Command: Disconnect

  > HCI Event: Command Status
      Status: Success

  > HCI Event: Disconnect Complete
      Status: Success

Specifically, the last Disconnect Complete Event would trigger a
callback function hci_event.c:hci_disconn_complete_evt() to
cleanup the connection and re-enable advertising when proper.

However, sometimes, there might occur an exception in the controller
when disconnection is being executed. The disconnection flow might
then look like

  < HCI Command: Disconnect

  > HCI Event: Command Status
      Status: Unknown Connection Identifier

  Note that "> HCI Event: Disconnect Complete" is missing when such an
exception occurs. This would result in advertising staying disabled
forever since the connection in question is not cleaned up correctly.

To fix the controller exception issue, we need to do some connection
cleanup when the disconnect command status indicates an error.

Signed-off-by: Joseph Hwang <josephsih@chromium.org>
Signed-off-by: Manish Mandlik <mmandlik@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-12 08:21:15 +01:00
Davide Caratti
767d3ded5f net: mptcp: don't hang before sending 'MP capable with data'
the following packetdrill script

  socket(..., SOCK_STREAM, IPPROTO_MPTCP) = 3
  fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
  fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
  connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
  > S 0:0(0) <mss 1460,sackOK,TS val 100 ecr 0,nop,wscale 8,mpcapable v1 flags[flag_h] nokey>
  < S. 0:0(0) ack 1 win 65535 <mss 1460,sackOK,TS val 700 ecr 100,nop,wscale 8,mpcapable v1 flags[flag_h] key[skey=2]>
  > . 1:1(0) ack 1 win 256 <nop, nop, TS val 100 ecr 700,mpcapable v1 flags[flag_h] key[ckey,skey]>
  getsockopt(3, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
  fcntl(3, F_SETFL, O_RDWR) = 0
  write(3, ..., 1000) = 1000

doesn't transmit 1KB data packet after a successful three-way-handshake,
using mp_capable with data as required by protocol v1, and write() hangs
forever:

 PID: 973    TASK: ffff97dd399cae80  CPU: 1   COMMAND: "packetdrill"
  #0 [ffffa9b94062fb78] __schedule at ffffffff9c90a000
  #1 [ffffa9b94062fc08] schedule at ffffffff9c90a4a0
  #2 [ffffa9b94062fc18] schedule_timeout at ffffffff9c90e00d
  #3 [ffffa9b94062fc90] wait_woken at ffffffff9c120184
  #4 [ffffa9b94062fcb0] sk_stream_wait_connect at ffffffff9c75b064
  #5 [ffffa9b94062fd20] mptcp_sendmsg at ffffffff9c8e801c
  #6 [ffffa9b94062fdc0] sock_sendmsg at ffffffff9c747324
  #7 [ffffa9b94062fdd8] sock_write_iter at ffffffff9c7473c7
  #8 [ffffa9b94062fe48] new_sync_write at ffffffff9c302976
  #9 [ffffa9b94062fed0] vfs_write at ffffffff9c305685
 #10 [ffffa9b94062ff00] ksys_write at ffffffff9c305985
 #11 [ffffa9b94062ff38] do_syscall_64 at ffffffff9c004475
 #12 [ffffa9b94062ff50] entry_SYSCALL_64_after_hwframe at ffffffff9ca0008c
     RIP: 00007f959407eaf7  RSP: 00007ffe9e95a910  RFLAGS: 00000293
     RAX: ffffffffffffffda  RBX: 0000000000000008  RCX: 00007f959407eaf7
     RDX: 00000000000003e8  RSI: 0000000001785fe0  RDI: 0000000000000008
     RBP: 0000000001785fe0   R8: 0000000000000000   R9: 0000000000000003
     R10: 0000000000000007  R11: 0000000000000293  R12: 00000000000003e8
     R13: 00007ffe9e95ae30  R14: 0000000000000000  R15: 0000000000000000
     ORIG_RAX: 0000000000000001  CS: 0033  SS: 002b

Fix it ensuring that socket state is TCP_ESTABLISHED on reception of the
third ack.

Fixes: 1954b86016 ("mptcp: Check connection state before attempting send")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11 23:59:11 -07:00
Eric Dumazet
06669ea346 net: memcg: fix lockdep splat in inet_csk_accept()
Locking newsk while still holding the listener lock triggered
a lockdep splat [1]

We can simply move the memcg code after we release the listener lock,
as this can also help if multiple threads are sharing a common listener.

Also fix a typo while reading socket sk_rmem_alloc.

[1]
WARNING: possible recursive locking detected
5.6.0-rc3-syzkaller #0 Not tainted
--------------------------------------------
syz-executor598/9524 is trying to acquire lock:
ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
ffff88808b5b8b90 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492

but task is already holding lock:
ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(sk_lock-AF_INET6);
  lock(sk_lock-AF_INET6);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

1 lock held by syz-executor598/9524:
 #0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: lock_sock include/net/sock.h:1541 [inline]
 #0: ffff88808b5b9590 (sk_lock-AF_INET6){+.+.}, at: inet_csk_accept+0x8d/0xd30 net/ipv4/inet_connection_sock.c:445

stack backtrace:
CPU: 0 PID: 9524 Comm: syz-executor598 Not tainted 5.6.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x188/0x20d lib/dump_stack.c:118
 print_deadlock_bug kernel/locking/lockdep.c:2370 [inline]
 check_deadlock kernel/locking/lockdep.c:2411 [inline]
 validate_chain kernel/locking/lockdep.c:2954 [inline]
 __lock_acquire.cold+0x114/0x288 kernel/locking/lockdep.c:3954
 lock_acquire+0x197/0x420 kernel/locking/lockdep.c:4484
 lock_sock_nested+0xc5/0x110 net/core/sock.c:2947
 lock_sock include/net/sock.h:1541 [inline]
 inet_csk_accept+0x69f/0xd30 net/ipv4/inet_connection_sock.c:492
 inet_accept+0xe9/0x7c0 net/ipv4/af_inet.c:734
 __sys_accept4_file+0x3ac/0x5b0 net/socket.c:1758
 __sys_accept4+0x53/0x90 net/socket.c:1809
 __do_sys_accept4 net/socket.c:1821 [inline]
 __se_sys_accept4 net/socket.c:1818 [inline]
 __x64_sys_accept4+0x93/0xf0 net/socket.c:1818
 do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4445c9
Code: e8 0c 0d 03 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffc35b37608 EFLAGS: 00000246 ORIG_RAX: 0000000000000120
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004445c9
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 0000000000000000 R08: 0000000000306777 R09: 0000000000306777
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
R13: 00000000004053d0 R14: 0000000000000000 R15: 0000000000000000

Fixes: d752a49865 ("net: memcg: late association of sock to memcg")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Shakeel Butt <shakeelb@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11 23:57:33 -07:00
Paolo Lungaroni
2677625387 seg6: fix SRv6 L2 tunnels to use IANA-assigned protocol number
The Internet Assigned Numbers Authority (IANA) has recently assigned
a protocol number value of 143 for Ethernet [1].

Before this assignment, encapsulation mechanisms such as Segment Routing
used the IPv6-NoNxt protocol number (59) to indicate that the encapsulated
payload is an Ethernet frame.

In this patch, we add the definition of the Ethernet protocol number to the
kernel headers and update the SRv6 L2 tunnels to use it.

[1] https://www.iana.org/assignments/protocol-numbers/protocol-numbers.xhtml

Signed-off-by: Paolo Lungaroni <paolo.lungaroni@cnit.it>
Reviewed-by: Andrea Mayer <andrea.mayer@uniroma2.it>
Acked-by: Ahmed Abdelsalam <ahmed.abdelsalam@gssi.it>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11 23:49:30 -07:00
Andrew Lunn
a20f997010 net: dsa: Don't instantiate phylink for CPU/DSA ports unless needed
By default, DSA drivers should configure CPU and DSA ports to their
maximum speed. In many configurations this is sufficient to make the
link work.

In some cases it is necessary to configure the link to run slower,
e.g. because of limitations of the SoC it is connected to. Or back to
back PHYs are used and the PHY needs to be driven in order to
establish link. In this case, phylink is used.

Only instantiate phylink if it is required. If there is no PHY, or no
fixed link properties, phylink can upset a link which works in the
default configuration.

Fixes: 0e27921816 ("net: dsa: Use PHYLINK for the CPU/DSA ports")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11 23:46:11 -07:00
Jules Irenge
64fbca0119 net: Add missing annotation for *netlink_seq_start()
Sparse reports a warning at netlink_seq_start()

warning: context imbalance in netlink_seq_start() - wrong count at exit
The root cause is the missing annotation at netlink_seq_start()
Add the missing  __acquires(RCU) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11 23:19:41 -07:00
Jules Irenge
734c8f7574 tcp: Add missing annotation for tcp_child_process()
Sparse reports warning at tcp_child_process()
warning: context imbalance in tcp_child_process() - unexpected unlock
The root cause is the missing annotation at tcp_child_process()

Add the missing __releases(&((child)->sk_lock.slock)) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11 23:19:41 -07:00
Jules Irenge
0d8a42c93a raw: Add missing annotations to raw_seq_start() and raw_seq_stop()
Sparse reports warnings at raw_seq_start() and raw_seq_stop()

warning: context imbalance in raw_seq_start() - wrong count at exit
warning: context imbalance in raw_seq_stop() - unexpected unlock

The root cause is the missing annotations at raw_seq_start()
	and raw_seq_stop()
Add the missing __acquires(&h->lock) annotation
Add the missing __releases(&h->lock) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11 23:19:40 -07:00
Julian Wiedmann
4cda75275f net: sched: make newly activated qdiscs visible
In their .attach callback, mq[prio] only add the qdiscs of the currently
active TX queues to the device's qdisc hash list.
If a user later increases the number of active TX queues, their qdiscs
are not visible via eg. 'tc qdisc show'.

Add a hook to netif_set_real_num_tx_queues() that walks all active
TX queues and adds those which are missing to the hash list.

CC: Eric Dumazet <edumazet@google.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: Cong Wang <xiyou.wangcong@gmail.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11 23:17:28 -07:00
Willem de Bruijn
46e4c421a0 net/packet: tpacket_rcv: do not increment ring index on drop
In one error case, tpacket_rcv drops packets after incrementing the
ring producer index.

If this happens, it does not update tp_status to TP_STATUS_USER and
thus the reader is stalled for an iteration of the ring, causing out
of order arrival.

The only such error path is when virtio_net_hdr_from_skb fails due
to encountering an unknown GSO type.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11 23:12:16 -07:00
Amol Grover
f9fc28a8de net: caif: Add lockdep expression to RCU traversal primitive
caifdevs->list is traversed using list_for_each_entry_rcu()
outside an RCU read-side critical section but under the
protection of rtnl_mutex. Hence, add the corresponding lockdep
expression to silence the following false-positive warning:

[   10.868467] =============================
[   10.869082] WARNING: suspicious RCU usage
[   10.869817] 5.6.0-rc1-00177-g06ec0a154aae4 #1 Not tainted
[   10.870804] -----------------------------
[   10.871557] net/caif/caif_dev.c:115 RCU-list traversed in non-reader section!!

Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11 22:55:25 -07:00
Abhishek Pandit-Subedi
4867bd007d Bluetooth: Pause discovery and advertising during suspend
To prevent spurious wake ups, we disable any discovery or advertising
when we enter suspend and restore it when we exit suspend. While paused,
we disable any management requests to modify discovery or advertising.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-11 18:03:49 +01:00
Abhishek Pandit-Subedi
dd522a7429 Bluetooth: Handle LE devices during suspend
To handle LE devices, we must first disable passive scanning and
disconnect all connected devices. Once that is complete, we update the
whitelist and re-enable scanning

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-11 18:02:46 +01:00
Abhishek Pandit-Subedi
4f40afc6c7 Bluetooth: Handle BR/EDR devices during suspend
To handle BR/EDR devices, we first disable page scan and disconnect all
connected devices. Once that is complete, we add event filters (for
devices that can wake the system) and re-enable page scan.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-11 18:01:47 +01:00
Abhishek Pandit-Subedi
9952d90ea2 Bluetooth: Handle PM_SUSPEND_PREPARE and PM_POST_SUSPEND
Register for PM_SUSPEND_PREPARE and PM_POST_SUSPEND to make sure the
Bluetooth controller is prepared correctly for suspend/resume. Implement
the registration, scheduling and task handling portions only in this
patch.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-11 18:00:48 +01:00
Joseph Hwang
72da7b2cca Bluetooth: mgmt: add mgmt_cmd_status in add_advertising
If an error occurs during request building in add_advertising(),
remember to send MGMT_STATUS_FAILED command status back to bluetoothd.

Signed-off-by: Joseph Hwang <josephsih@chromium.org>
Signed-off-by: Manish Mandlik <mmandlik@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-11 15:33:56 +01:00
Alain Michaud
8a5956197d Bluetooth: fix off by one in err_data_reporting cmd masks.
This change fixes the off by one error in the erroneous command bit
masks which can lead to the erroneous data commands being sent to a
controller that doesn't support them.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-11 15:25:15 +01:00
John Fastabend
90db6d772f bpf, sockmap: Remove bucket->lock from sock_{hash|map}_free
The bucket->lock is not needed in the sock_hash_free and sock_map_free
calls, in fact it is causing a splat due to being inside rcu block.

| BUG: sleeping function called from invalid context at net/core/sock.c:2935
| in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 62, name: kworker/0:1
| 3 locks held by kworker/0:1/62:
|  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
|  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
|  #2: ffff8881381f6df8 (&stab->lock){+...}, at: sock_map_free+0x26/0x180
| CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04008-g7b083332376e #454
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
| Workqueue: events bpf_map_free_deferred
| Call Trace:
|  dump_stack+0x71/0xa0
|  ___might_sleep.cold+0xa6/0xb6
|  lock_sock_nested+0x28/0x90
|  sock_map_free+0x5f/0x180
|  bpf_map_free_deferred+0x58/0x80
|  process_one_work+0x260/0x5e0
|  worker_thread+0x4d/0x3e0
|  kthread+0x108/0x140
|  ? process_one_work+0x5e0/0x5e0
|  ? kthread_park+0x90/0x90
|  ret_from_fork+0x3a/0x50

The reason we have stab->lock and bucket->locks in sockmap code is to
handle checking EEXIST in update/delete cases. We need to be careful during
an update operation that we check for EEXIST and we need to ensure that the
psock object is not in some partial state of removal/insertion while we do
this. So both map_update_common and sock_map_delete need to guard from being
run together potentially deleting an entry we are checking, etc. But by the
time we get to the tear-down code in sock_{ma[|hash}_free we have already
disconnected the map and we just did synchronize_rcu() in the line above so
no updates/deletes should be in flight. Because of this we can drop the
bucket locks from the map free'ing code, noting no update/deletes can be
in-flight.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Reported-by: Jakub Sitnicki <jakub@cloudflare.com>
Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/158385850787.30597.8346421465837046618.stgit@john-Precision-5820-Tower
2020-03-11 14:08:52 +01:00
Marcel Holtmann
00b383b8ab Bluetooth: Use bt_dev_err for RPA generation failure message
When the RPA generation fails, indicate the error with a device specifc
error message.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-03-11 10:05:58 +02:00
Nicolas Cavallari
ba32679cac mac80211: Do not send mesh HWMP PREQ if HWMP is disabled
When trying to transmit to an unknown destination, the mesh code would
unconditionally transmit a HWMP PREQ even if HWMP is not the current
path selection algorithm.

Signed-off-by: Nicolas Cavallari <nicolas.cavallari@green-communications.fr>
Link: https://lore.kernel.org/r/20200305140409.12204-1-cavallar@lri.fr
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-11 09:04:14 +01:00
Jakub Kicinski
5cde05c61c nl80211: add missing attribute validation for channel switch
Add missing attribute validation for NL80211_ATTR_OPER_CLASS
to the netlink policy.

Fixes: 1057d35ede ("cfg80211: introduce TDLS channel switch commands")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20200303051058.4089398-4-kuba@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-11 08:58:39 +01:00
Jakub Kicinski
056e9375e1 nl80211: add missing attribute validation for beacon report scanning
Add missing attribute validation for beacon report scanning
to the netlink policy.

Fixes: 1d76250bd3 ("nl80211: support beacon report scanning")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20200303051058.4089398-3-kuba@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-11 08:58:31 +01:00
Jakub Kicinski
0e1a1d853e nl80211: add missing attribute validation for critical protocol indication
Add missing attribute validation for critical protocol fields
to the netlink policy.

Fixes: 5de1798489 ("cfg80211: introduce critical protocol indication from user-space")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20200303051058.4089398-2-kuba@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-11 08:58:27 +01:00
Lukas Wunner
1e09e5818b pktgen: Allow on loopback device
When pktgen is used to measure the performance of dev_queue_xmit()
packet handling in the core, it is preferable to not hand down
packets to a low-level Ethernet driver as it would distort the
measurements.

Allow using pktgen on the loopback device, thus constraining
measurements to core code.

Signed-off-by: Lukas Wunner <lukas@wunner.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10 15:44:59 -07:00
Karsten Graul
ece0d7bd74 net/smc: cancel event worker during device removal
During IB device removal, cancel the event worker before the device
structure is freed.

Fixes: a4cf0443c4 ("smc: introduce SMC as an IB-client")
Reported-by: syzbot+b297c6825752e7a07272@syzkaller.appspotmail.com
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10 15:40:33 -07:00
Hangbin Liu
60380488e4 ipv6/addrconf: call ipv6_mc_up() for non-Ethernet interface
Rafał found an issue that for non-Ethernet interface, if we down and up
frequently, the memory will be consumed slowly.

The reason is we add allnodes/allrouters addressed in multicast list in
ipv6_add_dev(). When link down, we call ipv6_mc_down(), store all multicast
addresses via mld_add_delrec(). But when link up, we don't call ipv6_mc_up()
for non-Ethernet interface to remove the addresses. This makes idev->mc_tomb
getting bigger and bigger. The call stack looks like:

addrconf_notify(NETDEV_REGISTER)
	ipv6_add_dev
		ipv6_dev_mc_inc(ff01::1)
		ipv6_dev_mc_inc(ff02::1)
		ipv6_dev_mc_inc(ff02::2)

addrconf_notify(NETDEV_UP)
	addrconf_dev_config
		/* Alas, we support only Ethernet autoconfiguration. */
		return;

addrconf_notify(NETDEV_DOWN)
	addrconf_ifdown
		ipv6_mc_down
			igmp6_group_dropped(ff02::2)
				mld_add_delrec(ff02::2)
			igmp6_group_dropped(ff02::1)
			igmp6_group_dropped(ff01::1)

After investigating, I can't found a rule to disable multicast on
non-Ethernet interface. In RFC2460, the link could be Ethernet, PPP, ATM,
tunnels, etc. In IPv4, it doesn't check the dev type when calls ip_mc_up()
in inetdev_event(). Even for IPv6, we don't check the dev type and call
ipv6_add_dev(), ipv6_dev_mc_inc() after register device.

So I think it's OK to fix this memory consumer by calling ipv6_mc_up() for
non-Ethernet interface.

v2: Also check IFF_MULTICAST flag to make sure the interface supports
    multicast

Reported-by: Rafał Miłecki <zajec5@gmail.com>
Tested-by: Rafał Miłecki <zajec5@gmail.com>
Fixes: 74235a25c6 ("[IPV6] addrconf: Fix IPv6 on tuntap tunnels")
Fixes: 1666d49e1d ("mld: do not remove mld souce list info when set link down")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10 15:37:49 -07:00
Shakeel Butt
d752a49865 net: memcg: late association of sock to memcg
If a TCP socket is allocated in IRQ context or cloned from unassociated
(i.e. not associated to a memcg) in IRQ context then it will remain
unassociated for its whole life. Almost half of the TCPs created on the
system are created in IRQ context, so, memory used by such sockets will
not be accounted by the memcg.

This issue is more widespread in cgroup v1 where network memory
accounting is opt-in but it can happen in cgroup v2 if the source socket
for the cloning was created in root memcg.

To fix the issue, just do the association of the sockets at the accept()
time in the process context and then force charge the memory buffer
already used and reserved by the socket.

Signed-off-by: Shakeel Butt <shakeelb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10 15:33:05 -07:00
Jens Axboe
0a384abfae net: abstract out normal and compat msghdr import
This splits it into two parts, one that imports the message, and one
that imports the iovec. This allows a caller to only do the first part,
and import the iovec manually afterwards.

No functional changes in this patch.

Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10 09:12:49 -06:00
Florian Westphal
ec33916d47 mptcp: don't grow mptcp socket receive buffer when rcvbuf is locked
The mptcp rcvbuf size is adjusted according to the subflow rcvbuf size.
This should not be done if userspace did set a fixed value.

Fixes: 600911ff5f ("mptcp: add rmem queue accounting")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-09 19:30:08 -07:00
David S. Miller
6c1a55feca This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
 
  - Avoid RCU list-traversal in spinlock, by Sven Eckelmann
 
  - Replace zero-length array with flexible-array member,
    by Gustavo A. R. Silva
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl5iPgIWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoZL8D/9LHWjIS4wlyhm40ZS4GKOqroSL
 4kPWutv4RIC/E5VXOukYG+p52GTSYDyCSAshviq0Z9/tIXJpwPwWkjO5wJ0zQ1Po
 PmPNcWgIi/Iyx7KXWD8fm7fdpfsQptgVfjP8OshS4in5gOjTe8D2bN3XM2q4CGT7
 pj4/uV+x/6CGKW7pHEpQlIeMu4mRtAcglaPZ03QzuTpKooHkgGi8gsFcVa1ZWzlU
 7i42QEcJcfjfhIlyE7xynR8NdPkC2kEdxBGjBn05gTNPswLlaub6bdnTUUfwsd0R
 Vvz2ti06uqfLwIYC/2G05nNCejVIEczJvKZTHnHsegtuTB6PEbi3r/tpB8C3gIbK
 hZIe36Dm1nmhALXjoM3OzNlm5brw5+P0bX8zzgFwmqYOzpaV+XGhtK8aFox1oEkC
 BltnhjZOtwk0UY+72vV+1iuBv/VYpxzv5UPiE+L8TRP9qkyygmTh/ZcThkW0J/25
 MU2T3yiycada0IdSOxS4E4TFo4CS5tZqS6O440qnX1GTxgCojM6Qsp5ypM+nbbGC
 07tcp8RjMXrH2S5imiDtgSBzixIRSQYHEm5lEMflpH8RJlUQK1MGPF6dltDzaCy9
 7UAoJTVCiMa6RvbgDSUeeNPM7ggUFztF0in2Ok9M9GpcnkKouy2lrdXAeKe2cfq+
 nfeZpPbV5mIIPayrZg==
 =f2xH
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20200306' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - Avoid RCU list-traversal in spinlock, by Sven Eckelmann

 - Replace zero-length array with flexible-array member,
   by Gustavo A. R. Silva
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-09 19:14:18 -07:00
David S. Miller
2362059427 Here is a batman-adv bugfix:
- Don't schedule OGM for disabled interface, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl5iPBEWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoX+wEAC3nG5q5Yu6G81vSdQiu8ey4xjR
 CIGl+kSxX/ryK7GaEhvSZ+/rdFbVsJN276CjvLTbnwIHf0QJV6a6A68+Xo0qG6Nc
 7Bpo9VP43GQJfmzFdqMQz5W/oVd88XKCfF7V9Cf+FIMkBTmuuxGfmNKcjpJVWZD4
 imQihGpknW/QUjlKuIVHD79ZEBn75iCR0HYK8Xuc88XIDUUDA/fROdXJS8jge4BM
 4UBFn46xQndBAVOJkYwd2FKhLwLT0YVJqMLBoQBf3Lq5Td6R2x6yrzXfyzNlKXEp
 oaZsHSB0Zo9v7ICxqY+CA0Yk8SxRrbGV2+cxjOHgudDcDnWF0MELcah0KOmcsPcx
 +G+dacTECrI0lxDu+LAWL//weKjHVh7WOeJk8wqRDgSV7AmCrSZS83v1cRXPQF41
 0UTKbWPnCe6TANmgllt+H7ere2O9vGbrLPD9jLH8kE1eDgfG1lLC8PjIxA5L/6T5
 sEvDg2/6JdQtAjqrAc+MP4hm4H2yEzK+n6ozA1Fn9gXn1RiKFr/zOB33PyaHUkUL
 K2MXVZANW0RLKRIb/zXQ8bJcMEAU7aeHZytm7n/JmRpSkRV1uvJSgeOF/NvZFg9P
 uBE0TO2RzXmJyUBzdSdZfrCSPobFbR1Gs9Uks+1U5FXjxw6j9adx+/5WuRbhJii1
 qdv3ueo7awMeX3RrJg==
 =JJc9
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20200306' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here is a batman-adv bugfix:

 - Don't schedule OGM for disabled interface, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-09 19:08:43 -07:00
Dmitry Yakunin
018d26fcd1 cgroup, netclassid: periodically release file_lock on classid updating
In our production environment we have faced with problem that updating
classid in cgroup with heavy tasks cause long freeze of the file tables
in this tasks. By heavy tasks we understand tasks with many threads and
opened sockets (e.g. balancers). This freeze leads to an increase number
of client timeouts.

This patch implements following logic to fix this issue:
аfter iterating 1000 file descriptors file table lock will be released
thus providing a time gap for socket creation/deletion.

Now update is non atomic and socket may be skipped using calls:

dup2(oldfd, newfd);
close(oldfd);

But this case is not typical. Moreover before this patch skip is possible
too by hiding socket fd in unix socket buffer.

New sockets will be allocated with updated classid because cgroup state
is updated before start of the file descriptors iteration.

So in common cases this patch has no side effects.

Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-09 18:13:39 -07:00
Leslie Monis
3f95f55eb5 net: sched: pie: change tc_pie_xstats->prob
Commit 105e808c1d ("pie: remove pie_vars->accu_prob_overflows")
changes the scale of probability values in PIE from (2^64 - 1) to
(2^56 - 1). This affects the precision of tc_pie_xstats->prob in
user space.

This patch ensures user space is unaffected.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-09 18:05:55 -07:00
Yousuk Seung
e08ab0b377 tcp: add bytes not sent to SCM_TIMESTAMPING_OPT_STATS
Add TCP_NLA_BYTES_NOTSENT to SCM_TIMESTAMPING_OPT_STATS that reports
bytes in the write queue but not sent. This is the same metric as
what is exported with tcp_info.tcpi_notsent_bytes.

Signed-off-by: Yousuk Seung <ysseung@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-09 17:56:33 -07:00
Lorenz Bauer
7b98cd42b0 bpf: sockmap: Add UDP support
Allow adding hashed UDP sockets to sockmaps.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-9-lmb@cloudflare.com
2020-03-09 22:34:58 +01:00
Lorenz Bauer
edc6741cc6 bpf: Add sockmap hooks for UDP sockets
Add basic psock hooks for UDP sockets. This allows adding and
removing sockets, as well as automatic removal on unhash and close.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-8-lmb@cloudflare.com
2020-03-09 22:34:58 +01:00
Lorenz Bauer
cb21802b39 bpf: sockmap: Simplify sock_map_init_proto
We can take advantage of the fact that both callers of
sock_map_init_proto are holding a RCU read lock, and
have verified that psock is valid.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-7-lmb@cloudflare.com
2020-03-09 22:34:58 +01:00
Lorenz Bauer
f747632b60 bpf: sockmap: Move generic sockmap hooks from BPF TCP
The init, close and unhash handlers from TCP sockmap are generic,
and can be reused by UDP sockmap. Move the helpers into the sockmap code
base and expose them. This requires tcp_bpf_get_proto and tcp_bpf_clone to
be conditional on BPF_STREAM_PARSER.

The moved functions are unmodified, except that sk_psock_unlink is
renamed to sock_map_unlink to better match its behaviour.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-6-lmb@cloudflare.com
2020-03-09 22:34:58 +01:00
Lorenz Bauer
d19da360ee bpf: tcp: Move assertions into tcp_bpf_get_proto
We need to ensure that sk->sk_prot uses certain callbacks, so that
code that directly calls e.g. tcp_sendmsg in certain corner cases
works. To avoid spurious asserts, we must to do this only if
sk_psock_update_proto has not yet been called. The same invariants
apply for tcp_bpf_check_v6_needs_rebuild, so move the call as well.

Doing so allows us to merge tcp_bpf_init and tcp_bpf_reinit.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-4-lmb@cloudflare.com
2020-03-09 22:34:58 +01:00
Lorenz Bauer
1a2e20132d skmsg: Update saved hooks only once
Only update psock->saved_* if psock->sk_proto has not been initialized
yet. This allows us to get rid of tcp_bpf_reinit_sk_prot.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-3-lmb@cloudflare.com
2020-03-09 22:34:58 +01:00
Lorenz Bauer
7b70973d7e bpf: sockmap: Only check ULP for TCP sockets
The sock map code checks that a socket does not have an active upper
layer protocol before inserting it into the map. This requires casting
via inet_csk, which isn't valid for UDP sockets.

Guard checks for ULP by checking inet_sk(sk)->is_icsk first.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-2-lmb@cloudflare.com
2020-03-09 22:34:58 +01:00
Dmitry Yakunin
83f73c5bb7 inet_diag: return classid for all socket types
In commit 1ec17dbd90 ("inet_diag: fix reporting cgroup classid and
fallback to priority") croup classid reporting was fixed. But this works
only for TCP sockets because for other socket types icsk parameter can
be NULL and classid code path is skipped. This change moves classid
handling to inet_diag_msg_attrs_fill() function.

Also inet_diag_msg_attrs_size() helper was added and addends in
nlmsg_new() were reordered to save order from inet_sk_diag_fill().

Fixes: 1ec17dbd90 ("inet_diag: fix reporting cgroup classid and fallback to priority")
Signed-off-by: Dmitry Yakunin <zeil@yandex-team.ru>
Reviewed-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-08 21:57:48 -07:00
Eric Dumazet
138470a9b2 net/sched: act_ct: fix lockdep splat in tcf_ct_flow_table_get
Convert zones_lock spinlock to zones_mutex mutex,
and struct (tcf_ct_flow_table)->ref to a refcount,
so that control path can use regular GFP_KERNEL allocations
from standard process context. This is more robust
in case of memory pressure.

The refcount is needed because tcf_ct_flow_table_put() can
be called from RCU callback, thus in BH context.

The issue was spotted by syzbot, as rhashtable_init()
was called with a spinlock held, which is bad since GFP_KERNEL
allocations can sleep.

Note to developers : Please make sure your patches are tested
with CONFIG_DEBUG_ATOMIC_SLEEP=y

BUG: sleeping function called from invalid context at mm/slab.h:565
in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 9582, name: syz-executor610
2 locks held by syz-executor610/9582:
 #0: ffffffff8a34eb80 (rtnl_mutex){+.+.}, at: rtnl_lock net/core/rtnetlink.c:72 [inline]
 #0: ffffffff8a34eb80 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x3f9/0xad0 net/core/rtnetlink.c:5437
 #1: ffffffff8a3961b8 (zones_lock){+...}, at: spin_lock_bh include/linux/spinlock.h:343 [inline]
 #1: ffffffff8a3961b8 (zones_lock){+...}, at: tcf_ct_flow_table_get+0xa3/0x1700 net/sched/act_ct.c:67
Preemption disabled at:
[<0000000000000000>] 0x0
CPU: 0 PID: 9582 Comm: syz-executor610 Not tainted 5.6.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x188/0x20d lib/dump_stack.c:118
 ___might_sleep.cold+0x1f4/0x23d kernel/sched/core.c:6798
 slab_pre_alloc_hook mm/slab.h:565 [inline]
 slab_alloc_node mm/slab.c:3227 [inline]
 kmem_cache_alloc_node_trace+0x272/0x790 mm/slab.c:3593
 __do_kmalloc_node mm/slab.c:3615 [inline]
 __kmalloc_node+0x38/0x60 mm/slab.c:3623
 kmalloc_node include/linux/slab.h:578 [inline]
 kvmalloc_node+0x61/0xf0 mm/util.c:574
 kvmalloc include/linux/mm.h:645 [inline]
 kvzalloc include/linux/mm.h:653 [inline]
 bucket_table_alloc+0x8b/0x480 lib/rhashtable.c:175
 rhashtable_init+0x3d2/0x750 lib/rhashtable.c:1054
 nf_flow_table_init+0x16d/0x310 net/netfilter/nf_flow_table_core.c:498
 tcf_ct_flow_table_get+0xe33/0x1700 net/sched/act_ct.c:82
 tcf_ct_init+0xba4/0x18a6 net/sched/act_ct.c:1050
 tcf_action_init_1+0x697/0xa20 net/sched/act_api.c:945
 tcf_action_init+0x1e9/0x2f0 net/sched/act_api.c:1001
 tcf_action_add+0xdb/0x370 net/sched/act_api.c:1411
 tc_ctl_action+0x366/0x456 net/sched/act_api.c:1466
 rtnetlink_rcv_msg+0x44e/0xad0 net/core/rtnetlink.c:5440
 netlink_rcv_skb+0x15a/0x410 net/netlink/af_netlink.c:2478
 netlink_unicast_kernel net/netlink/af_netlink.c:1303 [inline]
 netlink_unicast+0x537/0x740 net/netlink/af_netlink.c:1329
 netlink_sendmsg+0x882/0xe10 net/netlink/af_netlink.c:1918
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:672
 ____sys_sendmsg+0x6b9/0x7d0 net/socket.c:2343
 ___sys_sendmsg+0x100/0x170 net/socket.c:2397
 __sys_sendmsg+0xec/0x1b0 net/socket.c:2430
 do_syscall_64+0xf6/0x790 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4403d9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd719af218 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004403d9
RDX: 0000000000000000 RSI: 0000000020000300 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000005 R09: 00000000004002c8
R10: 0000000000000008 R11: 00000000000

Fixes: c34b961a24 ("net/sched: act_ct: Create nf flow table per zone")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Paul Blakey <paulb@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-08 21:47:20 -07:00
Eric Dumazet
17c25cafd4 gre: fix uninit-value in __iptunnel_pull_header
syzbot found an interesting case of the kernel reading
an uninit-value [1]

Problem is in the handling of ETH_P_WCCP in gre_parse_header()

We look at the byte following GRE options to eventually decide
if the options are four bytes longer.

Use skb_header_pointer() to not pull bytes if we found
that no more bytes were needed.

All callers of gre_parse_header() are properly using pskb_may_pull()
anyway before proceeding to next header.

[1]
BUG: KMSAN: uninit-value in pskb_may_pull include/linux/skbuff.h:2303 [inline]
BUG: KMSAN: uninit-value in __iptunnel_pull_header+0x30c/0xbd0 net/ipv4/ip_tunnel_core.c:94
CPU: 1 PID: 11784 Comm: syz-executor940 Not tainted 5.6.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x220 lib/dump_stack.c:118
 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118
 __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
 pskb_may_pull include/linux/skbuff.h:2303 [inline]
 __iptunnel_pull_header+0x30c/0xbd0 net/ipv4/ip_tunnel_core.c:94
 iptunnel_pull_header include/net/ip_tunnels.h:411 [inline]
 gre_rcv+0x15e/0x19c0 net/ipv6/ip6_gre.c:606
 ip6_protocol_deliver_rcu+0x181b/0x22c0 net/ipv6/ip6_input.c:432
 ip6_input_finish net/ipv6/ip6_input.c:473 [inline]
 NF_HOOK include/linux/netfilter.h:307 [inline]
 ip6_input net/ipv6/ip6_input.c:482 [inline]
 ip6_mc_input+0xdf2/0x1460 net/ipv6/ip6_input.c:576
 dst_input include/net/dst.h:442 [inline]
 ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
 NF_HOOK include/linux/netfilter.h:307 [inline]
 ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:306
 __netif_receive_skb_one_core net/core/dev.c:5198 [inline]
 __netif_receive_skb net/core/dev.c:5312 [inline]
 netif_receive_skb_internal net/core/dev.c:5402 [inline]
 netif_receive_skb+0x66b/0xf20 net/core/dev.c:5461
 tun_rx_batched include/linux/skbuff.h:4321 [inline]
 tun_get_user+0x6aef/0x6f60 drivers/net/tun.c:1997
 tun_chr_write_iter+0x1f2/0x360 drivers/net/tun.c:2026
 call_write_iter include/linux/fs.h:1901 [inline]
 new_sync_write fs/read_write.c:483 [inline]
 __vfs_write+0xa5a/0xca0 fs/read_write.c:496
 vfs_write+0x44a/0x8f0 fs/read_write.c:558
 ksys_write+0x267/0x450 fs/read_write.c:611
 __do_sys_write fs/read_write.c:623 [inline]
 __se_sys_write fs/read_write.c:620 [inline]
 __ia32_sys_write+0xdb/0x120 fs/read_write.c:620
 do_syscall_32_irqs_on arch/x86/entry/common.c:339 [inline]
 do_fast_syscall_32+0x3c7/0x6e0 arch/x86/entry/common.c:410
 entry_SYSENTER_compat+0x68/0x77 arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7f62d99
Code: 90 e8 0b 00 00 00 f3 90 0f ae e8 eb f9 8d 74 26 00 89 3c 24 c3 90 90 90 90 90 90 90 90 90 90 90 90 51 52 55 89 e5 0f 34 cd 80 <5d> 5a 59 c3 90 90 90 90 eb 0d 90 90 90 90 90 90 90 90 90 90 90 90
RSP: 002b:00000000fffedb2c EFLAGS: 00000217 ORIG_RAX: 0000000000000004
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000020002580
RDX: 0000000000000fca RSI: 0000000000000036 RDI: 0000000000000004
RBP: 0000000000008914 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:144 [inline]
 kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:127
 kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:82
 slab_alloc_node mm/slub.c:2793 [inline]
 __kmalloc_node_track_caller+0xb40/0x1200 mm/slub.c:4401
 __kmalloc_reserve net/core/skbuff.c:142 [inline]
 __alloc_skb+0x2fd/0xac0 net/core/skbuff.c:210
 alloc_skb include/linux/skbuff.h:1051 [inline]
 alloc_skb_with_frags+0x18c/0xa70 net/core/skbuff.c:5766
 sock_alloc_send_pskb+0xada/0xc60 net/core/sock.c:2242
 tun_alloc_skb drivers/net/tun.c:1529 [inline]
 tun_get_user+0x10ae/0x6f60 drivers/net/tun.c:1843
 tun_chr_write_iter+0x1f2/0x360 drivers/net/tun.c:2026
 call_write_iter include/linux/fs.h:1901 [inline]
 new_sync_write fs/read_write.c:483 [inline]
 __vfs_write+0xa5a/0xca0 fs/read_write.c:496
 vfs_write+0x44a/0x8f0 fs/read_write.c:558
 ksys_write+0x267/0x450 fs/read_write.c:611
 __do_sys_write fs/read_write.c:623 [inline]
 __se_sys_write fs/read_write.c:620 [inline]
 __ia32_sys_write+0xdb/0x120 fs/read_write.c:620
 do_syscall_32_irqs_on arch/x86/entry/common.c:339 [inline]
 do_fast_syscall_32+0x3c7/0x6e0 arch/x86/entry/common.c:410
 entry_SYSENTER_compat+0x68/0x77 arch/x86/entry/entry_64_compat.S:139

Fixes: 95f5c64c3c ("gre: Move utility functions to common headers")
Fixes: c544193214 ("GRE: Refactor GRE tunneling code.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-08 21:25:37 -07:00
Jiri Pirko
44f8658017 sched: act: allow user to specify type of HW stats for a filter
Currently, user who is adding an action expects HW to report stats,
however it does not have exact expectations about the stats types.
That is aligned with TCA_ACT_HW_STATS_TYPE_ANY.

Allow user to specify the type of HW stats for an action and require it.

Pass the information down to flow_offload layer.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-08 21:07:48 -07:00
Jiri Pirko
319a1d1947 flow_offload: check for basic action hw stats type
Introduce flow_action_basic_hw_stats_types_check() helper and use it
in drivers. That sanitizes the drivers which do not have support
for action HW stats types.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-08 21:07:48 -07:00
Qiujun Huang
71811cac85 Bluetooth: RFCOMM: fix ODEBUG bug in rfcomm_dev_ioctl
Needn't call 'rfcomm_dlc_put' here, because 'rfcomm_dlc_exists' didn't
increase dlc->refcnt.

Reported-by: syzbot+4496e82090657320efc6@syzkaller.appspotmail.com
Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Suggested-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-08 14:44:58 +01:00
Luiz Augusto von Dentz
4be5ca67d5 Bluetooth: L2CAP: Add module option to enable ECRED mode
This should make it safe to have the code upstream without affecting
stable systems since there are a few details not sort out with ECRED
mode e.g: how to initiate multiple connections at once.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-08 10:05:40 +01:00
Luiz Augusto von Dentz
15f02b9105 Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode
This adds the initial code for Enhanced Credit Based Mode which
introduces a new socket mode called L2CAP_MODE_EXT_FLOWCTL, which for
the most part work the same as L2CAP_MODE_LE_FLOWCTL but uses different
PDUs to setup the connections and also works over BR/EDR.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-08 10:05:40 +01:00
Marcel Holtmann
985048f427 Bluetooth: Increment management interface revision
Increment the mgmt revision due to the recently added setting and
command.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-03-08 10:49:47 +02:00
Alain Michaud
00bce3fb06 Bluetooth: Enable erroneous data reporting if WBS is supported
This change introduces a wide band speech setting which allows higher
level clients to query the local controller support for wide band speech
as well as set the setting state when the radio is powered off.
Internally, this setting controls if erroneous data reporting is enabled
on the controller.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-08 08:50:42 +01:00
Luiz Augusto von Dentz
55cee73e2a Bluetooth: Make use of skb_pull to parse L2CAP signaling PDUs
This uses skb_pull when parsing signalling PDUs so skb->data for
pointing to the current PDU and skb->len as the remaining bytes to be
processed.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-08 08:44:22 +01:00
Marcel Holtmann
debdedf2eb Bluetooth: Fix calculation of SCO handle for packet processing
When processing SCO packets, the handle is wrongly assumed as 16-bit
value. The actual size is 12-bits and the other 4-bits are used for
packet flags.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-03-07 09:50:15 +02:00
David S. Miller
357ddbb9bf Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Patches to bump position index from sysctl seq_next,
   from Vasilin Averin.

2) Release flowtable hook from error path, from Florian Westphal.

3) Patches to add missing netlink attribute validation,
   from Jakub Kicinski.

4) Missing NFTA_CHAIN_FLAGS in nf_tables_fill_chain_info().

5) Infinite loop in module autoload if extension is not available,
   from Florian Westphal.

6) Missing module ownership in inet/nat chain type definition.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-06 21:36:25 -08:00
Pablo Neira Ayuso
6a42cefb25 netfilter: nft_chain_nat: inet family is missing module ownership
Set owner to THIS_MODULE, otherwise the nft_chain_nat module might be
removed while there are still inet/nat chains in place.

[  117.942096] BUG: unable to handle page fault for address: ffffffffa0d5e040
[  117.942101] #PF: supervisor read access in kernel mode
[  117.942103] #PF: error_code(0x0000) - not-present page
[  117.942106] PGD 200c067 P4D 200c067 PUD 200d063 PMD 3dc909067 PTE 0
[  117.942113] Oops: 0000 [#1] PREEMPT SMP PTI
[  117.942118] CPU: 3 PID: 27 Comm: kworker/3:0 Not tainted 5.6.0-rc3+ #348
[  117.942133] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
[  117.942145] RIP: 0010:nf_tables_chain_destroy.isra.0+0x94/0x15a [nf_tables]
[  117.942149] Code: f6 45 54 01 0f 84 d1 00 00 00 80 3b 05 74 44 48 8b 75 e8 48 c7 c7 72 be de a0 e8 56 e6 2d e0 48 8b 45 e8 48 c7 c7 7f be de a0 <48> 8b 30 e8 43 e6 2d e0 48 8b 45 e8 48 8b 40 10 48 85 c0 74 5b 8b
[  117.942152] RSP: 0018:ffffc9000015be10 EFLAGS: 00010292
[  117.942155] RAX: ffffffffa0d5e040 RBX: ffff88840be87fc2 RCX: 0000000000000007
[  117.942158] RDX: 0000000000000007 RSI: 0000000000000086 RDI: ffffffffa0debe7f
[  117.942160] RBP: ffff888403b54b50 R08: 0000000000001482 R09: 0000000000000004
[  117.942162] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8883eda7e540
[  117.942164] R13: dead000000000122 R14: dead000000000100 R15: ffff888403b3db80
[  117.942167] FS:  0000000000000000(0000) GS:ffff88840e4c0000(0000) knlGS:0000000000000000
[  117.942169] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  117.942172] CR2: ffffffffa0d5e040 CR3: 00000003e4c52002 CR4: 00000000001606e0
[  117.942174] Call Trace:
[  117.942188]  nf_tables_trans_destroy_work.cold+0xd/0x12 [nf_tables]
[  117.942196]  process_one_work+0x1d6/0x3b0
[  117.942200]  worker_thread+0x45/0x3c0
[  117.942203]  ? process_one_work+0x3b0/0x3b0
[  117.942210]  kthread+0x112/0x130
[  117.942214]  ? kthread_create_worker_on_cpu+0x40/0x40
[  117.942221]  ret_from_fork+0x35/0x40

nf_tables_chain_destroy() crashes on module_put() because the module is
gone.

Fixes: d164385ec5 ("netfilter: nat: add inet family nat support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-06 18:00:43 +01:00
Paolo Abeni
2398e3991b mptcp: always include dack if possible.
Currently passive MPTCP socket can skip including the DACK
option - if the peer sends data before accept() completes.

The above happens because the msk 'can_ack' flag is set
only after the accept() call.

Such missing DACK option may cause - as per RFC spec -
unwanted fallback to TCP.

This change addresses the issue using the key material
available in the current subflow, if any, to create a suitable
dack option when msk ack seq is not yet available.

v1 -> v2:
 - adavance the generated ack after the initial MPC packet

Fixes: d22f4988ff ("mptcp: process MP_CAPABLE data option")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-05 21:34:42 -08:00
Dan Carpenter
a3aefbfe45 net: nfc: fix bounds checking bugs on "pipe"
This is similar to commit 674d9de02a ("NFC: Fix possible memory
corruption when handling SHDLC I-Frame commands") and commit d7ee81ad09
("NFC: nci: Add some bounds checking in nci_hci_cmd_received()") which
added range checks on "pipe".

The "pipe" variable comes skb->data[0] in nfc_hci_msg_rx_work().
It's in the 0-255 range.  We're using it as the array index into the
hdev->pipes[] array which has NFC_HCI_MAX_PIPES (128) members.

Fixes: 118278f20a ("NFC: hci: Add pipes table to reference them with a tuple {gate, host}")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-05 21:32:42 -08:00
Petr Machata
aaca940807 net: sched: Make FIFO Qdisc offloadable
Invoke ndo_setup_tc() as appropriate to signal init / replacement,
destroying and dumping of pFIFO / bFIFO Qdisc.

A lot of the FIFO logic is used for pFIFO_head_drop as well, but that's a
semantically very different Qdisc that isn't really in the same boat as
pFIFO / bFIFO. Split some of the functions to keep the Qdisc intact.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-05 14:03:31 -08:00
Jakub Kicinski
95cddcb5cc ethtool: add infrastructure for centralized checking of coalescing parameters
Linux supports 22 different interrupt coalescing parameters.
No driver implements them all. Some drivers just ignore the
ones they don't support, while others have to carry a long
list of checks to reject unsupported settings.

To simplify the drivers add the ability to specify inside
ethtool_ops which parameters are supported and let the core
reject attempts to set any other one.

This commit makes the mechanism an opt-in, only drivers which
set ethtool_opts->coalesce_types to a non-zero value will have
the checks enforced.

The same mask is used for global and per queue settings.

v3: - move the (temporary) check if driver defines types
      earlier (Michal)
    - rename used_types -> nonzero_params, and
      coalesce_types -> supported_coalesce_params (Alex)
    - use EOPNOTSUPP instead of EINVAL (Andrew, Michal)

Leaving the long series of ifs for now, it seems nice to
be able to grep for the field and flag names. This will
probably have to be revisited once netlink support lands.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-05 12:12:34 -08:00
Taehee Yoo
56dc0a0eac hsr: fix refcnt leak of hsr slave interface
In the commit e0a4b99773 ("hsr: use upper/lower device infrastructure"),
dev_get() was removed but dev_put() in the error path wasn't removed.
So, if creating hsr interface command is failed, the reference counter leak
of lower interface would occur.

Test commands:
    ip link add dummy0 type dummy
    ip link add ipvlan0 link dummy0 type ipvlan mode l2
    ip link add ipvlan1 link dummy0 type ipvlan mode l2
    ip link add hsr0 type hsr slave1 ipvlan0 slave2 ipvlan1
    ip link del ipvlan0

Result:
[  633.271992][ T1280] unregister_netdevice: waiting for ipvlan0 to become free. Usage count = -1

Fixes: e0a4b99773 ("hsr: use upper/lower device infrastructure")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-05 11:59:47 -08:00
Florian Westphal
1d305ba40e netfilter: nf_tables: fix infinite loop when expr is not available
nft will loop forever if the kernel doesn't support an expression:

1. nft_expr_type_get() appends the family specific name to the module list.
2. -EAGAIN is returned to nfnetlink, nfnetlink calls abort path.
3. abort path sets ->done to true and calls request_module for the
   expression.
4. nfnetlink replays the batch, we end up in nft_expr_type_get() again.
5. nft_expr_type_get attempts to append family-specific name. This
   one already exists on the list, so we continue
6. nft_expr_type_get adds the generic expression name to the module
   list. -EAGAIN is returned, nfnetlink calls abort path.
7. abort path encounters the family-specific expression which
   has 'done' set, so it gets removed.
8. abort path requests the generic expression name, sets done to true.
9. batch is replayed.

If the expression could not be loaded, then we will end up back at 1),
because the family-specific name got removed and the cycle starts again.

Note that userspace can SIGKILL the nft process to stop the cycle, but
the desired behaviour is to return an error after the generic expr name
fails to load the expression.

Fixes: eb014de4fd ("netfilter: nf_tables: autoload modules from the abort path")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-05 12:56:33 +01:00
Pablo Neira Ayuso
d78008de61 netfilter: nf_tables: dump NFTA_CHAIN_FLAGS attribute
Missing NFTA_CHAIN_FLAGS netlink attribute when dumping basechain
definitions.

Fixes: c9626a2cbd ("netfilter: nf_tables: add hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-05 11:35:57 +01:00
Vladimir Oltean
69df578c5f net: mscc: ocelot: eliminate confusion between CPU and NPI port
Ocelot has the concept of a CPU port. The CPU port is represented in the
forwarding and the queueing system, but it is not a physical device. The
CPU port can either be accessed via register-based injection/extraction
(which is the case of Ocelot), via Frame-DMA (similar to the first one),
or "connected" to a physical Ethernet port (called NPI in the datasheet)
which is the case of the Felix DSA switch.

In Ocelot the CPU port is at index 11.
In Felix the CPU port is at index 6.

The CPU bit is treated special in the forwarding, as it is never cleared
from the forwarding port mask (once added to it). Other than that, it is
treated the same as a normal front port.

Both Felix and Ocelot should use the CPU port in the same way. This
means that Felix should not use the NPI port directly when forwarding to
the CPU, but instead use the CPU port.

This patch is fixing this such that Felix will use port 6 as its CPU
port, and just use the NPI port to carry the traffic.

Therefore, eliminate the "ocelot->cpu" variable which was holding the
index of the NPI port for Felix, and the index of the CPU port module
for Ocelot, so the variable was actually configuring different things
for different drivers and causing at least part of the confusion.

Also remove the "ocelot->num_cpu_ports" variable, which is the result of
another confusion. The 2 CPU ports mentioned in the datasheet are
because there are two frame extraction channels (register based or DMA
based). This is of no relevance to the driver at the moment, and
invisible to the analyzer module.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Suggested-by: Allan W. Nielsen <allan.nielsen@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-04 14:19:00 -08:00
KP Singh
3d08b6f29c bpf: Add selftests for BPF_MODIFY_RETURN
Test for two scenarios:

  * When the fmod_ret program returns 0, the original function should
    be called along with fentry and fexit programs.
  * When the fmod_ret program returns a non-zero value, the original
    function should not be called, no side effect should be observed and
    fentry and fexit programs should be called.

The result from the kernel function call and whether a side-effect is
observed is returned via the retval attr of the BPF_PROG_TEST_RUN (bpf)
syscall.

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200304191853.1529-8-kpsingh@chromium.org
2020-03-04 13:41:06 -08:00
KP Singh
da00d2f117 bpf: Add test ops for BPF_PROG_TYPE_TRACING
The current fexit and fentry tests rely on a different program to
exercise the functions they attach to. Instead of doing this, implement
the test operations for tracing which will also be used for
BPF_MODIFY_RETURN in a subsequent patch.

Also, clean up the fexit test to use the generated skeleton.

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200304191853.1529-7-kpsingh@chromium.org
2020-03-04 13:41:06 -08:00
Leslie Monis
105e808c1d pie: remove pie_vars->accu_prob_overflows
The variable pie_vars->accu_prob is used as an accumulator for
probability values. Since probabilty values are scaled using the
MAX_PROB macro denoting (2^64 - 1), pie_vars->accu_prob is
likely to overflow as it is of type u64.

The variable pie_vars->accu_prob_overflows counts the number of
times the variable pie_vars->accu_prob overflows.

The MAX_PROB macro needs to be equal to at least (2^39 - 1) in
order to do precise calculations without any underflow. Thus
MAX_PROB can be reduced to (2^56 - 1) without affecting the
precision in calculations drastically. Doing so will eliminate
the need for the variable pie_vars->accu_prob_overflows as the
variable pie_vars->accu_prob will never overflow.

Removing the variable pie_vars->accu_prob_overflows also reduces
the size of the structure pie_vars to exactly 64 bytes.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-04 13:25:55 -08:00
Leslie Monis
220d4ac74e pie: remove unnecessary type casting
In function pie_calculate_probability(), the variables alpha and
beta are of type u64. The variables qdelay, qdelay_old and
params->target are of type psched_time_t (which is also u64).
The explicit type casting done when calculating the value for
the variable delta is redundant and not required.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-04 13:25:55 -08:00
Leslie Monis
90baeb9dd2 pie: use term backlog instead of qlen
Remove ambiguity by using the term backlog instead of qlen when
representing the queue length in bytes.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-04 13:25:55 -08:00
Paul Blakey
4cc5fdec6d net/sched: act_ct: Use pskb_network_may_pull()
To make the filler functions more generic, use network
relative skb pulling.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-04 11:05:04 -08:00
Paul Blakey
07ac9d16b4 net/sched: act_ct: Fix ipv6 lookup of offloaded connections
When checking the protocol number tcf_ct_flow_table_lookup() handles
the flow as if it's always ipv4, while it can be ipv6.

Instead, refactor the code to fetch the tcp header, if available,
in the relevant family (ipv4/ipv6) filler function, and do the
check on the returned tcp header.

Fixes: 46475bb20f ("net/sched: act_ct: Software offload of established flows")
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-04 11:05:04 -08:00
Xin Long
d1d17a359c esp: remove the skb from the chain when it's enqueued in cryptd_wq
Xiumei found a panic in esp offload:

  BUG: unable to handle kernel NULL pointer dereference at 0000000000000020
  RIP: 0010:esp_output_done+0x101/0x160 [esp4]
  Call Trace:
   ? esp_output+0x180/0x180 [esp4]
   cryptd_aead_crypt+0x4c/0x90
   cryptd_queue_worker+0x6e/0xa0
   process_one_work+0x1a7/0x3b0
   worker_thread+0x30/0x390
   ? create_worker+0x1a0/0x1a0
   kthread+0x112/0x130
   ? kthread_flush_work_fn+0x10/0x10
   ret_from_fork+0x35/0x40

It was caused by that skb secpath is used in esp_output_done() after it's
been released elsewhere.

The tx path for esp offload is:

  __dev_queue_xmit()->
    validate_xmit_skb_list()->
      validate_xmit_xfrm()->
        esp_xmit()->
          esp_output_tail()->
            aead_request_set_callback(esp_output_done) <--[1]
            crypto_aead_encrypt()  <--[2]

In [1], .callback is set, and in [2] it will trigger the worker schedule,
later on a kernel thread will call .callback(esp_output_done), as the call
trace shows.

But in validate_xmit_xfrm():

  skb_list_walk_safe(skb, skb2, nskb) {
    ...
    err = x->type_offload->xmit(x, skb2, esp_features);  [esp_xmit]
    ...
  }

When the err is -EINPROGRESS, which means this skb2 will be enqueued and
later gets encrypted and sent out by .callback later in a kernel thread,
skb2 should be removed fromt skb chain. Otherwise, it will get processed
again outside validate_xmit_xfrm(), which could release skb secpath, and
cause the panic above.

This patch is to remove the skb from the chain when it's enqueued in
cryptd_wq. While at it, remove the unnecessary 'if (!skb)' check.

Fixes: 3dca3f38cf ("xfrm: Separate ESP handling from segmentation for GRO packets.")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-03-04 12:41:29 +01:00
Alain Michaud
08bb4da901 Bluetooth: guard against controllers sending zero'd events
Some controllers have been observed to send zero'd events under some
conditions.  This change guards against this condition as well as adding
a trace to facilitate diagnosability of this condition.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-04 08:27:22 +01:00
Vladimir Oltean
ed11bb1f96 net: dsa: Add bypass operations for the flower classifier-action filter
Due to the immense variety of classification keys and actions available
for tc-flower, as well as due to potentially very different DSA switch
capabilities, it doesn't make a lot of sense for the DSA mid layer to
even attempt to interpret these. So just pass them on to the underlying
switch driver.

DSA implements just the standard boilerplate for binding and unbinding
flow blocks to ports, since nobody wants to deal with that.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 18:57:49 -08:00
Bjorn Andersson
71046abfff net: qrtr: Fix FIXME related to qrtr_ns_init()
The 2 second delay before calling qrtr_ns_init() meant that the remote
processors would register as endpoints in qrtr and the say_hello() call
would therefor broadcast the outgoing HELLO to them. With the HELLO
handshake corrected this delay is no longer needed.

Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 17:52:21 -08:00
Bjorn Andersson
a1dc1d6a05 net: qrtr: Respond to HELLO message
Lost in the translation from the user space implementation was the
detail that HELLO mesages must be exchanged between each node pair.  As
such the incoming HELLO must be replied to.

Similar to the previous implementation no effort is made to prevent two
Linux boxes from continuously sending HELLO messages back and forth,
this is left to a follow up patch.

say_hello() is moved, to facilitate the new call site.

Fixes: 0c2204a4ad ("net: qrtr: Migrate nameservice to kernel from userspace")
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Tested-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 17:52:21 -08:00
Jakub Kicinski
88a637719a netfilter: nft_tunnel: add missing attribute validation for tunnels
Add missing attribute validation for tunnel source and
destination ports to the netlink policy.

Fixes: af308b94a2 ("netfilter: nf_tables: add tunnel support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-04 02:28:25 +01:00
Jakub Kicinski
9d6effb2f1 netfilter: nft_payload: add missing attribute validation for payload csum flags
Add missing attribute validation for NFTA_PAYLOAD_CSUM_FLAGS
to the netlink policy.

Fixes: 1814096980 ("netfilter: nft_payload: layer 4 checksum adjustment for pseudoheader fields")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-04 02:28:17 +01:00
Jakub Kicinski
c049b34500 netfilter: cthelper: add missing attribute validation for cthelper
Add missing attribute validation for cthelper
to the netlink policy.

Fixes: 12f7a50533 ("netfilter: add user-space connection tracking helper infrastructure")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-04 02:28:08 +01:00
Florian Westphal
2d285f26ec netfilter: nf_tables: free flowtable hooks on hook register error
If hook registration fails, the hooks allocated via nft_netdev_hook_alloc
need to be freed.

We can't change the goto label to 'goto 5' -- while it does fix the memleak
it does cause a warning splat from the netfilter core (the hooks were not
registered).

Fixes: 3f0465a9ef ("netfilter: nf_tables: dynamically allocate hooks per net_device in flowtables")
Reported-by: syzbot+a2ff6fa45162a5ed4dd3@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-04 02:27:18 +01:00
Vasily Averin
ee84f19cbb netfilter: x_tables: xt_mttg_seq_next should increase position index
If .next function does not change position index,
following .show function will repeat output related
to current position index.

Without patch:
 # dd if=/proc/net/ip_tables_matches  # original file output
 conntrack
 conntrack
 conntrack
 recent
 recent
 icmp
 udplite
 udp
 tcp
 0+1 records in
 0+1 records out
 65 bytes copied, 5.4074e-05 s, 1.2 MB/s

 # dd if=/proc/net/ip_tables_matches bs=62 skip=1
 dd: /proc/net/ip_tables_matches: cannot skip to specified offset
 cp   <<< end of  last line
 tcp  <<< and then unexpected whole last line once again
 0+1 records in
 0+1 records out
 7 bytes copied, 0.000102447 s, 68.3 kB/s

Cc: stable@vger.kernel.org
Fixes: 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code ...")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-04 02:27:18 +01:00
Vasily Averin
db25517a55 netfilter: xt_recent: recent_seq_next should increase position index
If .next function does not change position index,
following .show function will repeat output related
to current position index.

Without the patch:
 # dd if=/proc/net/xt_recent/SSH # original file outpt
 src=127.0.0.4 ttl: 0 last_seen: 6275444819 oldest_pkt: 1 6275444819
 src=127.0.0.2 ttl: 0 last_seen: 6275438906 oldest_pkt: 1 6275438906
 src=127.0.0.3 ttl: 0 last_seen: 6275441953 oldest_pkt: 1 6275441953
 0+1 records in
 0+1 records out
 204 bytes copied, 6.1332e-05 s, 3.3 MB/s

Read after lseek into middle of last line (offset 140 in example below)
generates expected end of last line and then unexpected whole last line
once again

 # dd if=/proc/net/xt_recent/SSH bs=140 skip=1
 dd: /proc/net/xt_recent/SSH: cannot skip to specified offset
 127.0.0.3 ttl: 0 last_seen: 6275441953 oldest_pkt: 1 6275441953
 src=127.0.0.3 ttl: 0 last_seen: 6275441953 oldest_pkt: 1 6275441953
 0+1 records in
 0+1 records out
 132 bytes copied, 6.2487e-05 s, 2.1 MB/s

Cc: stable@vger.kernel.org
Fixes: 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code ...")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-04 02:27:18 +01:00
Vasily Averin
bb71f846a0 netfilter: synproxy: synproxy_cpu_seq_next should increase position index
If .next function does not change position index,
following .show function will repeat output related
to current position index.

Cc: stable@vger.kernel.org
Fixes: 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code ...")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-04 02:27:18 +01:00
Vasily Averin
dc15af8e9d netfilter: nf_conntrack: ct_cpu_seq_next should increase position index
If .next function does not change position index,
following .show function will repeat output related
to current position index.

Cc: stable@vger.kernel.org
Fixes: 1f4aace60b ("fs/seq_file.c: simplify seq_file iteration code ...")
Link: https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-04 02:27:18 +01:00
Mat Martineau
6d37a0b857 mptcp: Only send DATA_FIN with final mapping
When a DATA_FIN is sent in a MPTCP DSS option that contains a data
mapping, the DATA_FIN consumes one byte of space in the mapping. In this
case, the DATA_FIN should only be included in the DSS option if its
sequence number aligns with the end of the mapped data. Otherwise the
subflow can send an incorrect implicit sequence number for the DATA_FIN,
and the DATA_ACK for that sequence number would not close the
MPTCP-level connection correctly.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 17:01:43 -08:00
Mat Martineau
76c42a29c0 mptcp: Use per-subflow storage for DATA_FIN sequence number
Instead of reading the MPTCP-level sequence number when sending DATA_FIN,
store the data in the subflow so it can be safely accessed when the
subflow TCP headers are written to the packet without the MPTCP-level
lock held. This also allows the MPTCP-level socket to close individual
subflows without closing the MPTCP connection.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 17:01:43 -08:00
Mat Martineau
1954b86016 mptcp: Check connection state before attempting send
MPTCP should wait for an active connection or skip sending depending on
the connection state, as TCP does. This happens before the possible
passthrough to a regular TCP sendmsg because the subflow's socket type
(MPTCP or TCP fallback) is not known until the connection is
complete. This is also relevent at disconnect time, where data should
not be sent in certain MPTCP-level connection states.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 17:01:42 -08:00
Willem de Bruijn
cf62089b0e bpf: Add gso_size to __sk_buff
BPF programs may want to know whether an skb is gso. The canonical
answer is skb_is_gso(skb), which tests that gso_size != 0.

Expose this field in the same manner as gso_segs. That field itself
is not a sufficient signal, as the comment in skb_shared_info makes
clear: gso_segs may be zero, e.g., from dodgy sources.

Also prepare net/bpf/test_run for upcoming BPF_PROG_TEST_RUN tests
of the feature.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200303200503.226217-2-willemdebruijn.kernel@gmail.com
2020-03-03 16:23:59 -08:00
Russell King
8640f8dc6d net: dsa: fix phylink_start()/phylink_stop() calls
Place phylink_start()/phylink_stop() inside dsa_port_enable() and
dsa_port_disable(), which ensures that we call phylink_stop() before
tearing down phylink - which is a documented requirement.  Failure
to do so can cause use-after-free bugs.

Fixes: 0e27921816 ("net: dsa: Use PHYLINK for the CPU/DSA ports")
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 15:45:49 -08:00
Parav Pandit
acf1ee44ca devlink: Introduce devlink port flavour virtual
Currently mlx5 PCI PF and VF devlink devices register their ports as
physical port in non-representors mode.

Introduce a new port flavour as virtual so that virtual devices can
register 'virtual' flavour to make it more clear to users.

An example of one PCI PF and 2 PCI virtual functions, each having
one devlink port.

$ devlink port show
pci/0000:06:00.0/1: type eth netdev ens2f0 flavour physical port 0
pci/0000:06:00.2/1: type eth netdev ens2f2 flavour virtual port 0
pci/0000:06:00.3/1: type eth netdev ens2f3 flavour virtual port 0

Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 15:40:40 -08:00
Paul Blakey
46475bb20f net/sched: act_ct: Software offload of established flows
Offload nf conntrack processing by looking up the 5-tuple in the
zone's flow table.

The nf conntrack module will process the packets until a connection is
in established state. Once in established state, the ct state pointer
(nf_conn) will be restored on the skb from a successful ft lookup.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 15:09:13 -08:00
Paul Blakey
64ff70b80f net/sched: act_ct: Offload established connections to flow table
Add a ft entry when connections enter an established state and delete
the connections when they leave the established state.

The flow table assumes ownership of the connection. In the following
patch act_ct will lookup the ct state from the FT. In future patches,
drivers will register for callbacks for ft add/del events and will be
able to use the information to offload the connections.

Note that connection aging is managed by the FT.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 15:09:12 -08:00
Paul Blakey
c34b961a24 net/sched: act_ct: Create nf flow table per zone
Use the NF flow tables infrastructure for CT offload.

Create a nf flow table per zone.

Next patches will add FT entries to this table, and do
the software offload.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 15:09:12 -08:00
Cambda Zhu
d2f7e56d1e ipv6: Use math to point per net sysctls into the appropriate struct net
The data pointers of ipv6 sysctl are set one by one which is hard to
maintain, especially with kconfig. This patch simplifies it by using
math to point the per net sysctls into the appropriate struct net,
just like what we did for ipv4.

Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 14:50:08 -08:00
Hangbin Liu
d0098e4c6b net/ipv6: remove the old peer route if change it to a new one
When we modify the peer route and changed it to a new one, we should
remove the old route first. Before the fix:

+ ip addr add dev dummy1 2001:db8::1 peer 2001:db8::2
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 256 pref medium
2001:db8::2 proto kernel metric 256 pref medium
+ ip addr change dev dummy1 2001:db8::1 peer 2001:db8::3
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 256 pref medium
2001:db8::2 proto kernel metric 256 pref medium

After the fix:
+ ip addr change dev dummy1 2001:db8::1 peer 2001:db8::3
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 256 pref medium
2001:db8::3 proto kernel metric 256 pref medium

This patch depend on the previous patch "net/ipv6: need update peer route
when modify metric" to update new peer route after delete old one.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 14:43:16 -08:00
Hangbin Liu
617940123e net/ipv6: need update peer route when modify metric
When we modify the route metric, the peer address's route need also
be updated. Before the fix:

+ ip addr add dev dummy1 2001:db8::1 peer 2001:db8::2 metric 60
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 60 pref medium
2001:db8::2 proto kernel metric 60 pref medium
+ ip addr change dev dummy1 2001:db8::1 peer 2001:db8::2 metric 61
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 61 pref medium
2001:db8::2 proto kernel metric 60 pref medium

After the fix:
+ ip addr change dev dummy1 2001:db8::1 peer 2001:db8::2 metric 61
+ ip -6 route show dev dummy1
2001:db8::1 proto kernel metric 61 pref medium
2001:db8::2 proto kernel metric 61 pref medium

Fixes: 8308f3ff17 ("net/ipv6: Add support for specifying metric of connected routes")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 14:43:16 -08:00
Jakub Kicinski
6ba3da4465 nfc: add missing attribute validation for vendor subcommand
Add missing attribute validation for vendor subcommand attributes
to the netlink policy.

Fixes: 9e58095f96 ("NFC: netlink: Implement vendor command support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:49 -08:00
Jakub Kicinski
88e706d516 nfc: add missing attribute validation for deactivate target
Add missing attribute validation for NFC_ATTR_TARGET_INDEX
to the netlink policy.

Fixes: 4d63adfe12 ("NFC: Add NFC_CMD_DEACTIVATE_TARGET support")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:49 -08:00
Jakub Kicinski
361d23e41c nfc: add missing attribute validation for SE API
Add missing attribute validation for NFC_ATTR_SE_INDEX
to the netlink policy.

Fixes: 5ce3f32b52 ("NFC: netlink: SE API implementation")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:49 -08:00
Jakub Kicinski
213320a679 tipc: add missing attribute validation for MTU property
Add missing attribute validation for TIPC_NLA_PROP_MTU
to the netlink policy.

Fixes: 901271e040 ("tipc: implement configuration of UDP media MTU")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:49 -08:00
Jakub Kicinski
e13aaa0643 net: taprio: add missing attribute validation for txtime delay
Add missing attribute validation for TCA_TAPRIO_ATTR_TXTIME_DELAY
to the netlink policy.

Fixes: 4cfd5779bd ("taprio: Add support for txtime-assist mode")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:48 -08:00
Jakub Kicinski
7e6dc03eeb net: fq: add missing attribute validation for orphan mask
Add missing attribute validation for TCA_FQ_ORPHAN_MASK
to the netlink policy.

Fixes: 06eb395fa9 ("pkt_sched: fq: better control of DDOS traffic")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:48 -08:00
Jakub Kicinski
b5ab1f1be6 openvswitch: add missing attribute validation for hash
Add missing attribute validation for OVS_PACKET_ATTR_HASH
to the netlink policy.

Fixes: bd1903b7c4 ("net: openvswitch: add hash info to upcall")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Greg Rose <gvrose8192@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:48 -08:00
Jakub Kicinski
b60673c4c4 nl802154: add missing attribute validation for dev_type
Add missing attribute type validation for IEEE802154_ATTR_DEV_TYPE
to the netlink policy.

Fixes: 90c049b2c6 ("ieee802154: interface type to be added")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:48 -08:00
Jakub Kicinski
9322cd7c4a nl802154: add missing attribute validation
Add missing attribute validation for several u8 types.

Fixes: 2c21d11518 ("net: add NL802154 interface for configuration of 802.15.4 devices")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:48 -08:00
Jakub Kicinski
ff3b63b8c2 devlink: validate length of region addr/len
DEVLINK_ATTR_REGION_CHUNK_ADDR and DEVLINK_ATTR_REGION_CHUNK_LEN
lack entries in the netlink policy. Corresponding nla_get_u64()s
may read beyond the end of the message.

Fixes: 4e54795a27 ("devlink: Add support for region snapshot read command")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:48 -08:00
Jakub Kicinski
8750939b6a devlink: validate length of param values
DEVLINK_ATTR_PARAM_VALUE_DATA may have different types
so it's not checked by the normal netlink policy. Make
sure the attribute length is what we expect.

Fixes: e3b7ca18ad ("devlink: Add param set command")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:48 -08:00
Niklas Söderlund
3eb30c51a6 Documentation: nfsroot.rst: Fix references to nfsroot.rst
When converting and moving nfsroot.txt to nfsroot.rst the references to
the old text file was not updated to match the change, fix this.

Fixes: f9a9349846 ("Documentation: nfsroot.txt: convert to ReST")
Signed-off-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se>
Reviewed-by: Geert Uytterhoeven <geert+renesas@glider.be>
Link: https://lore.kernel.org/r/20200212181332.520545-1-niklas.soderlund+renesas@ragnatech.se
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2020-03-02 13:11:46 -07:00
Hangbin Liu
07758eb9ff net/ipv6: use configured metric when add peer route
When we add peer address with metric configured, IPv4 could set the dest
metric correctly, but IPv6 do not. e.g.

]# ip addr add 192.0.2.1 peer 192.0.2.2/32 dev eth1 metric 20
]# ip route show dev eth1
192.0.2.2 proto kernel scope link src 192.0.2.1 metric 20
]# ip addr add 2001:db8::1 peer 2001:db8::2/128 dev eth1 metric 20
]# ip -6 route show dev eth1
2001:db8::1 proto kernel metric 20 pref medium
2001:db8::2 proto kernel metric 256 pref medium

Fix this by using configured metric instead of default one.

Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: 8308f3ff17 ("net/ipv6: Add support for specifying metric of connected routes")
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:55:55 -08:00
Cris Forno
70ae1e127b ethtool: Factored out similar ethtool link settings for virtual devices to core
Three virtual devices (ibmveth, virtio_net, and netvsc) all have
similar code to set link settings and validate ethtool command. To
eliminate duplication of code, it is factored out into core/ethtool.c.

Signed-off-by: Cris Forno <cforno12@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:48:54 -08:00
Taehee Yoo
e0a4b99773 hsr: use upper/lower device infrastructure
netdev_upper_dev_link() is useful to manage lower/upper interfaces.
And this function internally validates looping, maximum depth.
All or most virtual interfaces that could have a real interface
(e.g. macsec, macvlan, ipvlan etc.) use lower/upper infrastructure.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:37:03 -08:00
Taehee Yoo
81390d0c4e hsr: remove unnecessary rcu_read_lock() in hsr module
In order to access the port list, the hsr_port_get_hsr() is used.
And this is protected by RTNL and RCU.
The hsr_fill_info(), hsr_check_carrier(), hsr_dev_open() and
hsr_get_max_mtu() are protected by RTNL.
So, rcu_read_lock() in these functions are not necessary.
The hsr_handle_frame() also uses rcu_read_lock() but this function
is called by packet path.
It's already protected by RCU.
So, the rcu_read_lock() in hsr_handle_frame() can be removed.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:37:03 -08:00
Taehee Yoo
4b793acdca hsr: use netdev_err() instead of WARN_ONCE()
When HSR interface is sending a frame, it finds a node with
the destination ethernet address from the list.
If there is no node, it calls WARN_ONCE().
But, using WARN_ONCE() for this situation is a little bit overdoing.
So, in this patch, the netdev_err() is used instead.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:37:03 -08:00
Taehee Yoo
13eeb5fea6 hsr: use extack error message instead of netdev_info
If HSR uses the extack instead of netdev_info(), users can get
error messages immediately without any checking the kernel message.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:37:03 -08:00
Taehee Yoo
f3f2f98470 hsr: use debugfs_remove_recursive() instead of debugfs_remove()
If it uses debugfs_remove_recursive() instead of debugfs_remove(),
hsr_priv() doesn't need to have "node_tbl_file" pointer variable.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:37:03 -08:00
Gustavo A. R. Silva
b90feaff2a net: sched: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:27:02 -08:00
Pablo Neira Ayuso
84b3268027 netlink: Use netlink header as base to calculate bad attribute offset
Userspace might send a batch that is composed of several netlink
messages. The netlink_ack() function must use the pointer to the netlink
header as base to calculate the bad attribute offset.

Fixes: 2d4bc93368 ("netlink: extended ACK reporting")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:21:23 -08:00
David S. Miller
9f0ca0c1a5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-02-28

The following pull-request contains BPF updates for your *net-next* tree.

We've added 41 non-merge commits during the last 7 day(s) which contain
a total of 49 files changed, 1383 insertions(+), 499 deletions(-).

The main changes are:

1) BPF and Real-Time nicely co-exist.

2) bpftool feature improvements.

3) retrieve bpf_sk_storage via INET_DIAG.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 15:53:35 -08:00
Paolo Abeni
e427cad6ee net: datagram: drop 'destructor' argument from several helpers
The only users for such argument are the UDP protocol and the UNIX
socket family. We can safely reclaim the accounted memory directly
from the UDP code and, after the previous patch, we can do scm
stats accounting outside the datagram helpers.

Overall this cleans up a bit some datagram-related helpers, and
avoids an indirect call per packet in the UDP receive path.

v1 -> v2:
 - call scm_stat_del() only when not peeking - Kirill
 - fix build issue with CONFIG_INET_ESPINTCP

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28 12:12:53 -08:00
Paolo Abeni
7782040b95 unix: uses an atomic type for scm files accounting
So the scm_stat_{add,del} helper can be invoked with no
additional lock held.

This clean-up the code a bit and will make the next
patch easier.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28 12:12:53 -08:00
Gustavo A. R. Silva
d2afb41ae6 net: core: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28 12:08:37 -08:00
Gustavo A. R. Silva
b0c9a2d9a8 ipv6: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28 12:08:37 -08:00
Gustavo A. R. Silva
8402a31dd8 net: dccp: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28 12:08:37 -08:00
Gustavo A. R. Silva
af71b090c8 l2tp: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

Lastly, fix the following checkpatch warning:
CHECK: Prefer kernel type 'u8' over 'uint8_t'
#50: FILE: net/l2tp/l2tp_core.h:119:
+	uint8_t			priv[];	/* private data */

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28 12:08:37 -08:00
Gustavo A. R. Silva
680a93166e net: mpls: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28 12:08:37 -08:00
Gustavo A. R. Silva
95e486f551 xdp: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28 12:08:37 -08:00
Alain Michaud
4b127bd5f2 Bluetooth: Support querying for WBS support through MGMT
This patch provides a mechanism for MGMT interface client to query the
capability of the controller to support WBS.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-28 08:53:05 +01:00
Sathish Narsimman
c3bed4de5d Bluetooth: During le_conn_timeout disable EXT_ADV
Disabling LE_LEGACY_ADV when LE_EXT_ADV is enabled causes 'command
disallowed . This patch fixes that issue and disables EXT_ADV if
enabled.

Signed-off-by: Sathish Narsimman <sathish.narasimman@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-28 08:40:06 +01:00
Sathish Narsimman
a9e45698b3 Bluetooth: Remove adv set for directed advertising
Extended advertising Data is set during bluetooth initialization
by default which causes InvalidHCICommandParameters when setting
Extended advertising parameters.

As per Core Spec 5.2 Vol 2, PART E, Sec 7.8.53, for
advertising_event_property LE_LEGACY_ADV_DIRECT_IND does not
supports advertising data when the advertising set already
contains some, the controller shall return erroc code
'InvalidHCICommandParameters(0x12).

So it is required to remove adv set for handle 0x00. since we use
instance 0 for directed adv.

Signed-off-by: Sathish Narsimman <sathish.narasimman@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-28 08:38:05 +01:00
Madhuparna Bhowmik
0c2ac7d4f0 Bluetooth: Use list_for_each_entry_rcu() to traverse RCU list in RCU read-side CS
In function hci_is_blocked_key() RCU list is traversed with
list_for_each_entry() in RCU read-side CS.
Use list_for_each_entry_rcu() instead.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-28 08:35:28 +01:00
Madhuparna Bhowmik
d7d41682ef Bluetooth: Fix Suspicious RCU usage warnings
The following functions in hci_core are always called with
hdev->lock held. No need to use list_for_each_entry_rcu(), therefore
change the usage of list_for_each_entry_rcu() in these functions
to list_for_each_entry().

hci_link_keys_clear()
hci_smp_ltks_clear()
hci_smp_irks_clear()
hci_blocked_keys_clear()

Warning encountered with CONFIG_PROVE_RCU_LIST:

[   72.213184] =============================
[   72.213188] WARNING: suspicious RCU usage
[   72.213192] 5.6.0-rc1+ #5 Not tainted
[   72.213195] -----------------------------
[   72.213198] net/bluetooth/hci_core.c:2288 RCU-list traversed in non-reader section!!

[   72.213676] =============================
[   72.213679] WARNING: suspicious RCU usage
[   72.213683] 5.6.0-rc1+ #5 Not tainted
[   72.213685] -----------------------------
[   72.213689] net/bluetooth/hci_core.c:2298 RCU-list traversed in non-reader section!!

[   72.214195] =============================
[   72.214198] WARNING: suspicious RCU usage
[   72.214201] 5.6.0-rc1+ #5 Not tainted
[   72.214204] -----------------------------
[   72.214208] net/bluetooth/hci_core.c:2308 RCU-list traversed in non-reader section!!

[  333.456972] =============================
[  333.456979] WARNING: suspicious RCU usage
[  333.457001] 5.6.0-rc1+ #5 Not tainted
[  333.457007] -----------------------------
[  333.457014] net/bluetooth/hci_core.c:2318 RCU-list traversed in non-reader section!!

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-28 08:35:28 +01:00
Gustavo A. R. Silva
a7e454542b Bluetooth: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-28 08:30:02 +01:00
Martin KaFai Lau
085c20cacf bpf: inet_diag: Dump bpf_sk_storages in inet_diag_dump()
This patch will dump out the bpf_sk_storages of a sk
if the request has the INET_DIAG_REQ_SK_BPF_STORAGES nlattr.

An array of SK_DIAG_BPF_STORAGE_REQ_MAP_FD can be specified in
INET_DIAG_REQ_SK_BPF_STORAGES to select which bpf_sk_storage to dump.
If no map_fd is specified, all bpf_sk_storages of a sk will be dumped.

bpf_sk_storages can be added to the system at runtime.  It is difficult
to find a proper static value for cb->min_dump_alloc.

This patch learns the nlattr size required to dump the bpf_sk_storages
of a sk.  If it happens to be the very first nlmsg of a dump and it
cannot fit the needed bpf_sk_storages,  it will try to expand the
skb by "pskb_expand_head()".

Instead of expanding it in inet_sk_diag_fill(), it is expanded at a
sleepable context in __inet_diag_dump() so __GFP_DIRECT_RECLAIM can
be used.  In __inet_diag_dump(), it will retry as long as the
skb is empty and the cb->min_dump_alloc becomes larger than before.
cb->min_dump_alloc is bounded by KMALLOC_MAX_SIZE.  The min_dump_alloc
is also changed from 'u16' to 'u32' to accommodate a sk that may have
a few large bpf_sk_storages.

The updated cb->min_dump_alloc will also be used to allocate the skb in
the next dump.  This logic already exists in netlink_dump().

Here is the sample output of a locally modified 'ss' and it could be made
more readable by using BTF later:
[root@arch-fb-vm1 ~]# ss --bpf-map-id 14 --bpf-map-id 13 -t6an 'dst [::1]:8989'
State Recv-Q Send-Q Local Address:Port  Peer Address:PortProcess
ESTAB 0      0              [::1]:51072        [::1]:8989
	 bpf_map_id:14 value:[ 3feb ]
	 bpf_map_id:13 value:[ 3f ]
ESTAB 0      0              [::1]:51070        [::1]:8989
	 bpf_map_id:14 value:[ 3feb ]
	 bpf_map_id:13 value:[ 3f ]

[root@arch-fb-vm1 ~]# ~/devshare/github/iproute2/misc/ss --bpf-maps -t6an 'dst [::1]:8989'
State         Recv-Q         Send-Q                   Local Address:Port                    Peer Address:Port         Process
ESTAB         0              0                                [::1]:51072                          [::1]:8989
	 bpf_map_id:14 value:[ 3feb ]
	 bpf_map_id:13 value:[ 3f ]
	 bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ]
ESTAB         0              0                                [::1]:51070                          [::1]:8989
	 bpf_map_id:14 value:[ 3feb ]
	 bpf_map_id:13 value:[ 3f ]
	 bpf_map_id:12 value:[ 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000... total:65407 ]

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200225230427.1976129-1-kafai@fb.com
2020-02-27 18:50:19 -08:00
Martin KaFai Lau
1ed4d92458 bpf: INET_DIAG support in bpf_sk_storage
This patch adds INET_DIAG support to bpf_sk_storage.

1. Although this series adds bpf_sk_storage diag capability to inet sk,
   bpf_sk_storage is in general applicable to all fullsock.  Hence, the
   bpf_sk_storage logic will operate on SK_DIAG_* nlattr.  The caller
   will pass in its specific nesting nlattr (e.g. INET_DIAG_*) as
   the argument.

2. The request will be like:
	INET_DIAG_REQ_SK_BPF_STORAGES (nla_nest) (defined in latter patch)
		SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
		SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
		......

   Considering there could have multiple bpf_sk_storages in a sk,
   instead of reusing INET_DIAG_INFO ("ss -i"),  the user can select
   some specific bpf_sk_storage to dump by specifying an array of
   SK_DIAG_BPF_STORAGE_REQ_MAP_FD.

   If no SK_DIAG_BPF_STORAGE_REQ_MAP_FD is specified (i.e. an empty
   INET_DIAG_REQ_SK_BPF_STORAGES), it will dump all bpf_sk_storages
   of a sk.

3. The reply will be like:
	INET_DIAG_BPF_SK_STORAGES (nla_nest) (defined in latter patch)
		SK_DIAG_BPF_STORAGE (nla_nest)
			SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
			SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
		SK_DIAG_BPF_STORAGE (nla_nest)
			SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
			SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
		......

4. Unlike other INET_DIAG info of a sk which is pretty static, the size
   required to dump the bpf_sk_storage(s) of a sk is dynamic as the
   system adding more bpf_sk_storage_map.  It is hard to set a static
   min_dump_alloc size.

   Hence, this series learns it at the runtime and adjust the
   cb->min_dump_alloc as it iterates all sk(s) of a system.  The
   "unsigned int *res_diag_size" in bpf_sk_storage_diag_put()
   is for this purpose.

   The next patch will update the cb->min_dump_alloc as it
   iterates the sk(s).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200225230421.1975729-1-kafai@fb.com
2020-02-27 18:50:19 -08:00
Martin KaFai Lau
0df6d32842 inet_diag: Move the INET_DIAG_REQ_BYTECODE nlattr to cb->data
The INET_DIAG_REQ_BYTECODE nlattr is currently re-found every time when
the "dump()" is re-started.

In a latter patch, it will also need to parse the new
INET_DIAG_REQ_SK_BPF_STORAGES nlattr to learn the map_fds. Thus, this
patch takes this chance to store the parsed nlattr in cb->data
during the "start" time of a dump.

By doing this, the "bc" argument also becomes unnecessary
and is removed.  Also, the two copies of the INET_DIAG_REQ_BYTECODE
parsing-audit logic between compat/current version can be
consolidated to one.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200225230415.1975555-1-kafai@fb.com
2020-02-27 18:50:19 -08:00
Martin KaFai Lau
5682d393b4 inet_diag: Refactor inet_sk_diag_fill(), dump(), and dump_one()
In a latter patch, there is a need to update "cb->min_dump_alloc"
in inet_sk_diag_fill() as it learns the diffierent bpf_sk_storages
stored in a sk while dumping all sk(s) (e.g. tcp_hashinfo).

The inet_sk_diag_fill() currently does not take the "cb" as an argument.
One of the reason is inet_sk_diag_fill() is used by both dump_one()
and dump() (which belong to the "struct inet_diag_handler".  The dump_one()
interface does not pass the "cb" along.

This patch is to make dump_one() pass a "cb".  The "cb" is created in
inet_diag_cmd_exact().  The "nlh" and "in_skb" are stored in "cb" as
the dump() interface does.  The total number of args in
inet_sk_diag_fill() is also cut from 10 to 7 and
that helps many callers to pass fewer args.

In particular,
"struct user_namespace *user_ns", "u32 pid", and "u32 seq"
can be replaced by accessing "cb->nlh" and "cb->skb".

A similar argument reduction is also made to
inet_twsk_diag_fill() and inet_req_diag_fill().

inet_csk_diag_dump() and inet_csk_diag_fill() are also removed.
They are mostly equivalent to inet_sk_diag_fill().  Their repeated
usages are very limited.  Thus, inet_sk_diag_fill() is directly used
in those occasions.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200225230409.1975173-1-kafai@fb.com
2020-02-27 18:50:19 -08:00
David S. Miller
9f6e055907 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The mptcp conflict was overlapping additions.

The SMC conflict was an additional and removal happening at the same
time.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 18:31:39 -08:00
Stefano Garzarella
3f74957fcb vsock: fix potential deadlock in transport->release()
Some transports (hyperv, virtio) acquire the sock lock during the
.release() callback.

In the vsock_stream_connect() we call vsock_assign_transport(); if
the socket was previously assigned to another transport, the
vsk->transport->release() is called, but the sock lock is already
held in the vsock_stream_connect(), causing a deadlock reported by
syzbot:

    INFO: task syz-executor280:9768 blocked for more than 143 seconds.
      Not tainted 5.6.0-rc1-syzkaller #0
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    syz-executor280 D27912  9768   9766 0x00000000
    Call Trace:
     context_switch kernel/sched/core.c:3386 [inline]
     __schedule+0x934/0x1f90 kernel/sched/core.c:4082
     schedule+0xdc/0x2b0 kernel/sched/core.c:4156
     __lock_sock+0x165/0x290 net/core/sock.c:2413
     lock_sock_nested+0xfe/0x120 net/core/sock.c:2938
     virtio_transport_release+0xc4/0xd60 net/vmw_vsock/virtio_transport_common.c:832
     vsock_assign_transport+0xf3/0x3b0 net/vmw_vsock/af_vsock.c:454
     vsock_stream_connect+0x2b3/0xc70 net/vmw_vsock/af_vsock.c:1288
     __sys_connect_file+0x161/0x1c0 net/socket.c:1857
     __sys_connect+0x174/0x1b0 net/socket.c:1874
     __do_sys_connect net/socket.c:1885 [inline]
     __se_sys_connect net/socket.c:1882 [inline]
     __x64_sys_connect+0x73/0xb0 net/socket.c:1882
     do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
     entry_SYSCALL_64_after_hwframe+0x49/0xbe

To avoid this issue, this patch remove the lock acquiring in the
.release() callback of hyperv and virtio transports, and it holds
the lock when we call vsk->transport->release() in the vsock core.

Reported-by: syzbot+731710996d79d0d58fbc@syzkaller.appspotmail.com
Fixes: 408624af4c ("vsock: use local transport when it is loaded")
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 12:03:56 -08:00
Russell King
5b502a7b29 net: dsa: propagate resolved link config via mac_link_up()
Propagate the resolved link configuration down via DSA's
phylink_mac_link_up() operation to allow split PCS/MAC to work.

Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 12:02:14 -08:00
Russell King
91a208f218 net: phylink: propagate resolved link config via mac_link_up()
Propagate the resolved link parameters via the mac_link_up() call for
MACs that do not automatically track their PCS state. We propagate the
link parameters via function arguments so that inappropriate members
of struct phylink_link_state can't be accessed, and creating a new
structure just for this adds needless complexity to the API.

Tested-by: Andre Przywara <andre.przywara@arm.com>
Tested-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 12:02:14 -08:00
David S. Miller
5c05a164d4 unix: It's CONFIG_PROC_FS not CONFIG_PROCFS
Fixes: 3a12500ed5 ("unix: define and set show_fdinfo only if procfs is enabled")
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:52:35 -08:00
Tobias Klauser
3a12500ed5 unix: define and set show_fdinfo only if procfs is enabled
Follow the pattern used with other *_show_fdinfo functions and only
define unix_show_fdinfo and set it in proto_ops if CONFIG_PROCFS
is set.

Fixes: 3c32da19a8 ("unix: Show number of pending scm files of receive queue in fdinfo")
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 11:11:57 -08:00
Madhuparna Bhowmik
edf0d283d9 ipv6: xfrm6_tunnel.c: Use built-in RCU list checking
hlist_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-02-27 10:17:41 +01:00
Russell King
07c6f9805f net: switchdev: do not propagate bridge updates across bridges
When configuring a tree of independent bridges, propagating changes
from the upper bridge across a bridge master to the lower bridge
ports brings surprises.

For example, a lower bridge may have vlan filtering enabled.  It
may have a vlan interface attached to the bridge master, which may
then be incorporated into another bridge.  As soon as the lower
bridge vlan interface is attached to the upper bridge, the lower
bridge has vlan filtering disabled.

This occurs because switchdev recursively applies its changes to
all lower devices no matter what.

Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Tested-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:58:33 -08:00
Karsten Graul
a2f2ef4a54 net/smc: check for valid ib_client_data
In smc_ib_remove_dev() check if the provided ib device was actually
initialized for SMC before.

Reported-by: syzbot+84484ccebdd4e5451d91@syzkaller.appspotmail.com
Fixes: a4cf0443c4 ("smc: introduce SMC as an IB-client")
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:56:25 -08:00
Dan Carpenter
9baeea5071 net: qrtr: Fix error pointer vs NULL bugs
The callers only expect NULL pointers, so returning an error pointer
will lead to an Oops.

Fixes: 0c2204a4ad ("net: qrtr: Migrate nameservice to kernel from userspace")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:52:31 -08:00
Paolo Abeni
dc24f8b4ec mptcp: add dummy icsk_sync_mss()
syzbot noted that the master MPTCP socket lacks the icsk_sync_mss
callback, and was able to trigger a null pointer dereference:

BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 8e171067 P4D 8e171067 PUD 93fa2067 PMD 0
Oops: 0010 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 8984 Comm: syz-executor066 Not tainted 5.6.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffffc900020b7b80 EFLAGS: 00010246
RAX: 1ffff110124ba600 RBX: 0000000000000000 RCX: ffff88809fefa600
RDX: ffff8880994cdb18 RSI: 0000000000000000 RDI: ffff8880925d3140
RBP: ffffc900020b7bd8 R08: ffffffff870225be R09: fffffbfff140652a
R10: fffffbfff140652a R11: 0000000000000000 R12: ffff8880925d35d0
R13: ffff8880925d3140 R14: dffffc0000000000 R15: 1ffff110124ba6ba
FS:  0000000001a0b880(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000000a6d6f000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 cipso_v4_sock_setattr+0x34b/0x470 net/ipv4/cipso_ipv4.c:1888
 netlbl_sock_setattr+0x2a7/0x310 net/netlabel/netlabel_kapi.c:989
 smack_netlabel security/smack/smack_lsm.c:2425 [inline]
 smack_inode_setsecurity+0x3da/0x4a0 security/smack/smack_lsm.c:2716
 security_inode_setsecurity+0xb2/0x140 security/security.c:1364
 __vfs_setxattr_noperm+0x16f/0x3e0 fs/xattr.c:197
 vfs_setxattr fs/xattr.c:224 [inline]
 setxattr+0x335/0x430 fs/xattr.c:451
 __do_sys_fsetxattr fs/xattr.c:506 [inline]
 __se_sys_fsetxattr+0x130/0x1b0 fs/xattr.c:495
 __x64_sys_fsetxattr+0xbf/0xd0 fs/xattr.c:495
 do_syscall_64+0xf7/0x1c0 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x440199
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffcadc19e48 EFLAGS: 00000246 ORIG_RAX: 00000000000000be
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440199
RDX: 0000000020000200 RSI: 00000000200001c0 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000003 R09: 00000000004002c8
R10: 0000000000000009 R11: 0000000000000246 R12: 0000000000401a20
R13: 0000000000401ab0 R14: 0000000000000000 R15: 0000000000000000
Modules linked in:
CR2: 0000000000000000

Address the issue adding a dummy icsk_sync_mss callback.
To properly sync the subflows mss and options list we need some
additional infrastructure, which will land to net-next.

Reported-by: syzbot+f4dfece964792d80b139@syzkaller.appspotmail.com
Fixes: 2303f994b3 ("mptcp: Associate MPTCP context with TCP socket")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:49:50 -08:00
Paolo Abeni
14c441b564 mptcp: defer work schedule until mptcp lock is released
Don't schedule the work queue right away, instead defer this
to the lock release callback.

This has the advantage that it will give recv path a chance to
complete -- this might have moved all pending packets from the
subflow to the mptcp receive queue, which allows to avoid the
schedule_work().

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:46:26 -08:00
Florian Westphal
2e52213c79 mptcp: avoid work queue scheduling if possible
We can't lock_sock() the mptcp socket from the subflow data_ready callback,
it would result in ABBA deadlock with the subflow socket lock.

We can however grab the spinlock: if that succeeds and the mptcp socket
is not owned at the moment, we can process the new skbs right away
without deferring this to the work queue.

This avoids the schedule_work and hence the small delay until the
work item is processed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:46:26 -08:00
Florian Westphal
bfae9dae44 mptcp: remove mptcp_read_actor
Only used to discard stale data from the subflow, so move
it where needed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:46:26 -08:00
Florian Westphal
600911ff5f mptcp: add rmem queue accounting
If userspace never drains the receive buffers we must stop draining
the subflow socket(s) at some point.

This adds the needed rmem accouting for this.
If the threshold is reached, we stop draining the subflows.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:46:26 -08:00
Florian Westphal
6771bfd9ee mptcp: update mptcp ack sequence from work queue
If userspace is not reading data, all the mptcp-level acks contain the
ack_seq from the last time userspace read data rather than the most
recent in-sequence value.

This causes pointless retransmissions for data that is already queued.

The reason for this is that all the mptcp protocol level processing
happens at mptcp_recv time.

This adds work queue to move skbs from the subflow sockets receive
queue on the mptcp socket receive queue (which was not used so far).

This allows us to announce the correct mptcp ack sequence in a timely
fashion, even when the application does not call recv() on the mptcp socket
for some time.

We still wake userspace tasks waiting for POLLIN immediately:
If the mptcp level receive queue is empty (because the work queue is
still pending) it can be filled from in-sequence subflow sockets at
recv time without a need to wait for the worker.

The skb_orphan when moving skbs from subflow to mptcp level is needed,
because the destructor (sock_rfree) relies on skb->sk (ssk!) lock
being taken.

A followup patch will add needed rmem accouting for the moved skbs.

Other problem: In case application behaves as expected, and calls
recv() as soon as mptcp socket becomes readable, the work queue will
only waste cpu cycles.  This will also be addressed in followup patches.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:46:26 -08:00
Paolo Abeni
8099201715 mptcp: add work queue skeleton
Will be extended with functionality in followup patches.
Initial user is moving skbs from subflows receive queue to
the mptcp-level receive queue.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:46:26 -08:00
Florian Westphal
101f6f851e mptcp: add and use mptcp_data_ready helper
allows us to schedule the work queue to drain the ssk receive queue in
a followup patch.

This is needed to avoid sending all-to-pessimistic mptcp-level
acknowledgements.  At this time, the ack_seq is what was last read by
userspace instead of the highest in-sequence number queued for reading.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:46:26 -08:00
Randy Dunlap
c535f92032 af_llc: fix if-statement empty body warning
When debugging via dprintk() is not enabled, make the dprintk()
macro be an empty do-while loop, as is done in
<linux/sunrpc/debug.h>.

This fixes a gcc warning when -Wextra is set:
../net/llc/af_llc.c:974:51: warning: suggest braces around empty body in an ‘if’ statement [-Wempty-body]

I have verified that there is not object code change (with gcc 7.5.0).

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: netdev@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:38:13 -08:00
Hans Wippel
a082ec897f net/smc: improve peer ID in CLC decline for SMC-R
According to RFC 7609, all CLC messages contain a peer ID that consists
of a unique instance ID and the MAC address of one of the host's RoCE
devices. But if a SMC-R connection cannot be established, e.g., because
no matching pnet table entry is found, the current implementation uses a
zero value in the CLC decline message although the host's peer ID is set
to a proper value.

If no RoCE and no ISM device is usable for a connection, there is no LGR
and the LGR check in smc_clc_send_decline() prevents that the peer ID is
copied into the CLC decline message for both SMC-D and SMC-R. So, this
patch modifies the check to also accept the case of no LGR. Also, only a
valid peer ID is copied into the decline message.

Signed-off-by: Hans Wippel <ndev@hwipl.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:27:06 -08:00
Hans Wippel
366bb249b5 net/smc: rework peer ID handling
This patch initializes the peer ID to a random instance ID and a zero
MAC address. If a RoCE device is in the host, the MAC address part of
the peer ID is overwritten with the respective address. Also, a function
for checking if the peer ID is valid is added. A peer ID is considered
valid if the MAC address part contains a non-zero MAC address.

Signed-off-by: Hans Wippel <ndev@hwipl.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:27:06 -08:00
Arjun Roy
0b7f41f687 tcp-zerocopy: Update returned getsockopt() optlen.
TCP receive zerocopy currently does not update the returned optlen for
getsockopt() if the user passed in a larger than expected value.
Thus, userspace cannot properly determine if all the fields are set in
the passed-in struct. This patch sets the optlen for this case before
returning, in keeping with the expected operation of getsockopt().

Fixes: c8856c0514 ("tcp-zerocopy: Return inq along with tcp receive zerocopy.")
Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:24:22 -08:00
Eric Dumazet
b6f6118901 ipv6: restrict IPV6_ADDRFORM operation
IPV6_ADDRFORM is able to transform IPv6 socket to IPv4 one.
While this operation sounds illogical, we have to support it.

One of the things it does for TCP socket is to switch sk->sk_prot
to tcp_prot.

We now have other layers playing with sk->sk_prot, so we should make
sure to not interfere with them.

This patch makes sure sk_prot is the default pointer for TCP IPv6 socket.

syzbot reported :
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD a0113067 P4D a0113067 PUD a8771067 PMD 0
Oops: 0010 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 10686 Comm: syz-executor.0 Not tainted 5.6.0-rc2-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffffc9000281fce0 EFLAGS: 00010246
RAX: 1ffffffff15f48ac RBX: ffffffff8afa4560 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880a69a8f40
RBP: ffffc9000281fd10 R08: ffffffff86ed9b0c R09: ffffed1014d351f5
R10: ffffed1014d351f5 R11: 0000000000000000 R12: ffff8880920d3098
R13: 1ffff1101241a613 R14: ffff8880a69a8f40 R15: 0000000000000000
FS:  00007f2ae75db700(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000000a3b85000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 inet_release+0x165/0x1c0 net/ipv4/af_inet.c:427
 __sock_release net/socket.c:605 [inline]
 sock_close+0xe1/0x260 net/socket.c:1283
 __fput+0x2e4/0x740 fs/file_table.c:280
 ____fput+0x15/0x20 fs/file_table.c:313
 task_work_run+0x176/0x1b0 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
 exit_to_usermode_loop arch/x86/entry/common.c:164 [inline]
 prepare_exit_to_usermode+0x480/0x5b0 arch/x86/entry/common.c:195
 syscall_return_slowpath+0x113/0x4a0 arch/x86/entry/common.c:278
 do_syscall_64+0x11f/0x1c0 arch/x86/entry/common.c:304
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x45c429
Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f2ae75dac78 EFLAGS: 00000246 ORIG_RAX: 0000000000000036
RAX: 0000000000000000 RBX: 00007f2ae75db6d4 RCX: 000000000045c429
RDX: 0000000000000001 RSI: 000000000000011a RDI: 0000000000000004
RBP: 000000000076bf20 R08: 0000000000000038 R09: 0000000000000000
R10: 0000000020000180 R11: 0000000000000246 R12: 00000000ffffffff
R13: 0000000000000a9d R14: 00000000004ccfb4 R15: 000000000076bf2c
Modules linked in:
CR2: 0000000000000000
---[ end trace 82567b5207e87bae ]---
RIP: 0010:0x0
Code: Bad RIP value.
RSP: 0018:ffffc9000281fce0 EFLAGS: 00010246
RAX: 1ffffffff15f48ac RBX: ffffffff8afa4560 RCX: dffffc0000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8880a69a8f40
RBP: ffffc9000281fd10 R08: ffffffff86ed9b0c R09: ffffed1014d351f5
R10: ffffed1014d351f5 R11: 0000000000000000 R12: ffff8880920d3098
R13: 1ffff1101241a613 R14: ffff8880a69a8f40 R15: 0000000000000000
FS:  00007f2ae75db700(0000) GS:ffff8880aea00000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffffffffffffd6 CR3: 00000000a3b85000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+1938db17e275e85dc328@syzkaller.appspotmail.com
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:20:58 -08:00
Ursula Braun
51e3dfa890 net/smc: fix cleanup for linkgroup setup failures
If an SMC connection to a certain peer is setup the first time,
a new linkgroup is created. In case of setup failures, such a
linkgroup is unusable and should disappear. As a first step the
linkgroup is removed from the linkgroup list in smc_lgr_forget().

There are 2 problems:
smc_listen_decline() might be called before linkgroup creation
resulting in a crash due to calling smc_lgr_forget() with
parameter NULL.
If a setup failure occurs after linkgroup creation, the connection
is never unregistered from the linkgroup, preventing linkgroup
freeing.

This patch introduces an enhanced smc_lgr_cleanup_early() function
which
* contains a linkgroup check for early smc_listen_decline()
  invocations
* invokes smc_conn_free() to guarantee unregistering of the
  connection.
* schedules fast linkgroup removal of the unusable linkgroup

And the unused function smcd_conn_free() is removed from smc_core.h.

Fixes: 3b2dec2603 ("net/smc: restructure client and server code in af_smc")
Fixes: 2a0674fffb ("net/smc: improve abnormal termination of link groups")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:18:07 -08:00
Christian Brauner
ef6a4c88e9 net: fix sysfs permssions when device changes network namespace
Now that we moved all the helpers in place and make use netdev_change_owner()
to fixup the permissions when moving network devices between network
namespaces.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:07:26 -08:00
Christian Brauner
d755407d44 net-sysfs: add queue_change_owner()
Add a function to change the owner of the queue entries for a network device
when it is moved between network namespaces.

Currently, when moving network devices between network namespaces the
ownership of the corresponding queue sysfs entries are not changed. This leads
to problems when tools try to operate on the corresponding sysfs files. Fix
this.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:07:26 -08:00
Christian Brauner
e6dee9f389 net-sysfs: add netdev_change_owner()
Add a function to change the owner of a network device when it is moved
between network namespaces.

Currently, when moving network devices between network namespaces the
ownership of the corresponding sysfs entries is not changed. This leads
to problems when tools try to operate on the corresponding sysfs files.
This leads to a bug whereby a network device that is created in a
network namespaces owned by a user namespace will have its corresponding
sysfs entry owned by the root user of the corresponding user namespace.
If such a network device has to be moved back to the host network
namespace the permissions will still be set to the user namespaces. This
means unprivileged users can e.g. trigger uevents for such incorrectly
owned devices. They can also modify the settings of the device itself.
Both of these things are unwanted.

For example, workloads will create network devices in the host network
namespace. Other tools will then proceed to move such devices between
network namespaces owner by other user namespaces. While the ownership
of the device itself is updated in
net/core/net-sysfs.c:dev_change_net_namespace() the corresponding sysfs
entry for the device is not:

drwxr-xr-x 5 nobody nobody    0 Jan 25 18:08 .
drwxr-xr-x 9 nobody nobody    0 Jan 25 18:08 ..
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_assign_type
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 addr_len
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 address
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 broadcast
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_changes
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_down_count
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 carrier_up_count
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_id
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dev_port
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 dormant
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 duplex
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 flags
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 gro_flush_timeout
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifalias
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 ifindex
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 iflink
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 link_mode
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 mtu
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 name_assign_type
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 netdev_group
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 operstate
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_id
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_port_name
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 phys_switch_id
drwxr-xr-x 2 nobody nobody    0 Jan 25 18:09 power
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 proto_down
drwxr-xr-x 4 nobody nobody    0 Jan 25 18:09 queues
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 speed
drwxr-xr-x 2 nobody nobody    0 Jan 25 18:09 statistics
lrwxrwxrwx 1 nobody nobody    0 Jan 25 18:08 subsystem -> ../../../../class/net
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:09 tx_queue_len
-r--r--r-- 1 nobody nobody 4096 Jan 25 18:09 type
-rw-r--r-- 1 nobody nobody 4096 Jan 25 18:08 uevent

However, if a device is created directly in the network namespace then
the device's sysfs permissions will be correctly updated:

drwxr-xr-x 5 root   root      0 Jan 25 18:12 .
drwxr-xr-x 9 nobody nobody    0 Jan 25 18:08 ..
-r--r--r-- 1 root   root   4096 Jan 25 18:12 addr_assign_type
-r--r--r-- 1 root   root   4096 Jan 25 18:12 addr_len
-r--r--r-- 1 root   root   4096 Jan 25 18:12 address
-r--r--r-- 1 root   root   4096 Jan 25 18:12 broadcast
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 carrier
-r--r--r-- 1 root   root   4096 Jan 25 18:12 carrier_changes
-r--r--r-- 1 root   root   4096 Jan 25 18:12 carrier_down_count
-r--r--r-- 1 root   root   4096 Jan 25 18:12 carrier_up_count
-r--r--r-- 1 root   root   4096 Jan 25 18:12 dev_id
-r--r--r-- 1 root   root   4096 Jan 25 18:12 dev_port
-r--r--r-- 1 root   root   4096 Jan 25 18:12 dormant
-r--r--r-- 1 root   root   4096 Jan 25 18:12 duplex
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 flags
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 gro_flush_timeout
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 ifalias
-r--r--r-- 1 root   root   4096 Jan 25 18:12 ifindex
-r--r--r-- 1 root   root   4096 Jan 25 18:12 iflink
-r--r--r-- 1 root   root   4096 Jan 25 18:12 link_mode
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 mtu
-r--r--r-- 1 root   root   4096 Jan 25 18:12 name_assign_type
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 netdev_group
-r--r--r-- 1 root   root   4096 Jan 25 18:12 operstate
-r--r--r-- 1 root   root   4096 Jan 25 18:12 phys_port_id
-r--r--r-- 1 root   root   4096 Jan 25 18:12 phys_port_name
-r--r--r-- 1 root   root   4096 Jan 25 18:12 phys_switch_id
drwxr-xr-x 2 root   root      0 Jan 25 18:12 power
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 proto_down
drwxr-xr-x 4 root   root      0 Jan 25 18:12 queues
-r--r--r-- 1 root   root   4096 Jan 25 18:12 speed
drwxr-xr-x 2 root   root      0 Jan 25 18:12 statistics
lrwxrwxrwx 1 nobody nobody    0 Jan 25 18:12 subsystem -> ../../../../class/net
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 tx_queue_len
-r--r--r-- 1 root   root   4096 Jan 25 18:12 type
-rw-r--r-- 1 root   root   4096 Jan 25 18:12 uevent

Now, when creating a network device in a network namespace owned by a
user namespace and moving it to the host the permissions will be set to
the id that the user namespace root user has been mapped to on the host
leading to all sorts of permission issues:

458752
drwxr-xr-x 5 458752 458752      0 Jan 25 18:12 .
drwxr-xr-x 9 root   root        0 Jan 25 18:08 ..
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 addr_assign_type
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 addr_len
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 address
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 broadcast
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 carrier
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 carrier_changes
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 carrier_down_count
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 carrier_up_count
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 dev_id
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 dev_port
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 dormant
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 duplex
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 flags
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 gro_flush_timeout
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 ifalias
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 ifindex
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 iflink
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 link_mode
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 mtu
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 name_assign_type
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 netdev_group
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 operstate
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 phys_port_id
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 phys_port_name
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 phys_switch_id
drwxr-xr-x 2 458752 458752      0 Jan 25 18:12 power
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 proto_down
drwxr-xr-x 4 458752 458752      0 Jan 25 18:12 queues
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 speed
drwxr-xr-x 2 458752 458752      0 Jan 25 18:12 statistics
lrwxrwxrwx 1 root   root        0 Jan 25 18:12 subsystem -> ../../../../class/net
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 tx_queue_len
-r--r--r-- 1 458752 458752   4096 Jan 25 18:12 type
-rw-r--r-- 1 458752 458752   4096 Jan 25 18:12 uevent

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 20:07:25 -08:00
Jiri Pirko
1521a67e60 sched: act: count in the size of action flags bitfield
The put of the flags was added by the commit referenced in fixes tag,
however the size of the message was not extended accordingly.

Fix this by adding size of the flags bitfield to the message size.

Fixes: e382267860 ("net: sched: update action implementations to support flags")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 17:10:44 -08:00
Madhuparna Bhowmik
2eb51c75dc net: core: devlink.c: Use built-in RCU list checking
list_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled.

The devlink->lock is held when devlink_dpipe_table_find()
is called in non RCU read side section. Therefore, pass struct devlink
to devlink_dpipe_table_find() for lockdep checking.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 16:59:18 -08:00
David S. Miller
574b238f64 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes:

1) Perform garbage collection from workqueue to fix rcu detected
   stall in ipset hash set types, from Jozsef Kadlecsik.

2) Fix the forceadd evaluation path, also from Jozsef.

3) Fix nft_set_pipapo selftest, from Stefano Brivio.

4) Crash when add-flush-add element in pipapo set, also from Stefano.
   Add test to cover this crash.

5) Remove sysctl entry under mutex in hashlimit, from Cong Wang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 16:30:17 -08:00
Cong Wang
99b79c3900 netfilter: xt_hashlimit: unregister proc file before releasing mutex
Before releasing the global mutex, we only unlink the hashtable
from the hash list, its proc file is still not unregistered at
this point. So syzbot could trigger a race condition where a
parallel htable_create() could register the same file immediately
after the mutex is released.

Move htable_remove_proc_entry() back to mutex protection to
fix this. And, fold htable_destroy() into htable_put() to make
the code slightly easier to understand.

Reported-and-tested-by: syzbot+d195fd3b9a364ddd6731@syzkaller.appspotmail.com
Fixes: c4a3922d2d ("netfilter: xt_hashlimit: reduce hashlimit_mutex scope for htable_put()")
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-26 23:25:07 +01:00
Michal Kubecek
e34f1753ee ethtool: limit bitset size
Syzbot reported that ethnl_compact_sanity_checks() can be tricked into
reading past the end of ETHTOOL_A_BITSET_VALUE and ETHTOOL_A_BITSET_MASK
attributes and even the message by passing a value between (u32)(-31)
and (u32)(-1) as ETHTOOL_A_BITSET_SIZE.

The problem is that DIV_ROUND_UP(attr_nbits, 32) is 0 for such values so
that zero length ETHTOOL_A_BITSET_VALUE will pass the length check but
ethnl_bitmap32_not_zero() check would try to access up to 512 MB of
attribute "payload".

Prevent this overflow byt limiting the bitset size. Technically, compact
bitset format would allow bitset sizes up to almost 2^18 (so that the
nest size does not exceed U16_MAX) but bitsets used by ethtool are much
shorter. S16_MAX, the largest value which can be directly used as an
upper limit in policy, should be a reasonable compromise.

Fixes: 10b518d4e6 ("ethtool: netlink bitset handling")
Reported-by: syzbot+7fd4ed5b4234ab1fdccd@syzkaller.appspotmail.com
Reported-by: syzbot+709b7a64d57978247e44@syzkaller.appspotmail.com
Reported-by: syzbot+983cb8fb2d17a7af549d@syzkaller.appspotmail.com
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 11:27:31 -08:00
Amritha Nambiar
6e11d1578f net: Fix Tx hash bound checking
Fixes the lower and upper bounds when there are multiple TCs and
traffic is on the the same TC on the same device.

The lower bound is represented by 'qoffset' and the upper limit for
hash value is 'qcount + qoffset'. This gives a clean Rx to Tx queue
mapping when there are multiple TCs, as the queue indices for upper TCs
will be offset by 'qoffset'.

v2: Fixed commit description based on comments.

Fixes: 1b837d489e ("net: Revoke export for __skb_tx_hash, update it to just be static skb_tx_hash")
Fixes: eadec877ce ("net: Add support for subordinate traffic classes to netdev_pick_tx")
Signed-off-by: Amritha Nambiar <amritha.nambiar@intel.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Reviewed-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-26 11:14:10 -08:00
Stefano Brivio
212d58c106 nft_set_pipapo: Actually fetch key data in nft_pipapo_remove()
Phil reports that adding elements, flushing and re-adding them
right away:

  nft add table t '{ set s { type ipv4_addr . inet_service; flags interval; }; }'
  nft add element t s '{ 10.0.0.1 . 22-25, 10.0.0.1 . 10-20 }'
  nft flush set t s
  nft add element t s '{ 10.0.0.1 . 10-20, 10.0.0.1 . 22-25 }'

triggers, almost reliably, a crash like this one:

  [   71.319848] general protection fault, probably for non-canonical address 0x6f6b6e696c2e756e: 0000 [#1] PREEMPT SMP PTI
  [   71.321540] CPU: 3 PID: 1201 Comm: kworker/3:2 Not tainted 5.6.0-rc1-00377-g2bb07f4e1d861 #192
  [   71.322746] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190711_202441-buildvm-armv7-10.arm.fedoraproject.org-2.fc31 04/01/2014
  [   71.324430] Workqueue: events nf_tables_trans_destroy_work [nf_tables]
  [   71.325387] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables]
  [   71.326164] Code: 89 d4 84 c0 74 0e 8b 77 44 0f b6 f8 48 01 df e8 41 ff ff ff 45 84 e4 74 36 44 0f b6 63 08 45 84 e4 74 2c 49 01 dc 49 8b 04 24 <48> 8b 40 38 48 85 c0 74 4f 48 89 e7 4c 8b
  [   71.328423] RSP: 0018:ffffc9000226fd90 EFLAGS: 00010282
  [   71.329225] RAX: 6f6b6e696c2e756e RBX: ffff88813ab79f60 RCX: ffff88813931b5a0
  [   71.330365] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88813ab79f9a
  [   71.331473] RBP: ffff88813ab79f60 R08: 0000000000000008 R09: 0000000000000000
  [   71.332627] R10: 000000000000021c R11: 0000000000000000 R12: ffff88813ab79fc2
  [   71.333615] R13: ffff88813b3adf50 R14: dead000000000100 R15: ffff88813931b8a0
  [   71.334596] FS:  0000000000000000(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
  [   71.335780] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [   71.336577] CR2: 000055ac683710f0 CR3: 000000013a222003 CR4: 0000000000360ee0
  [   71.337533] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [   71.338557] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [   71.339718] Call Trace:
  [   71.340093]  nft_pipapo_destroy+0x7a/0x170 [nf_tables_set]
  [   71.340973]  nft_set_destroy+0x20/0x50 [nf_tables]
  [   71.341879]  nf_tables_trans_destroy_work+0x246/0x260 [nf_tables]
  [   71.342916]  process_one_work+0x1d5/0x3c0
  [   71.343601]  worker_thread+0x4a/0x3c0
  [   71.344229]  kthread+0xfb/0x130
  [   71.344780]  ? process_one_work+0x3c0/0x3c0
  [   71.345477]  ? kthread_park+0x90/0x90
  [   71.346129]  ret_from_fork+0x35/0x40
  [   71.346748] Modules linked in: nf_tables_set nf_tables nfnetlink 8021q [last unloaded: nfnetlink]
  [   71.348153] ---[ end trace 2eaa8149ca759bcc ]---
  [   71.349066] RIP: 0010:nft_set_elem_destroy+0xa5/0x110 [nf_tables]
  [   71.350016] Code: 89 d4 84 c0 74 0e 8b 77 44 0f b6 f8 48 01 df e8 41 ff ff ff 45 84 e4 74 36 44 0f b6 63 08 45 84 e4 74 2c 49 01 dc 49 8b 04 24 <48> 8b 40 38 48 85 c0 74 4f 48 89 e7 4c 8b
  [   71.350017] RSP: 0018:ffffc9000226fd90 EFLAGS: 00010282
  [   71.350019] RAX: 6f6b6e696c2e756e RBX: ffff88813ab79f60 RCX: ffff88813931b5a0
  [   71.350019] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffff88813ab79f9a
  [   71.350020] RBP: ffff88813ab79f60 R08: 0000000000000008 R09: 0000000000000000
  [   71.350021] R10: 000000000000021c R11: 0000000000000000 R12: ffff88813ab79fc2
  [   71.350022] R13: ffff88813b3adf50 R14: dead000000000100 R15: ffff88813931b8a0
  [   71.350025] FS:  0000000000000000(0000) GS:ffff88813bd80000(0000) knlGS:0000000000000000
  [   71.350026] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [   71.350027] CR2: 000055ac683710f0 CR3: 000000013a222003 CR4: 0000000000360ee0
  [   71.350028] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
  [   71.350028] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
  [   71.350030] Kernel panic - not syncing: Fatal exception
  [   71.350412] Kernel Offset: disabled
  [   71.365922] ---[ end Kernel panic - not syncing: Fatal exception ]---

which is caused by dangling elements that have been deactivated, but
never removed.

On a flush operation, nft_pipapo_walk() walks through all the elements
in the mapping table, which are then deactivated by nft_flush_set(),
one by one, and added to the commit list for removal. Element data is
then freed.

On transaction commit, nft_pipapo_remove() is called, and failed to
remove these elements, leading to the stale references in the mapping.
The first symptom of this, revealed by KASan, is a one-byte
use-after-free in subsequent calls to nft_pipapo_walk(), which is
usually not enough to trigger a panic. When stale elements are used
more heavily, though, such as double-free via nft_pipapo_destroy()
as in Phil's case, the problem becomes more noticeable.

The issue comes from that fact that, on a flush operation,
nft_pipapo_remove() won't get the actual key data via elem->key,
elements to be deleted upon commit won't be found by the lookup via
pipapo_get(), and removal will be skipped. Key data should be fetched
via nft_set_ext_key(), instead.

Reported-by: Phil Sutter <phil@nwl.cc>
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-26 14:33:09 +01:00
Pablo Neira Ayuso
9ea4894ba4 Merge branch 'master' of git://blackhole.kfki.hu/nf
Jozsef Kadlecsik says:

====================
ipset patches for nf

The first one is larger than usual, but the issue could not be solved simpler.
Also, it's a resend of the patch I submitted a few days ago, with a one line
fix on top of that: the size of the comment extensions was not taken into
account at reporting the full size of the set.

- Fix "INFO: rcu detected stall in hash_xxx" reports of syzbot
  by introducing region locking and using workqueue instead of timer based
  gc of timed out entries in hash types of sets in ipset.
- Fix the forceadd evaluation path - the bug was also uncovered by the syzbot.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-26 13:55:15 +01:00
Jiri Pirko
5a2e106c74 devlink: extend devlink_trap_report() to accept cookie and pass
Add cookie argument to devlink_trap_report() allowing driver to pass in
the user cookie. Pass on the cookie down to drop monitor code.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-25 11:05:54 -08:00
Jiri Pirko
742b8cceaa drop_monitor: extend by passing cookie from driver
If driver passed along the cookie, push it through Netlink.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-25 11:05:54 -08:00
Jiri Pirko
85b0589ede devlink: add trap metadata type for cookie
Allow driver to indicate cookie metadata for registered traps.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-25 11:05:54 -08:00
Jiri Pirko
2008495d81 flow_offload: pass action cookie through offload structures
Extend struct flow_action_entry in order to hold TC action cookie
specified by user inserting the action.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-25 11:05:54 -08:00
David Miller
6eac7795e8 bpf/tests: Use migrate disable instead of preempt disable
Replace the preemption disable/enable with migrate_disable/enable() to
reflect the actual requirement and to allow PREEMPT_RT to substitute it
with an actual migration disable mechanism which does not disable
preemption.

[ tglx: Switched it over to migrate disable ]

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145643.785306549@linutronix.de
2020-02-24 16:20:09 -08:00
David Miller
3d9f773cf2 bpf: Use bpf_prog_run_pin_on_cpu() at simple call sites.
All of these cases are strictly of the form:

	preempt_disable();
	BPF_PROG_RUN(...);
	preempt_enable();

Replace this with bpf_prog_run_pin_on_cpu() which wraps BPF_PROG_RUN()
with:

	migrate_disable();
	BPF_PROG_RUN(...);
	migrate_enable();

On non RT enabled kernels this maps to preempt_disable/enable() and on RT
enabled kernels this solely prevents migration, which is sufficient as
there is no requirement to prevent reentrancy to any BPF program from a
preempting task. The only requirement is that the program stays on the same
CPU.

Therefore, this is a trivially correct transformation.

The seccomp loop does not need protection over the loop. It only needs
protection per BPF filter program

[ tglx: Converted to bpf_prog_run_pin_on_cpu() ]

Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200224145643.691493094@linutronix.de
2020-02-24 16:20:09 -08:00
David S. Miller
3614d05b5e A few fixes:
* remove a double mutex-unlock
  * fix a leak in an error path
  * NULL pointer check
  * include if_vlan.h where needed
  * avoid RCU list traversal when not under RCU
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl5UGA8ACgkQB8qZga/f
 l8Qk4g/8DstHoC7R74x4jrOyGffw7ss7XC8YJi2x3IENg7mAYdEFvnUC7GQmrCso
 aMypkOWac0H4cAw2BId8CFevbWZbanJ9uvC4JicQpaNa9bZkqBaDMFwO3k66S1hA
 JBvmOqPy8FNuFRsXTicMPw7/I6QazR+sW7Ed68L7ygikkklvTeovH9qkL+zHHMcK
 dAVS6ZKB2jSkEOtnlG01QMRwdIijTkD6KjGDECVMzau9XC+fGowQeuy+xuAU0THq
 Myef9CpzU2Q47Sa61fEyXVC54Izp9B1ZuohM2wfrN2zC3SKxsmPM1lm4GhrJOfUF
 wJTQY75z1kAmCAJDKpwBrFkyIEfUPQV1v9HHit0LX4y11NYNWL22liLhPv+kZDmD
 9ht1IxAP1Nw/DSVKaXUmwk1tFkETfk2BvCkNjVrwKlt88QgvQAL/uzgKoz5E6VIM
 hU1hp0mJCzsmyt/COWOf0l/QpO2nN3/64YHZgFiqOGnhXr5r0L74Jf3lcdWyaXRU
 hcJ0M/EcfGLWOlHFcRkNdWCkBFk5Daf4/zzQ3vO9weAlb7wOLDo9mHoXMBgP6R9l
 kCnxjxgIqmpQLvMo/sBt2cLaf3bJU40p/BYU7hqI7CerODFc3JKkCs1ODZPbYDaR
 Fk60mlElkBD3P4EU8/6nbXHOfG9FvEiZjhT7kRcUdPx0dLA/KMk=
 =7JH4
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2020-02-24' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg

====================
A few fixes:
 * remove a double mutex-unlock
 * fix a leak in an error path
 * NULL pointer check
 * include if_vlan.h where needed
 * avoid RCU list traversal when not under RCU
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 15:43:38 -08:00
David S. Miller
3b3e808cd8 A new set of changes:
* lots of small documentation fixes, from Jérôme Pouiller
  * beacon protection (BIGTK) support from Jouni Malinen
  * some initial code for TID configuration, from Tamizh chelvam
  * I reverted some new API before it's actually used, because
    it's wrong to mix controlled port and preauth
  * a few other cleanups/fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl5UFlsACgkQB8qZga/f
 l8SwNQ//Y/dELGODumDxY03l/Bj/+XL7HYl3Yn+J8mt2mYPC3zjTjvcRJBApiAog
 1Oyd75fd+EqjxDUT+ngN8uJLQ/yCVsdwfpZhwV7VanAOuLI9aYkIXnai/Qs96rj3
 yDIlrBVsmfsaxf/2e9UmsLmeUSm7C5s1EzaAIwoRGvcUH0pbH9WYAXF1QV+8fmXa
 yoXuHV5Bv+wOW2xWqWJsFpoV109AW24pwJm0vlILcpFP/jno2GsRvwEpnC/GJhEA
 4+jfEj0KlFkOewp0/HcqrUJp4yDEBhnhTTYgDL3hSWgKRVorKqY4/QmpKQCmpVQk
 Qrb6k+TrnLmKBQKdqfd+PKAEC9U/9Wjg0KLPyc9btBGFNSUG3QoDigzxxIvSlW6w
 2vyanDW6780FTIi8sA7sq1cBLosIyoFG44YYwbMidVtxhBk1LMRvetNAOnAJeycp
 Abbp/A2EdvzM+ZMNMRwWlsgig6WkGY7jy/zpcmQUdALM+yT2o7D5ZwxE5pa2ggds
 jf0eER1vVCEpOL70swNSZbAnDNCzBzTN64GX9gQIjdVT+nMUYQXwuOgEmho1FshD
 bgZz4PcaOCCSTice9GYCC3C9OqXBsE2DyBwytYYzahyDiQH13Iz6wEq/+tIIjzCN
 KKRdD12TXaPobLwv5zI3kogJ1I4P1fa6RZYpkngLpMbQrQ7qiDI=
 =x0rf
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-net-next-2020-02-24' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
A new set of changes:
 * lots of small documentation fixes, from Jérôme Pouiller
 * beacon protection (BIGTK) support from Jouni Malinen
 * some initial code for TID configuration, from Tamizh chelvam
 * I reverted some new API before it's actually used, because
   it's wrong to mix controlled port and preauth
 * a few other cleanups/fixes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 15:41:54 -08:00
Colin Ian King
13ef6ae8c0 net: qrtr: fix spelling mistake "serivce" -> "service"
There is a spelling mistake in a pr_err message. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 15:41:14 -08:00
Martin Varghese
571912c69f net: UDP tunnel encapsulation module for tunnelling different protocols like MPLS, IP, NSH etc.
The Bareudp tunnel module provides a generic L3 encapsulation
tunnelling module for tunnelling different protocols like MPLS,
IP,NSH etc inside a UDP tunnel.

Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:31:42 -08:00
Jules Irenge
48851e9e80 af_unix: Add missing annotation for unix_wait_for_peer()
Sparse reports a warning unix_wait_for_peer()

warning: context imbalance in unix_wait_for_peer() - unexpected unlock

The root cause is the missing annotation at unix_wait_for_peer()
Add the missing annotation __releases(&unix_sk(other)->lock)

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:26:49 -08:00
Jules Irenge
3283ff2ea7 dccp: Add missing annotation for dccp_child_process()
Sparse reports a warning at dccp_child_process()
warning: context imbalance in dccp_child_process() - unexpected unlock
The root cause is the missing annotation at dccp_child_process()
Add the missing __releases(child) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:26:49 -08:00
Jules Irenge
be21139f35 net: netrom: Add missing annotation for nr_neigh_stop()
Sparse reports a warning at nr_neigh_stop()
warning: context imbalance in nr_neigh_stop() - unexpected unlock
The root cause is the missing annotation at nr_neigh_stop()
Add the missing __releases(&nr_neigh_list_lock) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:26:49 -08:00
Jules Irenge
2d6b6acfce net: netrom: Add missing annotation for nr_neigh_start()
Sparse reports a warning at nr_neigh_start()
warning: context imbalance in nr_neigh_start() - wrong count at exit
The root cause is the missing annotation at nr_neigh_start()
Add the missing __acquires(&nr_neigh_list_lock) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:26:49 -08:00
Jules Irenge
0eb713fb66 net: netrom: Add missing annotation for nr_node_stop()
Sparse reports a warning at nr_node_stop()
warning: context imbalance in nr_node_stop() - wrong count at exit
The root cause is the missing annotation at nr_node_stop()
Add the missing __releases(&nr_node_list_lock) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:26:49 -08:00
Jules Irenge
5018adfd7a net: netrom: Add missing annotation for nr_node_start()
Sparse reports a warning at nr_node_start()
warning: context imbalance in nr_node_start() - wrong count at exit
The root cause is the missing annotation at nr_node_start()
Add the missing __acquires(&nr_node_list_lock) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:26:49 -08:00
Jules Irenge
8b003f0d5c netrom: Add missing annotation for nr_info_stop()
Sparse reports a warning at nr_info_stop()
warning: context imbalance in nr_info_stop() - unexpected unlock
The root cause is the missing annotation at nr_info_stop()
Add the missing __releases(&nr_list_lock)

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:26:49 -08:00
Jules Irenge
d087f18378 netrom: Add missing annotation for nr_info_start()
Sparse reports a warning at nr_info_start()
warning: context imbalance in nr_info_start() - wrong count at exit
The root cause is the missing annotation at nr_info_start()
Add the missing __acquires(&nr_list_lock)

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:26:49 -08:00
Jules Irenge
8e0f8ccfb0 net: Add missing annotation for llc_seq_start()
Sparse reports a warning at llc_seq_start()

warning: context imbalance in llc_seq_start() - wrong count at exit
The root cause is the msiing annotation at llc_seq_start()

Add the missing __acquires(RCU) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:26:49 -08:00
Jules Irenge
b77b4f634e sctp: Add missing annotation for sctp_transport_walk_stop()
Sparse reports a warning at sctp_transport_walk_stop()

warning: context imbalance in sctp_transport_walk_stop
	- wrong count at exit

The root cause is the missing annotation at sctp_transport_walk_stop()
Add the missing __releases(RCU) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:26:49 -08:00
Jules Irenge
6c72b7740c sctp: Add missing annotation for sctp_transport_walk_start()
Sparse reports a warning at sctp_transport_walk_start()

warning: context imbalance in sctp_transport_walk_start
	- wrong count at exit

The root cause is the missing annotation at sctp_transport_walk_start()
Add the missing __acquires(RCU) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:26:48 -08:00
Jules Irenge
887cf3d139 sctp: Add missing annotation for sctp_err_finish()
Sparse reports a warning at sctp_err_finish()
warning: context imbalance in sctp_err_finish() - unexpected unlock

The root cause is a missing annotation at sctp_err_finish()
Add the missing  __releases(&((__sk)->sk_lock.slock)) annotation

Signed-off-by: Jules Irenge <jbi.octave@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:26:48 -08:00
Amol Grover
28b380e289 ip6mr: Fix RCU list debugging warning
ip6mr_for_each_table() macro uses list_for_each_entry_rcu()
for traversing outside an RCU read side critical section
but under the protection of rtnl_mutex. Hence add the
corresponding lockdep expression to silence the following
false-positive warnings:

[    4.319479] =============================
[    4.319480] WARNING: suspicious RCU usage
[    4.319482] 5.5.4-stable #17 Tainted: G            E
[    4.319483] -----------------------------
[    4.319485] net/ipv6/ip6mr.c:1243 RCU-list traversed in non-reader section!!

[    4.456831] =============================
[    4.456832] WARNING: suspicious RCU usage
[    4.456834] 5.5.4-stable #17 Tainted: G            E
[    4.456835] -----------------------------
[    4.456837] net/ipv6/ip6mr.c:1582 RCU-list traversed in non-reader section!!

Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:19:21 -08:00
Amol Grover
c8b91770f5 tcp: ipv4: Pass lockdep expression to RCU lists
md5sig->head maybe traversed using hlist_for_each_entry_rcu
outside an RCU read-side critical section but under the protection
of socket lock.

Hence, add corresponding lockdep expression to silence false-positive
warnings, and harden RCU lists.

Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:07:03 -08:00
Madhuparna Bhowmik
0a087bf232 net: 802: psnap.c: Use built-in RCU list checking
list_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:02:53 -08:00
Amol Grover
958a93c154 tcp, ulp: Pass lockdep expression to RCU lists
tcp_ulp_list is traversed using list_for_each_entry_rcu
outside an RCU read-side critical section but under the protection
of tcp_ulp_list_lock.

Hence, add corresponding lockdep expression to silence false-positive
warnings, and harden RCU lists.t

Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:01:54 -08:00
Jiri Pirko
ecd942a0ef devlink: add ACL generic packet traps
Add packet traps that can report packets that were dropped during ACL
processing.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 11:55:06 -08:00
Nikolay Aleksandrov
823d81b0fa net: bridge: fix stale eth hdr pointer in br_dev_xmit
In br_dev_xmit() we perform vlan filtering in br_allowed_ingress() but
if the packet has the vlan header inside (e.g. bridge with disabled
tx-vlan-offload) then the vlan filtering code will use skb_vlan_untag()
to extract the vid before filtering which in turn calls pskb_may_pull()
and we may end up with a stale eth pointer. Moreover the cached eth header
pointer will generally be wrong after that operation. Remove the eth header
caching and just use eth_hdr() directly, the compiler does the right thing
and calculates it only once so we don't lose anything.

Fixes: 057658cb33 ("bridge: suppress arp pkts on BR_NEIGH_SUPPRESS ports")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 11:11:19 -08:00
David Ahern
366ed1aca6 net: Remove unneeded export of a couple of xdp generic functions
generic_xdp_tx and xdp_do_generic_redirect are only used by builtin
code, so remove the EXPORT_SYMBOL_GPL for them.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 11:03:38 -08:00
Tamizh chelvam
370f51d5ed mac80211: Add api to support configuring TID specific configuration
Implement drv_set_tid_config api to allow TID specific
configuration and drv_reset_tid_config api to reset peer
specific TID configuration. This per-TID onfiguration
will be applied for all the connected stations when MAC is NULL.

Signed-off-by: Tamizh chelvam <tamizhr@codeaurora.org>
Link: https://lore.kernel.org/r/1579506687-18296-7-git-send-email-tamizhr@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 14:07:01 +01:00
Tamizh chelvam
04f7d142f5 nl80211: Add support to configure TID specific RTSCTS configuration
This patch adds support to configure per TID RTSCTS control
configuration to enable/disable through the
NL80211_TID_CONFIG_ATTR_RTSCTS_CTRL attribute.

Signed-off-by: Tamizh chelvam <tamizhr@codeaurora.org>
Link: https://lore.kernel.org/r/1579506687-18296-5-git-send-email-tamizhr@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 13:56:57 +01:00
Tamizh chelvam
ade274b23e nl80211: Add support to configure TID specific AMPDU configuration
This patch adds support to configure per TID AMPDU control
configuration to enable/disable aggregation through the
NL80211_TID_CONFIG_ATTR_AMPDU_CTRL attribute.

Signed-off-by: Tamizh chelvam <tamizhr@codeaurora.org>
Link: https://lore.kernel.org/r/1579506687-18296-4-git-send-email-tamizhr@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 13:56:49 +01:00
Tamizh chelvam
6a21d16c4d nl80211: Add support to configure TID specific retry configuration
This patch adds support to configure per TID retry configuration
through the NL80211_TID_CONFIG_ATTR_RETRY_SHORT and
NL80211_TID_CONFIG_ATTR_RETRY_LONG attributes. This TID specific
retry configuration will have more precedence than phy level
configuration.

Signed-off-by: Tamizh chelvam <tamizhr@codeaurora.org>
Link: https://lore.kernel.org/r/1579506687-18296-3-git-send-email-tamizhr@codeaurora.org
[rebase completely on top of my previous API changes]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 13:48:54 +01:00
Johannes Berg
3710a8a628 nl80211: modify TID-config API
Make some changes to the TID-config API:
 * use u16 in nl80211 (only, and restrict to using 8 bits for now),
   to avoid issues in the future if we ever want to use higher TIDs.
 * reject empty TIDs mask (via netlink policy)
 * change feature advertising to not use extended feature flags but
   have own mechanism for this, which simplifies the code
 * fix all variable names from 'tid' to 'tids' since it's a mask
 * change to cfg80211_ name prefixes, not ieee80211_
 * fix some minor docs/spelling things.

Change-Id: Ia234d464b3f914cdeab82f540e018855be580dce
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 12:01:10 +01:00
Tamizh chelvam
77f576deaa nl80211: Add NL command to support TID speicific configurations
Add the new NL80211_CMD_SET_TID_CONFIG command to support
data TID specific configuration. Per TID configuration is
passed in the nested NL80211_ATTR_TID_CONFIG attribute.

This patch adds support to configure per TID noack policy
through the NL80211_TID_CONFIG_ATTR_NOACK attribute.

Signed-off-by: Tamizh chelvam <tamizhr@codeaurora.org>
Link: https://lore.kernel.org/r/1579506687-18296-2-git-send-email-tamizhr@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 11:15:25 +01:00
Madhuparna Bhowmik
253216ffb2 mac80211: rx: avoid RCU list traversal under mutex
local->sta_mtx is held in __ieee80211_check_fast_rx_iface().
No need to use list_for_each_entry_rcu() as it also requires
a cond argument to avoid false lockdep warnings when not used in
RCU read-side section (with CONFIG_PROVE_RCU_LIST).
Therefore use list_for_each_entry();

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Link: https://lore.kernel.org/r/20200223143302.15390-1-madhuparnabhowmik10@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:42:38 +01:00
Johannes Berg
e3ae39edbc nl80211: explicitly include if_vlan.h
We use that here, and do seem to get it through some recursive
include, but better include it explicitly.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200224093814.1b9c258fec67.I45ac150d4e11c72eb263abec9f1f0c7add9bef2b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:41:13 +01:00
Jouni Malinen
af2d14b01c mac80211: Beacon protection using the new BIGTK (STA)
This adds support for mac80211 to verify that received Beacon frames
have a valid MME in station mode when a BIGTK is configured.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200222132548.20835-6-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:36:17 +01:00
Jouni Malinen
0a3a84360b mac80211: Beacon protection using the new BIGTK (AP)
This adds support for mac80211 to add an MME into Beacon frames in AP
mode when a BIGTK is configured.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200222132548.20835-5-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:36:11 +01:00
Jouni Malinen
2d5d4b0a6d mac80211: Update BIP to support Beacon frames
When BIP is used to protect Beacon frames, the Timestamp field is masked
to zero. Otherwise, the BIP processing is identical to the way it was
already used with group-addressed Robust Management frames.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200222132548.20835-4-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:36:03 +01:00
Jouni Malinen
e5473e80d4 mac80211: Support BIGTK configuration for Beacon protection
Extend mac80211 key configuration to support the new BIGTK with key
index values 6 and 7. Support for actually protecting Beacon frames
(adding the MME in AP mode and checking it in STA mode) is covered in
separate commits.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200222132548.20835-3-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:35:57 +01:00
Jouni Malinen
56be393fa8 cfg80211: Support key configuration for Beacon protection (BIGTK)
IEEE P802.11-REVmd/D3.0 adds support for protecting Beacon frames using
a new set of keys (BIGTK; key index 6..7) similarly to the way
group-addressed Robust Management frames are protected (IGTK; key index
4..5). Extend cfg80211 and nl80211 to allow the new BIGTK to be
configured. Add an extended feature flag to indicate driver support for
the new key index values to avoid array overflows in driver
implementations and also to indicate to user space when this
functionality is available.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200222132548.20835-2-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:35:48 +01:00
Jouni Malinen
f8af764bf1 cfg80211: More error messages for key addition failures
These were helpful while working with extensions to NL80211_CMD_NEW_KEY,
so add more explicit error reporting for additional cases that can fail
while that command is being processed.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200222132548.20835-1-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:27:49 +01:00
Johannes Berg
8d74a623cc Revert "nl80211: add src and dst addr attributes for control port tx/rx"
This reverts commit 8c3ed7aa2b.

As Jouni points out, there's really no need for this, since the
RSN pre-authentication frames are normal data frames, not port
control frames (locally).

We can still revert this now since it hasn't actually gone beyond
-next.

Fixes: 8c3ed7aa2b ("nl80211: add src and dst addr attributes for control port tx/rx")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200224101910.b746e263287a.I9eb15d6895515179d50964dec3550c9dc784bb93@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:22:02 +01:00
Johannes Berg
8079e4fee5 Revert "mac80211: support NL80211_EXT_FEATURE_CONTROL_PORT_OVER_NL80211_MAC_ADDRS"
This reverts commit 9b125c2799.

As Jouni points out, there's really no need for this, since the
RSN pre-authentication frames are normal data frames, not port
control frames (locally).

Fixes: 9b125c2799 ("mac80211: support NL80211_EXT_FEATURE_CONTROL_PORT_OVER_NL80211_MAC_ADDRS")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200224101910.b87da63a3cd6.Ic94bc51a370c4aa7d19fbca9b96d90ab703257dc@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:21:52 +01:00
Madhuparna Bhowmik
6132c1d903 net: core: devlink.c: Hold devlink->lock from the beginning of devlink_dpipe_table_register()
devlink_dpipe_table_find() should be called under either
rcu_read_lock() or devlink->lock. devlink_dpipe_table_register()
calls devlink_dpipe_table_find() without holding the lock
and acquires it later. Therefore hold the devlink->lock
from the beginning of devlink_dpipe_table_register().

Suggested-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-23 21:17:37 -08:00
Li RongQing
c3e042f541 igmp: remove unused macro IGMP_Vx_UNSOLICITED_REPORT_INTERVAL
After commit 2690048c01 ("net: igmp: Allow user-space
configuration of igmp unsolicited report interval"), they
are not used now

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-23 21:15:36 -08:00
Neal Cardwell
dad8cea7ad tcp: fix TFO SYNACK undo to avoid double-timestamp-undo
In a rare corner case the new logic for undo of SYNACK RTO could
result in triggering the warning in tcp_fastretrans_alert() that says:
        WARN_ON(tp->retrans_out != 0);

The warning looked like:

WARNING: CPU: 1 PID: 1 at net/ipv4/tcp_input.c:2818 tcp_ack+0x13e0/0x3270

The sequence that tickles this bug is:
 - Fast Open server receives TFO SYN with data, sends SYNACK
 - (client receives SYNACK and sends ACK, but ACK is lost)
 - server app sends some data packets
 - (N of the first data packets are lost)
 - server receives client ACK that has a TS ECR matching first SYNACK,
   and also SACKs suggesting the first N data packets were lost
    - server performs TS undo of SYNACK RTO, then immediately
      enters recovery
    - buggy behavior then performed a *second* undo that caused
      the connection to be in CA_Open with retrans_out != 0

Basically, the incoming ACK packet with SACK blocks causes us to first
undo the cwnd reduction from the SYNACK RTO, but then immediately
enters fast recovery, which then makes us eligible for undo again. And
then tcp_rcv_synrecv_state_fastopen() accidentally performs an undo
using a "mash-up" of state from two different loss recovery phases: it
uses the timestamp info from the ACK of the original SYNACK, and the
undo_marker from the fast recovery.

This fix refines the logic to only invoke the tcp_try_undo_loss()
inside tcp_rcv_synrecv_state_fastopen() if the connection is still in
CA_Loss.  If peer SACKs triggered fast recovery, then
tcp_rcv_synrecv_state_fastopen() can't safely undo.

Fixes: 794200d662 ("tcp: undo cwnd on Fast Open spurious SYNACK retransmit")
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-23 17:23:35 -08:00
Paolo Abeni
39f3b41aa7 net: genetlink: return the error code when attribute parsing fails.
Currently if attribute parsing fails and the genl family
does not support parallel operation, the error code returned
by __nlmsg_parse() is discarded by genl_family_rcv_msg_attrs_parse().

Be sure to report the error for all genl families.

Fixes: c10e6cf85e ("net: genetlink: push attrbuf allocation and parsing to a separate function")
Fixes: ab5b526da0 ("net: genetlink: always allocate separate attrs for dumpit ops")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-22 21:58:33 -08:00
Matteo Croce
3e72dfdf82 ipv4: ensure rcu_read_lock() in cipso_v4_error()
Similarly to commit c543cb4a5f ("ipv4: ensure rcu_read_lock() in
ipv4_link_failure()"), __ip_options_compile() must be called under rcu
protection.

Fixes: 3da1ed7ac3 ("net: avoid use IPCB in cipso_v4_error")
Suggested-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-22 21:45:55 -08:00
Jozsef Kadlecsik
8af1c6fbd9 netfilter: ipset: Fix forceadd evaluation path
When the forceadd option is enabled, the hash:* types should find and replace
the first entry in the bucket with the new one if there are no reuseable
(deleted or timed out) entries. However, the position index was just not set
to zero and remained the invalid -1 if there were no reuseable entries.

Reported-by: syzbot+6a86565c74ebe30aea18@syzkaller.appspotmail.com
Fixes: 23c42a403a ("netfilter: ipset: Introduction of new commands and protocol version 7")
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
2020-02-22 12:13:45 +01:00
Jozsef Kadlecsik
f66ee0410b netfilter: ipset: Fix "INFO: rcu detected stall in hash_xxx" reports
In the case of huge hash:* types of sets, due to the single spinlock of
a set the processing of the whole set under spinlock protection could take
too long.

There were four places where the whole hash table of the set was processed
from bucket to bucket under holding the spinlock:

- During resizing a set, the original set was locked to exclude kernel side
  add/del element operations (userspace add/del is excluded by the
  nfnetlink mutex). The original set is actually just read during the
  resize, so the spinlocking is replaced with rcu locking of regions.
  However, thus there can be parallel kernel side add/del of entries.
  In order not to loose those operations a backlog is added and replayed
  after the successful resize.
- Garbage collection of timed out entries was also protected by the spinlock.
  In order not to lock too long, region locking is introduced and a single
  region is processed in one gc go. Also, the simple timer based gc running
  is replaced with a workqueue based solution. The internal book-keeping
  (number of elements, size of extensions) is moved to region level due to
  the region locking.
- Adding elements: when the max number of the elements is reached, the gc
  was called to evict the timed out entries. The new approach is that the gc
  is called just for the matching region, assuming that if the region
  (proportionally) seems to be full, then the whole set does. We could scan
  the other regions to check every entry under rcu locking, but for huge
  sets it'd mean a slowdown at adding elements.
- Listing the set header data: when the set was defined with timeout
  support, the garbage collector was called to clean up timed out entries
  to get the correct element numbers and set size values. Now the set is
  scanned to check non-timed out entries, without actually calling the gc
  for the whole set.

Thanks to Florian Westphal for helping me to solve the SOFTIRQ-safe ->
SOFTIRQ-unsafe lock order issues during working on the patch.

Reported-by: syzbot+4b0e9d4ff3cf117837e5@syzkaller.appspotmail.com
Reported-by: syzbot+c27b8d5010f45c666ed1@syzkaller.appspotmail.com
Reported-by: syzbot+68a806795ac89df3aa1c@syzkaller.appspotmail.com
Fixes: 23c42a403a ("netfilter: ipset: Introduction of new commands and protocol version 7")
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
2020-02-22 12:00:06 +01:00
David S. Miller
b105e8e281 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-02-21

The following pull-request contains BPF updates for your *net-next* tree.

We've added 25 non-merge commits during the last 4 day(s) which contain
a total of 33 files changed, 2433 insertions(+), 161 deletions(-).

The main changes are:

1) Allow for adding TCP listen sockets into sock_map/hash so they can be used
   with reuseport BPF programs, from Jakub Sitnicki.

2) Add a new bpf_program__set_attach_target() helper for adding libbpf support
   to specify the tracepoint/function dynamically, from Eelco Chaudron.

3) Add bpf_read_branch_records() BPF helper which helps use cases like profile
   guided optimizations, from Daniel Xu.

4) Enable bpf_perf_event_read_value() in all tracing programs, from Song Liu.

5) Relax BTF mandatory check if only used for libbpf itself e.g. to process
   BTF defined maps, from Andrii Nakryiko.

6) Move BPF selftests -mcpu compilation attribute from 'probe' to 'v3' as it has
   been observed that former fails in envs with low memlock, from Yonghong Song.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-21 15:22:45 -08:00
David S. Miller
e65ee2fb54 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflict resolution of ice_virtchnl_pf.c based upon work by
Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-21 13:39:34 -08:00
Jakub Sitnicki
035ff358f2 net: Generate reuseport group ID on group creation
Commit 736b46027e ("net: Add ID (if needed) to sock_reuseport and expose
reuseport_lock") has introduced lazy generation of reuseport group IDs that
survive group resize.

By comparing the identifier we check if BPF reuseport program is not trying
to select a socket from a BPF map that belongs to a different reuseport
group than the one the packet is for.

Because SOCKARRAY used to be the only BPF map type that can be used with
reuseport BPF, it was possible to delay the generation of reuseport group
ID until a socket from the group was inserted into BPF map for the first
time.

Now that SOCK{MAP,HASH} can be used with reuseport BPF we have two options,
either generate the reuseport ID on map update, like SOCKARRAY does, or
allocate an ID from the start when reuseport group gets created.

This patch takes the latter approach to keep sockmap free of calls into
reuseport code. This streamlines the reuseport_id access as its lifetime
now matches the longevity of reuseport object.

The cost of this simplification, however, is that we allocate reuseport IDs
for all SO_REUSEPORT users. Even those that don't use SOCKARRAY in their
setups. With the way identifiers are currently generated, we can have at
most S32_MAX reuseport groups, which hopefully is sufficient. If we ever
get close to the limit, we can switch an u64 counter like sk_cookie.

Another change is that we now always call into SOCKARRAY logic to unlink
the socket from the map when unhashing or closing the socket. Previously we
did it only when at least one socket from the group was in a BPF map.

It is worth noting that this doesn't conflict with sockmap tear-down in
case a socket is in a SOCK{MAP,HASH} and belongs to a reuseport
group. sockmap tear-down happens first:

  prot->unhash
  `- tcp_bpf_unhash
     |- tcp_bpf_remove
     |  `- while (sk_psock_link_pop(psock))
     |     `- sk_psock_unlink
     |        `- sock_map_delete_from_link
     |           `- __sock_map_delete
     |              `- sock_map_unref
     |                 `- sk_psock_put
     |                    `- sk_psock_drop
     |                       `- rcu_assign_sk_user_data(sk, NULL)
     `- inet_unhash
        `- reuseport_detach_sock
           `- bpf_sk_reuseport_detach
              `- WRITE_ONCE(sk->sk_user_data, NULL)

Suggested-by: Martin Lau <kafai@fb.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200218171023.844439-10-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
9fed9000c5 bpf: Allow selecting reuseport socket from a SOCKMAP/SOCKHASH
SOCKMAP & SOCKHASH now support storing references to listening
sockets. Nothing keeps us from using these map types a collection of
sockets to select from in BPF reuseport programs. Whitelist the map types
with the bpf_sk_select_reuseport helper.

The restriction that the socket has to be a member of a reuseport group
still applies. Sockets in SOCKMAP/SOCKHASH that don't have sk_reuseport_cb
set are not a valid target and we signal it with -EINVAL.

The main benefit from this change is that, in contrast to
REUSEPORT_SOCKARRAY, SOCK{MAP,HASH} don't impose a restriction that a
listening socket can be just one BPF map at the same time.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200218171023.844439-9-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
1d59f3bcee bpf, sockmap: Let all kernel-land lookup values in SOCKMAP/SOCKHASH
Don't require the kernel code, like BPF helpers, that needs access to
SOCK{MAP,HASH} map contents to live in net/core/sock_map.c. Expose the
lookup operation to all kernel-land.

Lookup from BPF context is not whitelisted yet. While syscalls have a
dedicated lookup handler.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-8-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
c1cdf65da0 bpf, sockmap: Return socket cookie on lookup from syscall
Tooling that populates the SOCK{MAP,HASH} with sockets from user-space
needs a way to inspect its contents. Returning the struct sock * that the
map holds to user-space is neither safe nor useful. An approach established
by REUSEPORT_SOCKARRAY is to return a socket cookie (a unique identifier)
instead.

Since socket cookies are u64 values, SOCK{MAP,HASH} need to support such a
value size for lookup to be possible. This requires special handling on
update, though. Attempts to do a lookup on a map holding u32 values will be
met with ENOSPC error.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-7-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
6e830c2f6c bpf, sockmap: Don't set up upcalls and progs for listening sockets
Now that sockmap/sockhash can hold listening sockets, when setting up the
psock we will (i) grab references to verdict/parser progs, and (2) override
socket upcalls sk_data_ready and sk_write_space.

However, since we cannot redirect to listening sockets so we don't need to
link the socket to the BPF progs. And more importantly we don't want the
listening socket to have overridden upcalls because they would get
inherited by child sockets cloned from it.

Introduce a separate initialization path for listening sockets that does
not change the upcalls and ignores the BPF progs.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-6-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
8ca30379a4 bpf, sockmap: Allow inserting listening TCP sockets into sockmap
In order for sockmap/sockhash types to become generic collections for
storing TCP sockets we need to loosen the checks during map update, while
tightening the checks in redirect helpers.

Currently sock{map,hash} require the TCP socket to be in established state,
which prevents inserting listening sockets.

Change the update pre-checks so the socket can also be in listening state.

Since it doesn't make sense to redirect with sock{map,hash} to listening
sockets, add appropriate socket state checks to BPF redirect helpers too.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-5-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
e80251555f tcp_bpf: Don't let child socket inherit parent protocol ops on copy
Prepare for cloning listening sockets that have their protocol callbacks
overridden by sk_msg. Child sockets must not inherit parent callbacks that
access state stored in sk_user_data owned by the parent.

Restore the child socket protocol callbacks before it gets hashed and any
of the callbacks can get invoked.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-4-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
f1ff5ce2cd net, sk_msg: Clear sk_user_data pointer on clone if tagged
sk_user_data can hold a pointer to an object that is not intended to be
shared between the parent socket and the child that gets a pointer copy on
clone. This is the case when sk_user_data points at reference-counted
object, like struct sk_psock.

One way to resolve it is to tag the pointer with a no-copy flag by
repurposing its lowest bit. Based on the bit-flag value we clear the child
sk_user_data pointer after cloning the parent socket.

The no-copy flag is stored in the pointer itself as opposed to externally,
say in socket flags, to guarantee that the pointer and the flag are copied
from parent to child socket in an atomic fashion. Parent socket state is
subject to change while copying, we don't hold any locks at that time.

This approach relies on an assumption that sk_user_data holds a pointer to
an object aligned at least 2 bytes. A manual audit of existing users of
rcu_dereference_sk_user_data helper confirms our assumption.

Also, an RCU-protected sk_user_data is not likely to hold a pointer to a
char value or a pathological case of "struct { char c; }". To be safe, warn
when the flag-bit is set when setting sk_user_data to catch any future
misuses.

It is worth considering why clearing sk_user_data unconditionally is not an
option. There exist users, DRBD, NVMe, and Xen drivers being among them,
that rely on the pointer being copied when cloning the listening socket.

Potentially we could distinguish these users by checking if the listening
socket has been created in kernel-space via sock_create_kern, and hence has
sk_kern_sock flag set. However, this is not the case for NVMe and Xen
drivers, which create sockets without marking them as belonging to the
kernel.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-3-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
b8e202d1d1 net, sk_msg: Annotate lockless access to sk_prot on clone
sk_msg and ULP frameworks override protocol callbacks pointer in
sk->sk_prot, while tcp accesses it locklessly when cloning the listening
socket, that is with neither sk_lock nor sk_callback_lock held.

Once we enable use of listening sockets with sockmap (and hence sk_msg),
there will be shared access to sk->sk_prot if socket is getting cloned
while being inserted/deleted to/from the sockmap from another CPU:

Read side:

tcp_v4_rcv
  sk = __inet_lookup_skb(...)
  tcp_check_req(sk)
    inet_csk(sk)->icsk_af_ops->syn_recv_sock
      tcp_v4_syn_recv_sock
        tcp_create_openreq_child
          inet_csk_clone_lock
            sk_clone_lock
              READ_ONCE(sk->sk_prot)

Write side:

sock_map_ops->map_update_elem
  sock_map_update_elem
    sock_map_update_common
      sock_map_link_no_progs
        tcp_bpf_init
          tcp_bpf_update_sk_prot
            sk_psock_update_proto
              WRITE_ONCE(sk->sk_prot, ops)

sock_map_ops->map_delete_elem
  sock_map_delete_elem
    __sock_map_delete
     sock_map_unref
       sk_psock_put
         sk_psock_drop
           sk_psock_restore_proto
             tcp_update_ulp
               WRITE_ONCE(sk->sk_prot, proto)

Mark the shared access with READ_ONCE/WRITE_ONCE annotations.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200218171023.844439-2-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Linus Torvalds
3dc55dba67 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Limit xt_hashlimit hash table size to avoid OOM or hung tasks, from
    Cong Wang.

 2) Fix deadlock in xsk by publishing global consumer pointers when NAPI
    is finished, from Magnus Karlsson.

 3) Set table field properly to RT_TABLE_COMPAT when necessary, from
    Jethro Beekman.

 4) NLA_STRING attributes are not necessary NULL terminated, deal wiht
    that in IFLA_ALT_IFNAME. From Eric Dumazet.

 5) Fix checksum handling in atlantic driver, from Dmitry Bezrukov.

 6) Handle mtu==0 devices properly in wireguard, from Jason A.
    Donenfeld.

 7) Fix several lockdep warnings in bonding, from Taehee Yoo.

 8) Fix cls_flower port blocking, from Jason Baron.

 9) Sanitize internal map names in libbpf, from Toke Høiland-Jørgensen.

10) Fix RDMA race in qede driver, from Michal Kalderon.

11) Fix several false lockdep warnings by adding conditions to
    list_for_each_entry_rcu(), from Madhuparna Bhowmik.

12) Fix sleep in atomic in mlx5 driver, from Huy Nguyen.

13) Fix potential deadlock in bpf_map_do_batch(), from Yonghong Song.

14) Hey, variables declared in switch statement before any case
    statements are not initialized. I learn something every day. Get
    rids of this stuff in several parts of the networking, from Kees
    Cook.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
  bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs.
  bnxt_en: Improve device shutdown method.
  net: netlink: cap max groups which will be considered in netlink_bind()
  net: thunderx: workaround BGX TX Underflow issue
  ionic: fix fw_status read
  net: disable BRIDGE_NETFILTER by default
  net: macb: Properly handle phylink on at91rm9200
  s390/qeth: fix off-by-one in RX copybreak check
  s390/qeth: don't warn for napi with 0 budget
  s390/qeth: vnicc Fix EOPNOTSUPP precedence
  openvswitch: Distribute switch variables for initialization
  net: ip6_gre: Distribute switch variables for initialization
  net: core: Distribute switch variables for initialization
  udp: rehash on disconnect
  net/tls: Fix to avoid gettig invalid tls record
  bpf: Fix a potential deadlock with bpf_map_do_batch
  bpf: Do not grab the bucket spinlock by default on htab batch ops
  ice: Wait for VF to be reset/ready before configuration
  ice: Don't tell the OS that link is going down
  ice: Don't reject odd values of usecs set by user
  ...
2020-02-21 11:59:51 -08:00
Manivannan Sadhasivam
31d6cbeeb8 net: qrtr: Fix the local node ID as 1
In order to start the QRTR nameservice, the local node ID needs to be
valid. Hence, fix it to 1. Previously, the node ID was configured through
a userspace tool before starting the nameservice daemon. Since we have now
integrated the nameservice handling to kernel, this change is necessary
for making it functional.

Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-21 11:59:12 -08:00
Manivannan Sadhasivam
0c2204a4ad net: qrtr: Migrate nameservice to kernel from userspace
The QRTR nameservice has been maintained in userspace for some time. This
commit migrates it to Linux kernel. This change is required in order to
eliminate the need of starting a userspace daemon for making the WiFi
functional for ath11k based devices. Since the QRTR NS is not usually
packed in most of the distros, users need to clone, build and install it
to get the WiFi working. It will become a hassle when the user doesn't
have any other source of network connectivity.

Signed-off-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-21 11:59:12 -08:00
Emmanuel Grumbach
7bb106eb68 cfg80211: remove support for adjacent channel compensation
The only driver that used that was iwlwifi and it removed
support for this. Remove the feature here as well.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200214232336.a530de38e511.I393bc395f6037c8cca6421ed550e3072dc248aed@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-21 20:00:39 +01:00
Johannes Berg
ca98c47d54 mac80211: check vif pointer before airtime calculation
In case of monitor mode injection, vif may be NULL, don't crash
on that in ieee80211_calc_expected_tx_airtime().

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20200221104544.dddb7a3568fd.I0ede2733a3c76e95daeab07538449ea847e7b78d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-21 12:11:39 +01:00
Amol Grover
3ee9306b35 cfg80211: Pass lockdep expression to RCU lists
rdev->sched_scan_req_list maybe traversed using list_for_each_entry_rcu
outside an RCU read-side critical section but under the protection
of rtnl_mutex.

Hence, add corresponding lockdep expression to silence false-positive
warnings, and harden RCU lists.

Signed-off-by: Amol Grover <frextrite@gmail.com>
Link: https://lore.kernel.org/r/20200219091102.10709-1-frextrite@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-21 12:11:30 +01:00
Andrei Otcheretianski
0daa63ed4c mac80211: Remove a redundant mutex unlock
The below-mentioned commit changed the code to unlock *inside*
the function, but previously the unlock was *outside*. It failed
to remove the outer unlock, however, leading to double unlock.

Fix this.

Fixes: 33483a6b88 ("mac80211: fix missing unlock on error in ieee80211_mark_sta_auth()")
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Link: https://lore.kernel.org/r/20200221104719.cce4741cf6eb.I671567b185c8a4c2409377e483fd149ce590f56d@changeid
[rewrite commit message to better explain what happened]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-21 10:59:36 +01:00
Johannes Berg
a7ee7d44b5 cfg80211: check reg_rule for NULL in handle_channel_custom()
We may end up with a NULL reg_rule after the loop in
handle_channel_custom() if the bandwidth didn't fit,
check if this is the case and bail out if so.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200221104449.3b558a50201c.I4ad3725c4dacaefd2d18d3cc65ba6d18acd5dbfe@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-21 10:59:27 +01:00
Johannes Berg
9951ebfcdf nl80211: fix potential leak in AP start
If nl80211_parse_he_obss_pd() fails, we leak the previously
allocated ACL memory. Free it in this case.

Fixes: 796e90f42b ("cfg80211: add support for parsing OBBS_PD attributes")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200221104142.835aba4cdd14.I1923b55ba9989c57e13978f91f40bfdc45e60cbd@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-21 10:50:26 +01:00
Nikolay Aleksandrov
3a20773bee net: netlink: cap max groups which will be considered in netlink_bind()
Since nl_groups is a u32 we can't bind more groups via ->bind
(netlink_bind) call, but netlink has supported more groups via
setsockopt() for a long time and thus nlk->ngroups could be over 32.
Recently I added support for per-vlan notifications and increased the
groups to 33 for NETLINK_ROUTE which exposed an old bug in the
netlink_bind() code causing out-of-bounds access on archs where unsigned
long is 32 bits via test_bit() on a local variable. Fix this by capping the
maximum groups in netlink_bind() to BITS_PER_TYPE(u32), effectively
capping them at 32 which is the minimum of allocated groups and the
maximum groups which can be bound via netlink_bind().

CC: Christophe Leroy <christophe.leroy@c-s.fr>
CC: Richard Guy Briggs <rgb@redhat.com>
Fixes: 4f52090052 ("netlink: have netlink per-protocol bind function return an error code.")
Reported-by: Erhard F. <erhard_f@mailbox.org>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 16:02:08 -08:00
Roman Kiryanov
98bda63e20 net: disable BRIDGE_NETFILTER by default
The description says 'If unsure, say N.' but
the module is built as M by default (once
the dependencies are satisfied).

When the module is selected (Y or M), it enables
NETFILTER_FAMILY_BRIDGE and SKB_EXTENSIONS
which alter kernel internal structures.

We (Android Studio Emulator) currently do not
use this module and think this it is more consistent
to have it disabled by default as opposite to
disabling it explicitly to prevent enabling
NETFILTER_FAMILY_BRIDGE and SKB_EXTENSIONS.

Signed-off-by: Roman Kiryanov <rkir@google.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 15:02:02 -08:00
Julian Wiedmann
2e92a2d0e4 net: use netif_is_bridge_port() to check for IFF_BRIDGE_PORT
Trivial cleanup, so that all bridge port-specific code can be found in
one go.

CC: Johannes Berg <johannes@sipsolutions.net>
CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:10:32 -08:00
Ilias Apalodimas
458de8a97f net: page_pool: API cleanup and comments
Functions starting with __ usually indicate those which are exported,
but should not be called directly. Update some of those declared in the
API and make it more readable.

page_pool_unmap_page() and page_pool_release_page() were doing
exactly the same thing calling __page_pool_clean_page().  Let's
rename __page_pool_clean_page() to page_pool_release_page() and
export it in order to show up on perf logs and get rid of
page_pool_unmap_page().

Finally rename __page_pool_put_page() to page_pool_put_page() since we
can now directly call it from drivers and rename the existing
page_pool_put_page() to page_pool_put_full_page() since they do the same
thing but the latter is trying to sync the full DMA area.

This patch also updates netsec, mvneta and stmmac drivers which use
those functions.

Suggested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:09:25 -08:00
Li RongQing
807ea87032 net: remove unused macro from fib_trie.c
TNODE_KMALLOC_MAX and VERSION are not used, so remove them

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:02:23 -08:00
Li RongQing
94e512de3e net: neigh: remove unused NEIGH_SYSCTL_MS_JIFFIES_ENTRY
this macro is never used, so remove it

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:02:23 -08:00
Kees Cook
16a556eeb7 openvswitch: Distribute switch variables for initialization
Variables declared in a switch statement before any case statements
cannot be automatically initialized with compiler instrumentation (as
they are not part of any execution flow). With GCC's proposed automatic
stack variable initialization feature, this triggers a warning (and they
don't get initialized). Clang's automatic stack variable initialization
(via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
doesn't initialize such variables[1]. Note that these warnings (or silent
skipping) happen before the dead-store elimination optimization phase,
so even when the automatic initializations are later elided in favor of
direct initializations, the warnings remain.

To avoid these problems, move such variables into the "case" where
they're used or lift them up into the main function body.

net/openvswitch/flow_netlink.c: In function ‘validate_set’:
net/openvswitch/flow_netlink.c:2711:29: warning: statement will never be executed [-Wswitch-unreachable]
 2711 |  const struct ovs_key_ipv4 *ipv4_key;
      |                             ^~~~~~~~

[1] https://bugs.llvm.org/show_bug.cgi?id=44916

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:00:19 -08:00
Kees Cook
46d30cb104 net: ip6_gre: Distribute switch variables for initialization
Variables declared in a switch statement before any case statements
cannot be automatically initialized with compiler instrumentation (as
they are not part of any execution flow). With GCC's proposed automatic
stack variable initialization feature, this triggers a warning (and they
don't get initialized). Clang's automatic stack variable initialization
(via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
doesn't initialize such variables[1]. Note that these warnings (or silent
skipping) happen before the dead-store elimination optimization phase,
so even when the automatic initializations are later elided in favor of
direct initializations, the warnings remain.

To avoid these problems, move such variables into the "case" where
they're used or lift them up into the main function body.

net/ipv6/ip6_gre.c: In function ‘ip6gre_err’:
net/ipv6/ip6_gre.c:440:32: warning: statement will never be executed [-Wswitch-unreachable]
  440 |   struct ipv6_tlv_tnl_enc_lim *tel;
      |                                ^~~

net/ipv6/ip6_tunnel.c: In function ‘ip6_tnl_err’:
net/ipv6/ip6_tunnel.c:520:32: warning: statement will never be executed [-Wswitch-unreachable]
  520 |   struct ipv6_tlv_tnl_enc_lim *tel;
      |                                ^~~

[1] https://bugs.llvm.org/show_bug.cgi?id=44916

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:00:19 -08:00
Kees Cook
161d179261 net: core: Distribute switch variables for initialization
Variables declared in a switch statement before any case statements
cannot be automatically initialized with compiler instrumentation (as
they are not part of any execution flow). With GCC's proposed automatic
stack variable initialization feature, this triggers a warning (and they
don't get initialized). Clang's automatic stack variable initialization
(via CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also
doesn't initialize such variables[1]. Note that these warnings (or silent
skipping) happen before the dead-store elimination optimization phase,
so even when the automatic initializations are later elided in favor of
direct initializations, the warnings remain.

To avoid these problems, move such variables into the "case" where
they're used or lift them up into the main function body.

net/core/skbuff.c: In function ‘skb_checksum_setup_ip’:
net/core/skbuff.c:4809:7: warning: statement will never be executed [-Wswitch-unreachable]
 4809 |   int err;
      |       ^~~

[1] https://bugs.llvm.org/show_bug.cgi?id=44916

Signed-off-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:00:19 -08:00
Paul Blakey
af699626ee net: sched: Support specifying a starting chain via tc skb ext
Set the starting chain from the tc skb ext chain value. Once we read
the tc skb ext, delete it, so cloned/redirect packets won't inherit it.

In order to lookup a chain by the chain index on the ingress block
at ingress classification, provide a lookup function.

Co-developed-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-02-19 17:49:48 -08:00
Paul Blakey
4371929819 net: sched: Change the block's chain list to an rcu list
To allow lookup of a block's chain under atomic context.

Co-developed-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-02-19 17:49:48 -08:00
Paul Blakey
7d17c544cd net: sched: Pass ingress block to tcf_classify_ingress
On ingress and cls_act qdiscs init, save the block on ingress
mini_Qdisc and and pass it on to ingress classification, so it
can be used for the looking up a specified chain index.

Co-developed-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-02-19 17:49:48 -08:00
Paul Blakey
9410c9409d net: sched: Introduce ingress classification function
TC multi chain configuration can cause offloaded tc chains to miss in
hardware after jumping to some chain. In such cases the software should
continue from the chain that missed in hardware, as the hardware may
have manipulated the packet and updated some counters.

Currently a single tcf classification function serves both ingress and
egress. However, multi chain miss processing (get tc skb extension on
hw miss, set tc skb extension on tc miss) should happen only on
ingress.

Refactor the code to use ingress classification function, and move setting
the tc skb extension from general classification to it, as a prestep
for supporting the hw miss scenario.

Co-developed-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-02-19 17:49:48 -08:00
David S. Miller
41f57cfde1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-02-19

The following pull-request contains BPF updates for your *net* tree.

We've added 10 non-merge commits during the last 10 day(s) which contain
a total of 10 files changed, 93 insertions(+), 31 deletions(-).

The main changes are:

1) batched bpf hashtab fixes from Brian and Yonghong.

2) various selftests and libbpf fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-19 16:42:35 -08:00
Willem de Bruijn
303d0403b8 udp: rehash on disconnect
As of the below commit, udp sockets bound to a specific address can
coexist with one bound to the any addr for the same port.

The commit also phased out the use of socket hashing based only on
port (hslot), in favor of always hashing on {addr, port} (hslot2).

The change broke the following behavior with disconnect (AF_UNSPEC):

    server binds to 0.0.0.0:1337
    server connects to 127.0.0.1:80
    server disconnects
    client connects to 127.0.0.1:1337
    client sends "hello"
    server reads "hello"	// times out, packet did not find sk

On connect the server acquires a specific source addr suitable for
routing to its destination. On disconnect it reverts to the any addr.

The connect call triggers a rehash to a different hslot2. On
disconnect, add the same to return to the original hslot2.

Skip this step if the socket is going to be unhashed completely.

Fixes: 4cdeeee925 ("net: udp: prefer listeners bound to an address")
Reported-by: Pavel Roskin <plroskin@gmail.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-19 16:34:11 -08:00
Rohit Maheshwari
06f5201c63 net/tls: Fix to avoid gettig invalid tls record
Current code doesn't check if tcp sequence number is starting from (/after)
1st record's start sequnce number. It only checks if seq number is before
1st record's end sequnce number. This problem will always be a possibility
in re-transmit case. If a record which belongs to a requested seq number is
already deleted, tls_get_record will start looking into list and as per the
check it will look if seq number is before the end seq of 1st record, which
will always be true and will return 1st record always, it should in fact
return NULL.
As part of the fix, start looking each record only if the sequence number
lies in the list else return NULL.
There is one more check added, driver look for the start marker record to
handle tcp packets which are before the tls offload start sequence number,
hence return 1st record if the record is tls start marker and seq number is
before the 1st record's starting sequence number.

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Rohit Maheshwari <rohitm@chelsio.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-19 16:32:06 -08:00
Howard Chung
eed467b517 Bluetooth: fix passkey uninitialized when used
This patch fix the issue: warning:variable 'passkey' is uninitialized
when used here

Link: https://groups.google.com/forum/#!topic/clang-built-linux/kyRKCjRsGoU

Fixes: cee5f20fec ("Bluetooth: secure bluetooth stack from bluedump attack")
Reported-by: kbuild test robot <lkp@intel.com>
Suggested-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Howard Chung <howardchung@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-19 20:35:46 +01:00
Luiz Augusto von Dentz
1c22d3cda8 Bluetooth: RFCOMM: Use MTU auto tune logic
This reuse the L2CAP MTU auto logic to select the MTU used for RFCOMM
channels, this should increase the maximum from 1013 to 1021 when 3-DH5
is supported.

Since it does not set an L2CAP MTU we no longer need a debugfs so that
is removed.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-19 20:31:45 +01:00
Madhuparna Bhowmik
33c4acbe2f bridge: br_stp: Use built-in RCU list checking
list_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-19 11:13:43 -08:00
Christian Brauner
9cb8e048e5 net/ipv4/sysctl: show tcp_{allowed, available}_congestion_control in non-initial netns
It is currenty possible to switch the TCP congestion control algorithm
in non-initial network namespaces:

unshare -U --map-root --net --fork --pid --mount-proc
echo "reno" > /proc/sys/net/ipv4/tcp_congestion_control

works just fine. But currently non-initial network namespaces have no
way of kowing which congestion algorithms are available or allowed other
than through trial and error by writing the names of the algorithms into
the aforementioned file.
Since we already allow changing the congestion algorithm in non-initial
network namespaces by exposing the tcp_congestion_control file there is
no reason to not also expose the
tcp_{allowed,available}_congestion_control files to non-initial network
namespaces. After this change a container with a separate network
namespace will show:

root@f1:~# ls -al /proc/sys/net/ipv4/tcp_* | grep congestion
-rw-r--r-- 1 root root 0 Feb 19 11:54 /proc/sys/net/ipv4/tcp_allowed_congestion_control
-r--r--r-- 1 root root 0 Feb 19 11:54 /proc/sys/net/ipv4/tcp_available_congestion_control
-rw-r--r-- 1 root root 0 Feb 19 11:54 /proc/sys/net/ipv4/tcp_congestion_control

Link: https://github.com/lxc/lxc/issues/3267
Reported-by: Haw Loeung <haw.loeung@canonical.com>
Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-19 11:04:31 -08:00
Amol Grover
a7a9456e8d net: hsr: Pass lockdep expression to RCU lists
node_db is traversed using list_for_each_entry_rcu
outside an RCU read-side critical section but under the protection
of hsr->list_lock.

Hence, add corresponding lockdep expression to silence false-positive
warnings, and harden RCU lists.

Signed-off-by: Amol Grover <frextrite@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-19 10:55:42 -08:00
Raed Salem
dda520c4d4 ESP: Export esp_output_fill_trailer function
The esp fill trailer method is identical for both
IPv6 and IPv4.

Share the implementation for esp6 and esp to avoid
code duplication in addition it could be also used
at various drivers code.

Signed-off-by: Raed Salem <raeds@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-02-19 13:52:32 +01:00
Huang Zijiang
a4c278d1be xfrm: Use kmem_cache_zalloc() instead of kmem_cache_alloc() with flag GFP_ZERO.
Use kmem_cache_zalloc instead of manually setting kmem_cache_alloc
with flag GFP_ZERO since kzalloc sets allocated memory
to zero.

Change in v2:
     add indation

Signed-off-by: Huang Zijiang <huang.zijiang@zte.com.cn>
Signed-off-by: Yi Wang <wang.yi59@zte.com.cn>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-02-19 09:33:56 +01:00
Aya Levin
f623e59705 ethtool: Add support for low latency RS FEC
Add support for low latency Reed Solomon FEC as LLRS.

The LL-FEC is defined by the 25G/50G ethernet consortium,
in the document titled "Low Latency Reed Solomon Forward Error Correction"

Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
CC: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
2020-02-18 19:17:31 -08:00
Aya Levin
573ed90aa5 devlink: Force enclosing array on binary fmsg data
Add a new API for start/end binary array brackets [] to force array
around binary data as required from JSON. With this restriction, re-open
API to set binary fmsg data.

Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-02-18 19:17:28 -08:00
David S. Miller
7c8c1697c7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

This batch contains Netfilter fixes for net:

1) Restrict hashlimit size to 1048576, from Cong Wang.

2) Check for offload flags from nf_flow_table_offload_setup(),
   this fixes a crash in case the hardware offload is disabled.
   From Florian Westphal.

3) Three preparation patches to extend the conntrack clash resolution,
   from Florian.

4) Extend clash resolution to deal with DNS packets from the same flow
   racing to set up the NAT configuration.

5) Small documentation fix in pipapo, from Stefano Brivio.

6) Remove misleading unlikely() from pipapo_refill(), also from Stefano.

7) Reduce hashlimit mutex scope, from Cong Wang. This patch is actually
   triggering another problem, still under discussion, another patch to
   fix this will follow up.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 15:44:13 -08:00
Stefano Brivio
9a7712048f netfilter: nft_set_pipapo: Don't abuse unlikely() in pipapo_refill()
I originally used unlikely() in the if (match_only) clause, which
we hit on the mapping table for the last field in a set, to ensure
we avoid branching to the rest of for loop body, which is executed
more frequently.

However, Pablo reports, this is confusing as it gives the impression
that this is not a common case, and it's actually not the intended
usage of unlikely().

I couldn't observe any statistical difference in matching rates on
x864_64 and aarch64 without it, so just drop it.

Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-18 22:07:09 +01:00
Stefano Brivio
bd97ad51a7 netfilter: nft_set_pipapo: Fix mapping table example in comments
In both insertion and lookup examples, the two element pointers
of rule mapping tables were swapped. Fix that.

Reported-by: Pablo Neira Ayuso <pablo@netfilter.org>
Fixes: 3c4287f620 ("nf_tables: Add set type for arbitrary concatenation of ranges")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-18 22:06:54 +01:00
Luiz Augusto von Dentz
a2a8b0b4ad Bluetooth: Fix crash when using new BT_PHY option
This fixes the invalid check for connected socket which causes the
following trace due to sco_pi(sk)->conn being NULL:

RIP: 0010:sco_sock_getsockopt+0x2ff/0x800 net/bluetooth/sco.c:966

L2CAP has also been fixed since it has the same problem.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-18 22:02:15 +01:00
Madhuparna Bhowmik
a2cfb96cc3 flow_table.c: Use built-in RCU list checking
hlist_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 12:46:27 -08:00
Madhuparna Bhowmik
53742e69e8 datapath.c: Use built-in RCU list checking
hlist_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 12:46:27 -08:00
Madhuparna Bhowmik
fed48423f1 vport.c: Use built-in RCU list checking
hlist_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 12:46:27 -08:00
Madhuparna Bhowmik
7790614616 meter.c: Use built-in RCU list checking
hlist_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 12:46:26 -08:00
Madhuparna Bhowmik
9facfdb546 netlabel_domainhash.c: Use built-in RCU list checking
list_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 12:44:23 -08:00
Madhuparna Bhowmik
8c70c3d728 net: netlabel: Use built-in RCU list checking
list_for_each_entry_rcu() has built-in RCU and lock checking.

Pass cond argument to list_for_each_entry_rcu() to silence
false lockdep warning when CONFIG_PROVE_RCU_LIST is enabled
by default.

Signed-off-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 12:43:06 -08:00
Paolo Abeni
379349e9bc Revert "net: dev: introduce support for sch BYPASS for lockless qdisc"
This reverts commit ba27b4cdaa

Ahmed reported ouf-of-order issues bisected to commit ba27b4cdaa
("net: dev: introduce support for sch BYPASS for lockless qdisc").
I can't find any working solution other than a plain revert.

This will introduce some minor performance regressions for
pfifo_fast qdisc. I plan to address them in net-next with more
indirect call wrapper boilerplate for qdiscs.

Reported-by: Ahmad Fatoum <a.fatoum@pengutronix.de>
Fixes: ba27b4cdaa ("net: dev: introduce support for sch BYPASS for lockless qdisc")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 12:37:45 -08:00
Florian Westphal
d99bfed58d mptcp: fix bogus socket flag values
Dan Carpenter reports static checker warnings due to bogus BIT() usage:

net/mptcp/subflow.c:571 subflow_write_space() warn: test_bit() takes a bit number
net/mptcp/subflow.c:694 subflow_state_change() warn: test_bit() takes a bit number
net/mptcp/protocol.c:261 ssk_check_wmem() warn: test_bit() takes a bit number
[..]

This is harmless (we use bits 1 & 2 instead of 0 and 1), but would
break eventually when adding BIT(5) (or 6, depends on size of 'long').

Just use 0 and 1, the values are only passed to test/set/clear_bit
functions.

Fixes: 648ef4b886 ("mptcp: Implement MPTCP receive path")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-18 12:05:53 -08:00
Sathish Narsimman
05bd80a104 Bluetooth: Disable Extended Adv if enabled
Disabling LEGACY_ADV when EXT_ADV is enabled causes
'command disallowed' during DIRECTED_ADV. This Patch fixes this
issue.

Signed-off-by: Sathish Narsimman <sathish.narasimman@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-18 09:25:19 +01:00
Sven Eckelmann
8e8ce08198 batman-adv: Don't schedule OGM for disabled interface
A transmission scheduling for an interface which is currently dropped by
batadv_iv_ogm_iface_disable could still be in progress. The B.A.T.M.A.N. V
is simply cancelling the workqueue item in an synchronous way but this is
not possible with B.A.T.M.A.N. IV because the OGM submissions are
intertwined.

Instead it has to stop submitting the OGM when it detect that the buffer
pointer is set to NULL.

Reported-by: syzbot+a98f2016f40b9cd3818a@syzkaller.appspotmail.com
Reported-by: syzbot+ac36b6a33c28a491e929@syzkaller.appspotmail.com
Fixes: c6c8fea297 ("net: Add batman-adv meshing protocol")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-02-18 09:07:55 +01:00
Xin Long
245709ec8b sctp: move the format error check out of __sctp_sf_do_9_1_abort
When T2 timer is to be stopped, the asoc should also be deleted,
otherwise, there will be no chance to call sctp_association_free
and the asoc could last in memory forever.

However, in sctp_sf_shutdown_sent_abort(), after adding the cmd
SCTP_CMD_TIMER_STOP for T2 timer, it may return error due to the
format error from __sctp_sf_do_9_1_abort() and miss adding
SCTP_CMD_ASSOC_FAILED where the asoc will be deleted.

This patch is to fix it by moving the format error check out of
__sctp_sf_do_9_1_abort(), and do it before adding the cmd
SCTP_CMD_TIMER_STOP for T2 timer.

Thanks Hangbin for reporting this issue by the fuzz testing.

v1->v2:
  - improve the comment in the code as Marcelo's suggestion.

Fixes: 96ca468b86 ("sctp: check invalid value of length parameter in error cause")
Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 21:56:03 -08:00
Jason Baron
8a9093c798 net: sched: correct flower port blocking
tc flower rules that are based on src or dst port blocking are sometimes
ineffective due to uninitialized stack data. __skb_flow_dissect() extracts
ports from the skb for tc flower to match against. However, the port
dissection is not done when when the FLOW_DIS_IS_FRAGMENT bit is set in
key_control->flags. All callers of __skb_flow_dissect(), zero-out the
key_control field except for fl_classify() as used by the flower
classifier. Thus, the FLOW_DIS_IS_FRAGMENT may be set on entry to
__skb_flow_dissect(), since key_control is allocated on the stack
and may not be initialized.

Since key_basic and key_control are present for all flow keys, let's
make sure they are initialized.

Fixes: 62230715fd ("flow_dissector: do not dissect l4 ports for fragments")
Co-developed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 21:33:28 -08:00
Gustavo A. R. Silva
2b73812483 net: netlink: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 19:05:06 -08:00
Gustavo A. R. Silva
fbfc8502af net: switchdev: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 19:05:06 -08:00
Gustavo A. R. Silva
45a4296b6e bpf, sockmap: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 19:05:05 -08:00
Gustavo A. R. Silva
9814428a44 NFC: digital: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 19:05:05 -08:00
Ursula Braun
5613f20c93 net/smc: reduce port_event scheduling
IB event handlers schedule the port event worker for further
processing of port state changes. This patch reduces the number of
schedules to avoid duplicate processing of the same port change.

Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
Karsten Graul
5f78fe968d net/smc: simplify normal link termination
smc_lgr_terminate() and smc_lgr_terminate_sched() both result in soft
link termination, smc_lgr_terminate_sched() is scheduling a worker for
this task. Take out complexity by always using the termination worker
and getting rid of smc_lgr_terminate() completely.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
Karsten Graul
ba95206042 net/smc: remove unused parameter of smc_lgr_terminate()
The soft parameter of smc_lgr_terminate() is not used and obsolete.
Remove it.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
Karsten Graul
3739707c45 net/smc: do not delete lgr from list twice
When 2 callers call smc_lgr_terminate() at the same time
for the same lgr, one gets the lgr_lock and deletes the lgr from the
list and releases the lock. Then the second caller gets the lock and
tries to delete it again.
In smc_lgr_terminate() add a check if the link group lgr is already
deleted from the link group list and prevent to try to delete it a
second time.
And add a check if the lgr is marked as freeing, which means that a
termination is already pending.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
Karsten Graul
354ea2baa3 net/smc: use termination worker under send_lock
smc_tx_rdma_write() is called under the send_lock and should not call
smc_lgr_terminate() directly. Call smc_lgr_terminate_sched() instead
which schedules a worker.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
Karsten Graul
55dd575817 net/smc: improve smc_lgr_cleanup()
smc_lgr_cleanup() is called during termination processing, there is no
need to send a DELETE_LINK at that time. A DELETE_LINK should have been
sent before the termination is initiated, if needed.
And remove the extra call to wake_up(&lnk->wr_reg_wait) because
smc_llc_link_inactive() already calls the related helper function
smc_wr_wakeup_reg_wait().

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:50:24 -08:00
Julian Wiedmann
583cb0b412 net: bridge: teach ndo_dflt_bridge_getlink() more brport flags
This enables ndo_dflt_bridge_getlink() to report a bridge port's
offload settings for multicast and broadcast flooding.

CC: Roopa Prabhu <roopa@cumulusnetworks.com>
CC: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:36:40 -08:00
Julian Wiedmann
bd706ff8ea net: vlan: suppress "failed to kill vid" warnings
When a real dev unregisters, vlan_device_event() also unregisters all
of its vlan interfaces. For each VID this ends up in __vlan_vid_del(),
which attempts to remove the VID from the real dev's VLAN filter.

But the unregistering real dev might no longer be able to issue the
required IOs, and return an error. Subsequently we raise a noisy warning
msg that is not appropriate for this situation: the real dev is being
torn down anyway, there shouldn't be any worry about cleanly releasing
all of its HW-internal resources.

So to avoid scaring innocent users, suppress this warning when the
failed deletion happens on an unregistering device.
While at it also convert the raw pr_warn() to a more fitting
netdev_warn().

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:30:54 -08:00
Vlad Buslov
b15e7a6e8d net: sched: don't take rtnl lock during flow_action setup
Refactor tc_setup_flow_action() function not to use rtnl lock and remove
'rtnl_held' argument that is no longer needed.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:17:02 -08:00
Vlad Buslov
7a47281439 net: sched: lock action when translating it to flow_action infra
In order to remove dependency on rtnl lock, take action's tcfa_lock when
constructing its representation as flow_action_entry structure.

Refactor tcf_sample_get_group() to assume that caller holds tcf_lock and
don't take it manually. This callback is only called from flow_action infra
representation translator which now calls it with tcf_lock held, so this
refactoring is necessary to prevent deadlock.

Allocate memory with GFP_ATOMIC flag for ip_tunnel_info copy because
tcf_tunnel_info_copy() is only called from flow_action representation infra
code with tcf_lock spinlock taken.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:17:02 -08:00
Gustavo A. R. Silva
5f27eb055d batman-adv: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Sven Eckelman <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-02-17 22:43:42 +01:00
Sven Eckelmann
3b2582c7af batman-adv: Avoid RCU list-traversal in spinlock
The new CONFIG_PROVE_RCU_LIST requires a condition statement in
(h)list_for_each_entry_rcu when the code might be executed in a non RCU
non-reader section with the writer lock. Otherwise lockdep might cause a
false positive warning like

  =============================
  WARNING: suspicious RCU usage
  -----------------------------
  translation-table.c:940 RCU-list traversed in non-reader section!!

batman-adv is (mostly) following the examples from the RCU documentation
and is using the normal list-traversal primitives instead of the RCU
list-traversal primitives when the writer (spin)lock is held.

The remaining users of RCU list-traversal primitives with writer spinlock
have to be converted to the same style as the rest of the code.

Reported-by: Madhuparna Bhowmik <madhuparnabhowmik10@gmail.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-02-17 16:07:50 +01:00
Florian Westphal
6a757c07e5 netfilter: conntrack: allow insertion of clashing entries
This patch further relaxes the need to drop an skb due to a clash with
an existing conntrack entry.

Current clash resolution handles the case where the clash occurs between
two identical entries (distinct nf_conn objects with same tuples), i.e.:

                    Original                        Reply
existing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.6:5353
clashing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.6:5353

... existing handling will discard the unconfirmed clashing entry and
makes skb->_nfct point to the existing one.  The skb can then be
processed normally just as if the clash would not have existed in the
first place.

For other clashes, the skb needs to be dropped.
This frequently happens with DNS resolvers that send A and AAAA queries
back-to-back when NAT rules are present that cause packets to get
different DNAT transformations applied, for example:

-m statistics --mode random ... -j DNAT --dnat-to 10.0.0.6:5353
-m statistics --mode random ... -j DNAT --dnat-to 10.0.0.7:5353

In this case the A or AAAA query is dropped which incurs a costly
delay during name resolution.

This patch also allows this collision type:
                       Original                   Reply
existing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.6:5353
clashing: 10.2.3.4:42 -> 10.8.8.8:53      10.2.3.4:42 <- 10.0.0.7:5353

In this case, clash is in original direction -- the reply direction
is still unique.

The change makes it so that when the 2nd colliding packet is received,
the clashing conntrack is tagged with new IPS_NAT_CLASH_BIT, gets a fixed
1 second timeout and is inserted in the reply direction only.

The entry is hidden from 'conntrack -L', it will time out quickly
and it can be early dropped because it will never progress to the
ASSURED state.

To avoid special-casing the delete code path to special case
the ORIGINAL hlist_nulls node, a new helper, "hlist_nulls_add_fake", is
added so hlist_nulls_del() will work.

Example:

      CPU A:                               CPU B:
1.  10.2.3.4:42 -> 10.8.8.8:53 (A)
2.                                         10.2.3.4:42 -> 10.8.8.8:53 (AAAA)
3.  Apply DNAT, reply changed to 10.0.0.6
4.                                         10.2.3.4:42 -> 10.8.8.8:53 (AAAA)
5.                                         Apply DNAT, reply changed to 10.0.0.7
6. confirm/commit to conntrack table, no collisions
7.                                         commit clashing entry

Reply comes in:

10.2.3.4:42 <- 10.0.0.6:5353 (A)
 -> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42
10.2.3.4:42 <- 10.0.0.7:5353 (AAAA)
 -> Finds a conntrack, DNAT is reversed & packet forwarded to 10.2.3.4:42
    The conntrack entry is deleted from table, as it has the NAT_CLASH
    bit set.

In case of a retransmit from ORIGINAL dir, all further packets will get
the DNAT transformation to 10.0.0.6.

I tried to come up with other solutions but they all have worse
problems.

Alternatives considered were:
1.  Confirm ct entries at allocation time, not in postrouting.
 a. will cause uneccesarry work when the skb that creates the
    conntrack is dropped by ruleset.
 b. in case nat is applied, ct entry would need to be moved in
    the table, which requires another spinlock pair to be taken.
 c. breaks the 'unconfirmed entry is private to cpu' assumption:
    we would need to guard all nfct->ext allocation requests with
    ct->lock spinlock.

2. Make the unconfirmed list a hash table instead of a pcpu list.
   Shares drawback c) of the first alternative.

3. Document this is expected and force users to rearrange their
   ruleset (e.g. by using "-m cluster" instead of "-m statistics").
   nft has the 'jhash' expression which can be used instead of 'numgen'.

   Major drawback: doesn't fix what I consider a bug, not very realistic
   and I believe its reasonable to have the existing rulesets to 'just
   work'.

4. Document this is expected and force users to steer problematic
   packets to the same CPU -- this would serialize the "allocate new
   conntrack entry/nat table evaluation/perform nat/confirm entry", so
   no race can occur.  Similar drawback to 3.

Another advantage of this patch compared to 1) and 2) is that there are
no changes to the hot path; things are handled in the udp tracker and
the clash resolution path.

Cc: rcu@vger.kernel.org
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Josh Triplett <josh@joshtriplett.org>
Cc: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-17 10:55:14 +01:00
Randy Dunlap
8955b4357d skbuff: remove stale bit mask comments
Remove stale comments since this flag is no longer a bit mask
but is a bit field.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:49:51 -08:00
Matthieu Baerts
357b41caf9 mptcp: select CRYPTO
Without this modification and if CRYPTO is not selected, we have this
warning:

  WARNING: unmet direct dependencies detected for CRYPTO_LIB_SHA256
    Depends on [n]: CRYPTO [=n]
    Selected by [y]:
    - MPTCP [=y] && NET [=y] && INET [=y]

MPTCP selects CRYPTO_LIB_SHA256 which seems to depend on CRYPTO. CRYPTO
is now selected to avoid this issue.

Even though the config system prints that warning, it looks like
sha256.c is compiled and linked even without CONFIG_CRYPTO. Since MPTCP
will end up needing CONFIG_CRYPTO anyway in future commits -- currently
in preparation for net-next -- we propose to add it now to fix the
warning.

The dependency in the config system comes from the fact that
CRYPTO_LIB_SHA256 is defined in "lib/crypto/Kconfig" which is sourced
from "crypto/Kconfig" only if CRYPTO is selected.

Fixes: 65492c5a6a (mptcp: move from sha1 (v0) to sha256 (v1))
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:37:16 -08:00
Matteo Croce
744676e777 openvswitch: add TTL decrement action
New action to decrement TTL instead of setting it to a fixed value.
This action will decrement the TTL and, in case of expired TTL, drop it
or execute an action passed via a nested attribute.
The default TTL expired action is to drop the packet.

Supports both IPv4 and IPv6 via the ttl and hop_limit fields, respectively.

Tested with a corresponding change in the userspace:

    # ovs-dpctl dump-flows
    in_port(2),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, actions:dec_ttl{ttl<=1 action:(drop)},1
    in_port(1),eth(),eth_type(0x0800), packets:0, bytes:0, used:never, actions:dec_ttl{ttl<=1 action:(drop)},2
    in_port(1),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, actions:2
    in_port(2),eth(),eth_type(0x0806), packets:0, bytes:0, used:never, actions:1

    # ping -c1 192.168.0.2 -t 42
    IP (tos 0x0, ttl 41, id 61647, offset 0, flags [DF], proto ICMP (1), length 84)
        192.168.0.1 > 192.168.0.2: ICMP echo request, id 386, seq 1, length 64
    # ping -c1 192.168.0.2 -t 120
    IP (tos 0x0, ttl 119, id 62070, offset 0, flags [DF], proto ICMP (1), length 84)
        192.168.0.1 > 192.168.0.2: ICMP echo request, id 388, seq 1, length 64
    # ping -c1 192.168.0.2 -t 1
    #

Co-developed-by: Bindiya Kurle <bindiyakurle@gmail.com>
Signed-off-by: Bindiya Kurle <bindiyakurle@gmail.com>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:34:44 -08:00
Taehee Yoo
7151affeef net: export netdev_next_lower_dev_rcu()
netdev_next_lower_dev_rcu() will be used to implement a function,
which is to walk all lower interfaces.
There are already functions that they walk their lower interface.
(netdev_walk_all_lower_dev_rcu, netdev_walk_all_lower_dev()).
But, there would be cases that couldn't be covered by given
netdev_walk_all_lower_dev_{rcu}() function.
So, some modules would want to implement own function,
which is to walk all lower interfaces.

In the next patch, netdev_next_lower_dev_rcu() will be used.
In addition, this patch removes two unused prototypes in netdevice.h.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:32:11 -08:00
Michal Kubecek
6699170376 ethtool: fix application of verbose no_mask bitset
A bitset without mask in a _SET request means we want exactly the bits in
the bitset to be set. This works correctly for compact format but when
verbose format is parsed, ethnl_update_bitset32_verbose() only sets the
bits present in the request bitset but does not clear the rest. This can
cause incorrect results like

  lion:~ # ethtool eth0 | grep Wake
          Supports Wake-on: pumbg
          Wake-on: g
  lion:~ # ethtool -s eth0 wol u
  lion:~ # ethtool eth0 | grep Wake
          Supports Wake-on: pumbg
          Wake-on: ug

when the second ethtool command issues request

ETHTOOL_MSG_WOL_SET
    ETHTOOL_A_WOL_HEADER
        ETHTOOL_A_HEADER_DEV_NAME = "eth0"
    ETHTOOL_A_WOL_MODES
        ETHTOOL_A_BITSET_NOMASK
        ETHTOOL_A_BITSET_BITS
            ETHTOOL_A_BITSET_BITS_BIT
                ETHTOOL_BITSET_BIT_INDEX = 1

Fix the logic by clearing the whole target bitmap before we start iterating
through the request bits.

Fixes: 10b518d4e6 ("ethtool: netlink bitset handling")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:28:51 -08:00
Arjun Roy
33946518d4 tcp-zerocopy: Return sk_err (if set) along with tcp receive zerocopy.
This patchset is intended to reduce the number of extra system calls
imposed by TCP receive zerocopy. For ping-pong RPC style workloads,
this patchset has demonstrated a system call reduction of about 30%
when coupled with userspace changes.

For applications using epoll, returning sk_err along with the result
of tcp receive zerocopy could remove the need to call
recvmsg()=-EAGAIN after a spurious wakeup.

Consider a multi-threaded application using epoll. A thread may awaken
with EPOLLIN but another thread may already be reading. The
spuriously-awoken thread does not necessarily know that another thread
'won'; rather, it may be possible that it was woken up due to the
presence of an error if there is no data. A zerocopy read receiving 0
bytes thus would need to be followed up by recvmsg to be sure.

Instead, we return sk_err directly with zerocopy, so the application
can avoid this extra system call.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:25:02 -08:00
Arjun Roy
c8856c0514 tcp-zerocopy: Return inq along with tcp receive zerocopy.
This patchset is intended to reduce the number of extra system calls
imposed by TCP receive zerocopy. For ping-pong RPC style workloads,
this patchset has demonstrated a system call reduction of about 30%
when coupled with userspace changes.

For applications using edge-triggered epoll, returning inq along with
the result of tcp receive zerocopy could remove the need to call
recvmsg()=-EAGAIN after a successful zerocopy. Generally speaking,
since normally we would need to perform a recvmsg() call for every
successful small RPC read via TCP receive zerocopy, returning inq can
reduce the number of system calls performed by approximately half.

Signed-off-by: Arjun Roy <arjunroy@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:25:02 -08:00
Mat Martineau
b6e4a1aeeb mptcp: Protect subflow socket options before connection completes
Userspace should not be able to directly manipulate subflow socket
options before a connection is established since it is not yet known if
it will be an MPTCP subflow or a TCP fallback subflow. TCP fallback
subflows can be more directly controlled by userspace because they are
regular TCP connections, while MPTCP subflow sockets need to be
configured for the specific needs of MPTCP. Use the same logic as
sendmsg/recvmsg to ensure that socket option calls are only passed
through to known TCP fallback subflows.

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:20:18 -08:00
Eric Dumazet
e08ad80551 net: add strict checks in netdev_name_node_alt_destroy()
netdev_name_node_alt_destroy() does a lookup over all
device names of a namespace.

We need to make sure the name belongs to the device
of interest, and that we do not destroy its primary
name, since we rely on it being not deleted :
dev->name_node would indeed point to freed memory.

syzbot report was the following :

BUG: KASAN: use-after-free in dev_net include/linux/netdevice.h:2206 [inline]
BUG: KASAN: use-after-free in mld_force_mld_version net/ipv6/mcast.c:1172 [inline]
BUG: KASAN: use-after-free in mld_in_v2_mode_only net/ipv6/mcast.c:1180 [inline]
BUG: KASAN: use-after-free in mld_in_v1_mode+0x203/0x230 net/ipv6/mcast.c:1190
Read of size 8 at addr ffff88809886c588 by task swapper/1/0

CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.6.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x197/0x210 lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
 __kasan_report.cold+0x1b/0x32 mm/kasan/report.c:506
 kasan_report+0x12/0x20 mm/kasan/common.c:641
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:135
 dev_net include/linux/netdevice.h:2206 [inline]
 mld_force_mld_version net/ipv6/mcast.c:1172 [inline]
 mld_in_v2_mode_only net/ipv6/mcast.c:1180 [inline]
 mld_in_v1_mode+0x203/0x230 net/ipv6/mcast.c:1190
 mld_send_initial_cr net/ipv6/mcast.c:2083 [inline]
 mld_dad_timer_expire+0x24/0x230 net/ipv6/mcast.c:2118
 call_timer_fn+0x1ac/0x780 kernel/time/timer.c:1404
 expire_timers kernel/time/timer.c:1449 [inline]
 __run_timers kernel/time/timer.c:1773 [inline]
 __run_timers kernel/time/timer.c:1740 [inline]
 run_timer_softirq+0x6c3/0x1790 kernel/time/timer.c:1786
 __do_softirq+0x262/0x98c kernel/softirq.c:292
 invoke_softirq kernel/softirq.c:373 [inline]
 irq_exit+0x19b/0x1e0 kernel/softirq.c:413
 exiting_irq arch/x86/include/asm/apic.h:546 [inline]
 smp_apic_timer_interrupt+0x1a3/0x610 arch/x86/kernel/apic/apic.c:1146
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:829
 </IRQ>
RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61
Code: 68 73 c5 f9 eb 8a cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 94 be 59 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 84 be 59 00 fb f4 <c3> cc 55 48 89 e5 41 57 41 56 41 55 41 54 53 e8 de 2a 74 f9 e8 09
RSP: 0018:ffffc90000d3fd68 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: 1ffffffff136761a RBX: ffff8880a99fc340 RCX: 0000000000000000
RDX: dffffc0000000000 RSI: 0000000000000006 RDI: ffff8880a99fcbd4
RBP: ffffc90000d3fd98 R08: ffff8880a99fc340 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
R13: ffffffff8aa5a1c0 R14: 0000000000000000 R15: 0000000000000001
 arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:686
 default_idle_call+0x84/0xb0 kernel/sched/idle.c:94
 cpuidle_idle_call kernel/sched/idle.c:154 [inline]
 do_idle+0x3c8/0x6e0 kernel/sched/idle.c:269
 cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:361
 start_secondary+0x2f4/0x410 arch/x86/kernel/smpboot.c:264
 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:242

Allocated by task 10229:
 save_stack+0x23/0x90 mm/kasan/common.c:72
 set_track mm/kasan/common.c:80 [inline]
 __kasan_kmalloc mm/kasan/common.c:515 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:488
 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:529
 __do_kmalloc_node mm/slab.c:3616 [inline]
 __kmalloc_node+0x4e/0x70 mm/slab.c:3623
 kmalloc_node include/linux/slab.h:578 [inline]
 kvmalloc_node+0x68/0x100 mm/util.c:574
 kvmalloc include/linux/mm.h:645 [inline]
 kvzalloc include/linux/mm.h:653 [inline]
 alloc_netdev_mqs+0x98/0xe40 net/core/dev.c:9797
 rtnl_create_link+0x22d/0xaf0 net/core/rtnetlink.c:3047
 __rtnl_newlink+0xf9f/0x1790 net/core/rtnetlink.c:3309
 rtnl_newlink+0x69/0xa0 net/core/rtnetlink.c:3377
 rtnetlink_rcv_msg+0x45e/0xaf0 net/core/rtnetlink.c:5438
 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
 rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5456
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:672
 __sys_sendto+0x262/0x380 net/socket.c:1998
 __do_compat_sys_socketcall net/compat.c:771 [inline]
 __se_compat_sys_socketcall net/compat.c:719 [inline]
 __ia32_compat_sys_socketcall+0x530/0x710 net/compat.c:719
 do_syscall_32_irqs_on arch/x86/entry/common.c:337 [inline]
 do_fast_syscall_32+0x27b/0xe16 arch/x86/entry/common.c:408
 entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139

Freed by task 10229:
 save_stack+0x23/0x90 mm/kasan/common.c:72
 set_track mm/kasan/common.c:80 [inline]
 kasan_set_free_info mm/kasan/common.c:337 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:476
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:485
 __cache_free mm/slab.c:3426 [inline]
 kfree+0x10a/0x2c0 mm/slab.c:3757
 __netdev_name_node_alt_destroy+0x1ff/0x2a0 net/core/dev.c:322
 netdev_name_node_alt_destroy+0x57/0x80 net/core/dev.c:334
 rtnl_alt_ifname net/core/rtnetlink.c:3518 [inline]
 rtnl_linkprop.isra.0+0x575/0x6f0 net/core/rtnetlink.c:3567
 rtnl_dellinkprop+0x46/0x60 net/core/rtnetlink.c:3588
 rtnetlink_rcv_msg+0x45e/0xaf0 net/core/rtnetlink.c:5438
 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
 rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5456
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:672
 ____sys_sendmsg+0x753/0x880 net/socket.c:2343
 ___sys_sendmsg+0x100/0x170 net/socket.c:2397
 __sys_sendmsg+0x105/0x1d0 net/socket.c:2430
 __compat_sys_sendmsg net/compat.c:642 [inline]
 __do_compat_sys_sendmsg net/compat.c:649 [inline]
 __se_compat_sys_sendmsg net/compat.c:646 [inline]
 __ia32_compat_sys_sendmsg+0x7a/0xb0 net/compat.c:646
 do_syscall_32_irqs_on arch/x86/entry/common.c:337 [inline]
 do_fast_syscall_32+0x27b/0xe16 arch/x86/entry/common.c:408
 entry_SYSENTER_compat+0x70/0x7f arch/x86/entry/entry_64_compat.S:139

The buggy address belongs to the object at ffff88809886c000
 which belongs to the cache kmalloc-4k of size 4096
The buggy address is located 1416 bytes inside of
 4096-byte region [ffff88809886c000, ffff88809886d000)
The buggy address belongs to the page:
page:ffffea0002621b00 refcount:1 mapcount:0 mapping:ffff8880aa402000 index:0x0 compound_mapcount: 0
flags: 0xfffe0000010200(slab|head)
raw: 00fffe0000010200 ffffea0002610d08 ffffea0002607608 ffff8880aa402000
raw: 0000000000000000 ffff88809886c000 0000000100000001 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88809886c480: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88809886c500: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88809886c580: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                      ^
 ffff88809886c600: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88809886c680: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 36fbf1e52b ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:04:33 -08:00
Sebastien Boeuf
df12eb6d6c net: virtio_vsock: Enhance connection semantics
Whenever the vsock backend on the host sends a packet through the RX
queue, it expects an answer on the TX queue. Unfortunately, there is one
case where the host side will hang waiting for the answer and might
effectively never recover if no timeout mechanism was implemented.

This issue happens when the guest side starts binding to the socket,
which insert a new bound socket into the list of already bound sockets.
At this time, we expect the guest to also start listening, which will
trigger the sk_state to move from TCP_CLOSE to TCP_LISTEN. The problem
occurs if the host side queued a RX packet and triggered an interrupt
right between the end of the binding process and the beginning of the
listening process. In this specific case, the function processing the
packet virtio_transport_recv_pkt() will find a bound socket, which means
it will hit the switch statement checking for the sk_state, but the
state won't be changed into TCP_LISTEN yet, which leads the code to pick
the default statement. This default statement will only free the buffer,
while it should also respond to the host side, by sending a packet on
its TX queue.

In order to simply fix this unfortunate chain of events, it is important
that in case the default statement is entered, and because at this stage
we know the host side is waiting for an answer, we must send back a
packet containing the operation VIRTIO_VSOCK_OP_RST.

One could say that a proper timeout mechanism on the host side will be
enough to avoid the backend to hang. But the point of this patch is to
ensure the normal use case will be provided with proper responsiveness
when it comes to establishing the connection.

Signed-off-by: Sebastien Boeuf <sebastien.boeuf@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:01:49 -08:00
David S. Miller
ddb535a6a0 A few big new things:
* 802.11 frame encapsulation offload support
  * more HE (802.11ax) support, including some for 6 GHz band
  * powersave in hwsim, for better testing
 
 Of course as usual there are various cleanups and small fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl5GggsACgkQB8qZga/f
 l8Qgdg//R42bSv94JYPcwZ5phgTgraCRZODWjBJq08n2T5m0EmEufgX79d9uEdgT
 u9npvn+ich5/VZhmuSbGrW9TT6/FPLAZyghV1fj79o971Qd7ky2Mp8G1fcTEbtDn
 IG2e9vauY9XDSb2O3wNj8dA8rAN/kLNmhsPqWxn2CgLPqjdbf+W15dvo4rnaL2gs
 ffGyE47dHuAFwCruyT8UPbw3iu4+tQhruN9eVg+UkU8rJGvEMqfrLK20zl1weIV9
 a7IuXdxacdsHO8Y+tl6GtvgOURQPpvf55+leLOUhcmHPJ3f/eAal6wmWRxDxs/qB
 IWSe8BC81cZZ5pYWk1A+0sXfJMlYjNsN0xw5SQRSrbgyb5saz8aLUIlHsOBM4iPH
 SwzCMN5A1GOPOUFsugzPwbiki9g6dh0/EC2NyXE4A26CAd967dVXTvTY5SMNgiB+
 bZaaUDaPQUm1jgDT5bLRhTipTHbekDkYzG/e+wNO+HKyStoEYM485MwY4MQCYzEh
 HKDmkAbFuCwEUeXXw1y8GybUknApCRru9FtY+oiN/+y/aESfB7HJfmDFFU/KYgPu
 HOuqJoNAxdMdycDCb24/cLjUiehzfM6sujwBxZOD5WHhAcXrBo5dGd6ibfurIrjj
 XI36/mwTiMtyyb0/5xM1AKvoic2j+a5YU3MB7KSc9TlaPa5j2NA=
 =CgmJ
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-net-next-2020-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
A few big new things:
 * 802.11 frame encapsulation offload support
 * more HE (802.11ax) support, including some for 6 GHz band
 * powersave in hwsim, for better testing

Of course as usual there are various cleanups and small fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:00:22 -08:00
chenqiwu
1e5946f5f7 net: x25: convert to list_for_each_entry_safe()
Use list_for_each_entry_safe() instead of list_for_each_safe()
to simplify the code.

Signed-off-by: chenqiwu <chenqiwu@xiaomi.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 18:59:42 -08:00
Eric Dumazet
44bfa9c5e5 net: rtnetlink: fix bugs in rtnl_alt_ifname()
Since IFLA_ALT_IFNAME is an NLA_STRING, we have no
guarantee it is nul terminated.

We should use nla_strdup() instead of kstrdup(), since this
helper will make sure not accessing out-of-bounds data.

BUG: KMSAN: uninit-value in strlen+0x5e/0xa0 lib/string.c:535
CPU: 1 PID: 19157 Comm: syz-executor.5 Not tainted 5.5.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x220 lib/dump_stack.c:118
 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118
 __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
 strlen+0x5e/0xa0 lib/string.c:535
 kstrdup+0x7f/0x1a0 mm/util.c:59
 rtnl_alt_ifname net/core/rtnetlink.c:3495 [inline]
 rtnl_linkprop+0x85d/0xc00 net/core/rtnetlink.c:3553
 rtnl_newlinkprop+0x9d/0xb0 net/core/rtnetlink.c:3568
 rtnetlink_rcv_msg+0x1153/0x1570 net/core/rtnetlink.c:5424
 netlink_rcv_skb+0x451/0x650 net/netlink/af_netlink.c:2477
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5442
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0xf9e/0x1100 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x1248/0x14d0 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:639 [inline]
 sock_sendmsg net/socket.c:659 [inline]
 ____sys_sendmsg+0x12b6/0x1350 net/socket.c:2330
 ___sys_sendmsg net/socket.c:2384 [inline]
 __sys_sendmsg+0x451/0x5f0 net/socket.c:2417
 __do_sys_sendmsg net/socket.c:2426 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2424
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2424
 do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45b3b9
Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ff1c7b1ac78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007ff1c7b1b6d4 RCX: 000000000045b3b9
RDX: 0000000000000000 RSI: 0000000020000040 RDI: 0000000000000003
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000000009cb R14: 00000000004cb3dd R15: 000000000075bf2c

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:144 [inline]
 kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:127
 kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:82
 slab_alloc_node mm/slub.c:2774 [inline]
 __kmalloc_node_track_caller+0xb40/0x1200 mm/slub.c:4382
 __kmalloc_reserve net/core/skbuff.c:141 [inline]
 __alloc_skb+0x2fd/0xac0 net/core/skbuff.c:209
 alloc_skb include/linux/skbuff.h:1049 [inline]
 netlink_alloc_large_skb net/netlink/af_netlink.c:1174 [inline]
 netlink_sendmsg+0x7d3/0x14d0 net/netlink/af_netlink.c:1892
 sock_sendmsg_nosec net/socket.c:639 [inline]
 sock_sendmsg net/socket.c:659 [inline]
 ____sys_sendmsg+0x12b6/0x1350 net/socket.c:2330
 ___sys_sendmsg net/socket.c:2384 [inline]
 __sys_sendmsg+0x451/0x5f0 net/socket.c:2417
 __do_sys_sendmsg net/socket.c:2426 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2424
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2424
 do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 36fbf1e52b ("net: rtnetlink: add linkprop commands to add and delete alternative ifnames")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 18:51:59 -08:00
Jethro Beekman
540e585a79 net: fib_rules: Correctly set table field when table number exceeds 8 bits
In 709772e6e0, RT_TABLE_COMPAT was added to
allow legacy software to deal with routing table numbers >= 256, but the
same change to FIB rule queries was overlooked.

Signed-off-by: Jethro Beekman <jethro@fortanix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 18:38:24 -08:00
Leon Romanovsky
0d4597c8c5 net/rds: Track user mapped pages through special API
Convert net/rds to use the newly introduces pin_user_pages() API,
which properly sets FOLL_PIN. Setting FOLL_PIN is now required for
code that requires tracking of pinned pages.

Note that this effectively changes the code's behavior: it now
ultimately calls set_page_dirty_lock(), instead of set_page_dirty().
This is probably more accurate.

As Christoph Hellwig put it, "set_page_dirty() is only safe if we are
dealing with a file backed page where we have reference on the inode it
hangs off." [1]

[1] https://lore.kernel.org/r/20190723153640.GB720@lst.de

Cc: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 18:37:09 -08:00
Benjamin Poirier
afecdb376b ipv6: Fix nlmsg_flags when splitting a multipath route
When splitting an RTA_MULTIPATH request into multiple routes and adding the
second and later components, we must not simply remove NLM_F_REPLACE but
instead replace it by NLM_F_CREATE. Otherwise, it may look like the netlink
message was malformed.

For example,
	ip route add 2001:db8::1/128 dev dummy0
	ip route change 2001:db8::1/128 nexthop via fe80::30:1 dev dummy0 \
		nexthop via fe80::30:2 dev dummy0
results in the following warnings:
[ 1035.057019] IPv6: RTM_NEWROUTE with no NLM_F_CREATE or NLM_F_REPLACE
[ 1035.057517] IPv6: NLM_F_CREATE should be set when creating new route

This patch makes the nlmsg sequence look equivalent for __ip6_ins_rt() to
what it would get if the multipath route had been added in multiple netlink
operations:
	ip route add 2001:db8::1/128 dev dummy0
	ip route change 2001:db8::1/128 nexthop via fe80::30:1 dev dummy0
	ip route append 2001:db8::1/128 nexthop via fe80::30:2 dev dummy0

Fixes: 2759647247 ("ipv6: fix ECMP route replacement")
Signed-off-by: Benjamin Poirier <bpoirier@cumulusnetworks.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 18:34:31 -08:00
Benjamin Poirier
e404b8c7cf ipv6: Fix route replacement with dev-only route
After commit 2759647247 ("ipv6: fix ECMP route replacement") it is no
longer possible to replace an ECMP-able route by a non ECMP-able route.
For example,
	ip route add 2001:db8::1/128 via fe80::1 dev dummy0
	ip route replace 2001:db8::1/128 dev dummy0
does not work as expected.

Tweak the replacement logic so that point 3 in the log of the above commit
becomes:
3. If the new route is not ECMP-able, and no matching non-ECMP-able route
exists, replace matching ECMP-able route (if any) or add the new route.

We can now summarize the entire replace semantics to:
When doing a replace, prefer replacing a matching route of the same
"ECMP-able-ness" as the replace argument. If there is no such candidate,
fallback to the first route found.

Fixes: 2759647247 ("ipv6: fix ECMP route replacement")
Signed-off-by: Benjamin Poirier <bpoirier@cumulusnetworks.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 18:34:31 -08:00
Simon Wunderlich
edddb36644 batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-02-16 16:00:02 +01:00
Linus Torvalds
829e694469 NFS Client Bugfixes for Linux 5.6-rc2
Stable Bugfixes:
 - Fix DMA scatter-gather list mapping imbalance
 
 Other Fixes:
 - Fix directory verifier races
 - Fix races between open and dentry revalidation
 - Fix revalidation of dentries with delegations
 - Fix "cachethis" setting for writes
 - Fix delegation and delegation cred pinning
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl5HEa4ACgkQ18tUv7Cl
 QOvwAA/9FAIwiE5UU+OrbpJ47IKFay2lfu6ywWVt0Q9i/tDeTRawa9aZDNKmKuYk
 rViVzRhIrmdkhpSVeTl3BdV/U9oXmNQ9ubFxWkQ0TI1j/tkTotT4dA0/kjeampZJ
 dZeW+NtRnFxYtfyV6fV/Bp/GB1RE9tAr1hTzHYEpNfnKxmAgkOeIXsVebuWKvUxz
 XCeRED30Ekt2n1JQqb1m3VAMpYmprF2i9TErAVR8ILt0dkkJD8nNlPogrZwMjzuD
 GtB7BSJ7rIDHQomuF3HVrz+SpMhM530mpsZiVGYCKjChJWaPm12+AZGPtv3UcpAf
 /M2VTa+z8MJDwirzjWogo3OvYAqZGI6KUG0o2t5dKmguQoi1RrDlPbFMLXlxow2e
 5gi/qKFbnBOt6iIV9FffMb50jpnShBE6hQS4/rb2inGmAKYgEXcL+ouBY6308soS
 fnWJSViTQfGbF9uVxGeOYl766zEqlNleclfjKBCNXMC3HOfIslIBpilMDwCoOHeB
 MqYSqmnwAH9ZrqjNmR3Qh2RlcOxN8CTUMhkUW2KLMAgnEAd9WINaYu4u2Y50Fe6j
 yDa6YEAmaNSGVYVofh8qA1R1Zi4tnKlg1/WFdy/SnEQKi8mQqXASGiVu2bXjhlF1
 22cHVXNIjp1PMijZKnJtimuxChygj9xxgzw4Kstd1WCLm2VirAc=
 =Ase9
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.6-2' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client bugfixes from Anna Schumaker:
 "The only stable fix this time is the DMA scatter-gather list bug fixed
  by Chuck.

  The rest fix up races and refcounting issues that have been found
  during testing.

  Stable fix:
   - fix DMA scatter-gather list mapping imbalance

  The rest:
   - fix directory verifier races
   - fix races between open and dentry revalidation
   - fix revalidation of dentries with delegations
   - fix "cachethis" setting for writes
   - fix delegation and delegation cred pinning"

* tag 'nfs-for-5.6-2' of git://git.linux-nfs.org/projects/anna/linux-nfs:
  NFSv4: Ensure the delegation cred is pinned when we call delegreturn
  NFSv4: Ensure the delegation is pinned in nfs_do_return_delegation()
  NFSv4.1 make cachethis=no for writes
  xprtrdma: Fix DMA scatter-gather list mapping imbalance
  NFSv4: Fix revalidation of dentries with delegations
  NFSv4: Fix races between open and dentry revalidation
  NFS: Fix up directory verifier races
2020-02-14 14:46:11 -08:00
Linus Torvalds
2019fc96af Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix interrupt name truncation in mv88e6xxx dsa driver, from Andrew
    Lunn.

 2) Process generic XDP even if SKB is cloned, from Toke Høiland-Jørgensen.

 3) Fix leak of kernel memory to userspace in smc, from Eric Dumazet.

 4) Add some missing netlink attribute validation to matchall and
    flower, from Davide Caratti.

 5) Send icmp responses properly when NAT has been applied to the frame
    before we get to the tunnel emitting the icmp, from Jason Donenfeld.

 6) Make sure there is enough SKB headroom when adding dsa tags for qca
    and ar9331. From Per Forlin.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
  netdevice.h: fix all kernel-doc and Sphinx warnings
  net: dsa: tag_ar9331: Make sure there is headroom for tag
  net: dsa: tag_qca: Make sure there is headroom for tag
  net, ip6_tunnel: enhance tunnel locate with link check
  net/smc: no peer ID in CLC decline for SMCD
  net/smc: transfer fasync_list in case of fallback
  net: hns3: fix a copying IPv6 address error in hclge_fd_get_flow_tuples()
  net: hns3: fix VF bandwidth does not take effect in some case
  net: hns3: add management table after IMP reset
  mac80211: fix wrong 160/80+80 MHz setting
  cfg80211: add missing policy for NL80211_ATTR_STATUS_CODE
  xfrm: interface: use icmp_ndo_send helper
  wireguard: device: use icmp_ndo_send helper
  sunvnet: use icmp_ndo_send helper
  gtp: use icmp_ndo_send helper
  icmp: introduce helper for nat'd source address in network device context
  net/sched: flower: add missing validation of TCA_FLOWER_FLAGS
  net/sched: matchall: add missing validation of TCA_MATCHALL_FLAGS
  net/flow_dissector: remove unexist field description
  page_pool: refill page when alloc.count of pool is zero
  ...
2020-02-14 12:40:38 -08:00
Luiz Augusto von Dentz
eab2404ba7 Bluetooth: Add BT_PHY socket option
This adds BT_PHY socket option (read-only) which can be used to read
the PHYs in use by the underline connection.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-14 19:15:30 +01:00
Per Forlin
ddc9abaf5d net: dsa: tag_ar9331: Make sure there is headroom for tag
Passing tag size to skb_cow_head will make sure
there is enough headroom for the tag data.
This change does not introduce any overhead in case there
is already available headroom for tag.

Signed-off-by: Per Forlin <perfn@axis.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-14 07:34:51 -08:00
Per Forlin
04fb91243a net: dsa: tag_qca: Make sure there is headroom for tag
Passing tag size to skb_cow_head will make sure
there is enough headroom for the tag data.
This change does not introduce any overhead in case there
is already available headroom for tag.

Signed-off-by: Per Forlin <perfn@axis.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-14 07:34:51 -08:00
William Dauchy
5fdcce211b net, ip6_tunnel: enhance tunnel locate with link check
With ipip, it is possible to create an extra interface explicitly
attached to a given physical interface:

  # ip link show tunl0
  4: tunl0@NONE: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0
  # ip link add tunl1 type ipip dev eth0
  # ip link show tunl1
  6: tunl1@eth0: <NOARP> mtu 1480 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0

But it is not possible with ip6tnl:

  # ip link show ip6tnl0
  5: ip6tnl0@NONE: <NOARP> mtu 1452 qdisc noop state DOWN mode DEFAULT group default qlen 1000
      link/tunnel6 :: brd ::
  # ip link add ip6tnl1 type ip6tnl dev eth0
  RTNETLINK answers: File exists

This patch aims to make it possible by adding link comparaison in both
tunnel locate and lookup functions; we also modify mtu calculation when
attached to an interface with a lower mtu.

This permits to make use of x-netns communication by moving the newly
created tunnel in a given netns.

Signed-off-by: William Dauchy <w.dauchy@criteo.com>
Reviewed-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-14 07:31:48 -08:00
David S. Miller
b32cb6fcf1 Just a few fixes:
* avoid running out of tracking space for frames that need
    to be reported to userspace by using more bits
  * fix beacon handling suppression by adding some relevant
    elements to the CRC calculation
  * fix quiet mode in action frames
  * fix crash in ethtool for virt_wifi and similar
  * add a missing policy entry
  * fix 160 & 80+80 bandwidth to take local capabilities into
    account
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl5GZQ8ACgkQB8qZga/f
 l8TQYQ//fcrDPndtwJWE5IaXJq7OmnM2NsYED6r1gVK7CWeVhqfwES6x9LVbhW/B
 QPwqFfL3i2II8XFVG/+BCUcEhp5FZQKGEDXaYcWXqQCuc6mU7ZKBuaEJmznMoez1
 6b96vdkVa78QnNK0QjSjAMua9KaBCIS2bPP/0E5o/8T9YsRKhuC4ItpLHYAGLsfM
 1QzU8ZBr/Ug4XIx28lDYOs8MHF5jm8dUgSpB4WgSQUKGTmgvyO/SefGtaJQ36R9B
 tAVR3Mmstm00ufsXT3OhiqXgN6bjUENvM+osVjD6E1eqjtPlrUYTCO8b/ZC1wean
 4G077eE186uao+9xQ8cLk7sL6ZJnlRwJ15RdBB5bsLZ8nG2FpclKCMxpcxc4PL/M
 oIXjEU8uNhSAVDmI3c2k/sxRW0hdQoblgl8PQMZ3k/6nvpwnIrG8yoeAwP4B457Y
 vgiRE212YkOaXhtwrg1EuOIhk1N3YRG8DpSdDrKlj+y0LOijTrHCnqoUwnne+RpX
 2Zqm5Sz3R2Nel9iIVs1JOxIGvyg+KIoM+WjmmrjMSqlrvMMisq8O3BuMfgd6jd+t
 NuZE9MkukBPMiaB4QDMDgBcaTMrLRItD+SSSQcMvkORTfGrpjUhpiQTVwOPXCYqZ
 793BU8kI+gwrz9UfloYxSv2+2H12a6LDQY6Q/mJVKAuhdw6SXsE=
 =L+Ik
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2020-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just a few fixes:
 * avoid running out of tracking space for frames that need
   to be reported to userspace by using more bits
 * fix beacon handling suppression by adding some relevant
   elements to the CRC calculation
 * fix quiet mode in action frames
 * fix crash in ethtool for virt_wifi and similar
 * add a missing policy entry
 * fix 160 & 80+80 bandwidth to take local capabilities into
   account
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-14 07:16:08 -08:00
Ursula Braun
369537c970 net/smc: no peer ID in CLC decline for SMCD
Just SMCR requires a CLC Peer ID, but not SMCD. The field should be
zero for SMCD.

Fixes: c758dfddc1 ("net/smc: add SMC-D support in CLC messages")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-14 07:09:36 -08:00
Ursula Braun
67f562e3e1 net/smc: transfer fasync_list in case of fallback
SMC does not work together with FASTOPEN. If sendmsg() is called with
flag MSG_FASTOPEN in SMC_INIT state, the SMC-socket switches to
fallback mode. To handle the previous ioctl FIOASYNC call correctly
in this case, it is necessary to transfer the socket wait queue
fasync_list to the internal TCP socket.

Reported-by: syzbot+4b1fe8105f8044a26162@syzkaller.appspotmail.com
Fixes: ee9dfbef02 ("net/smc: handle sockopts forcing fallback")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-14 07:09:36 -08:00
Howard Chung
cee5f20fec Bluetooth: secure bluetooth stack from bluedump attack
Attack scenario:
1. A Chromebook (let's call this device A) is paired to a legitimate
   Bluetooth classic device (e.g. a speaker) (let's call this device
   B).
2. A malicious device (let's call this device C) pretends to be the
   Bluetooth speaker by using the same BT address.
3. If device A is not currently connected to device B, device A will
   be ready to accept connection from device B in the background
   (technically, doing Page Scan).
4. Therefore, device C can initiate connection to device A
   (because device A is doing Page Scan) and device A will accept the
   connection because device A trusts device C's address which is the
   same as device B's address.
5. Device C won't be able to communicate at any high level Bluetooth
   profile with device A because device A enforces that device C is
   encrypted with their common Link Key, which device C doesn't have.
   But device C can initiate pairing with device A with just-works
   model without requiring user interaction (there is only pairing
   notification). After pairing, device A now trusts device C with a
   new different link key, common between device A and C.
6. From now on, device A trusts device C, so device C can at anytime
   connect to device A to do any kind of high-level hijacking, e.g.
   speaker hijack or mouse/keyboard hijack.

Since we don't know whether the repairing is legitimate or not,
leave the decision to user space if all the conditions below are met.
- the pairing is initialized by peer
- the authorization method is just-work
- host already had the link key to the peer

Signed-off-by: Howard Chung <howardchung@google.com>
Acked-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-14 16:01:00 +01:00
John Crispin
1f6e0baa70 mac80211: allow setting queue_len for drivers not using wake_tx_queue
Currently a mac80211 driver can only set the txq_limit when using
wake_tx_queue. Not all drivers use wake_tx_queue. This patch adds a new
element to wiphy allowing a driver to set a custom tx_queue_len and the
code that will apply it in case it is set. The current default is
1000 which is too low for ath11k when doing HE rates.

Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20200211122605.13002-1-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-14 09:59:35 +01:00
Ben Greear
db6d9e9e8b mac80211: Fix setting txpower to zero
With multiple VIFS ath10k, and probably others, tries to find the
minimum txpower for all vifs and uses that when setting txpower in
the firmware.

If a second vif is added and starts to scan, it's txpower is not
initialized yet and it set to zero.

ath10k had a patch to ignore zero values, but then it is impossible
to actually set txpower to zero.

So, instead initialize the txpower to INT_MIN in mac80211, and let
drivers know that means the power has not been set and so should
be ignored.

This should fix regression in:

commit 88407beb1b
Author: Ryan Hsu <ryanhsu@qca.qualcomm.com>
Date:   Tue Dec 13 14:55:19 2016 -0800

    ath10k: fix incorrect txpower set by P2P_DEVICE interface

Tested on ath10k 9984 with ath10k-ct firmware.

Signed-off-by: Ben Greear <greearb@candelatech.com>
Link: https://lore.kernel.org/r/20191217183057.24586-1-greearb@candelatech.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-14 09:57:00 +01:00
Shay Bar
33181ea7f5 mac80211: fix wrong 160/80+80 MHz setting
Before this patch, STA's would set new width of 160/80+80 MHz based on AP capability only.
This is wrong because STA may not support > 80MHz BW.
Fix is to verify STA has 160/80+80 MHz capability before increasing its width to > 80MHz.

The "support_80_80" and "support_160" setting is based on:
"Table 9-272 — Setting of the Supported Channel Width Set subfield and Extended NSS BW
Support subfield at a STA transmitting the VHT Capabilities Information field"
From "Draft P802.11REVmd_D3.0.pdf"

Signed-off-by: Aviad Brikman <aviad.brikman@celeno.com>
Signed-off-by: Shay Bar <shay.bar@celeno.com>
Link: https://lore.kernel.org/r/20200210130728.23674-1-shay.bar@celeno.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-14 09:54:12 +01:00
Sergey Matyukevich
ea75080110 cfg80211: add missing policy for NL80211_ATTR_STATUS_CODE
The nl80211_policy is missing for NL80211_ATTR_STATUS_CODE attribute.
As a result, for strictly validated commands, it's assumed to not be
supported.

Signed-off-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20200213131608.10541-2-sergey.matyukevich.os@quantenna.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-14 09:50:37 +01:00
Jason A. Donenfeld
45942ba890 xfrm: interface: use icmp_ndo_send helper
Because xfrmi is calling icmp from network device context, it should use
the ndo helper so that the rate limiting applies correctly.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-13 14:19:00 -08:00
Jason A. Donenfeld
0b41713b60 icmp: introduce helper for nat'd source address in network device context
This introduces a helper function to be called only by network drivers
that wraps calls to icmp[v6]_send in a conntrack transformation, in case
NAT has been used. We don't want to pollute the non-driver path, though,
so we introduce this as a helper to be called by places that actually
make use of this, as suggested by Florian.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-13 14:19:00 -08:00
Davide Caratti
e2debf0852 net/sched: flower: add missing validation of TCA_FLOWER_FLAGS
unlike other classifiers that can be offloaded (i.e. users can set flags
like 'skip_hw' and 'skip_sw'), 'cls_flower' doesn't validate the size of
netlink attribute 'TCA_FLOWER_FLAGS' provided by user: add a proper entry
to fl_policy.

Fixes: 5b33f48842 ("net/flower: Introduce hardware offload support")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-13 14:16:35 -08:00
Davide Caratti
1afa3cc90f net/sched: matchall: add missing validation of TCA_MATCHALL_FLAGS
unlike other classifiers that can be offloaded (i.e. users can set flags
like 'skip_hw' and 'skip_sw'), 'cls_matchall' doesn't validate the size
of netlink attribute 'TCA_MATCHALL_FLAGS' provided by user: add a proper
entry to mall_policy.

Fixes: b87f7936a9 ("net/sched: Add match-all classifier hw offloading.")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-13 14:16:35 -08:00
Li RongQing
304db6cb76 page_pool: refill page when alloc.count of pool is zero
"do {} while" in page_pool_refill_alloc_cache will always
refill page once whether refill is true or false, and whether
alloc.count of pool is less than PP_ALLOC_CACHE_REFILL or not
this is wrong, and will cause overflow of pool->alloc.cache

the caller of __page_pool_get_cached should provide guarantee
that pool->alloc.cache is safe to access, so in_serving_softirq
should be removed as suggested by Jesper Dangaard Brouer in
https://patchwork.ozlabs.org/patch/1233713/

so fix this issue by calling page_pool_refill_alloc_cache()
only when pool->alloc.count is zero

Fixes: 44768decb7 ("page_pool: handle page recycle for NUMA_NO_NODE condition")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-13 14:11:51 -08:00
Chuck Lever
ca1c671302 xprtrdma: Fix DMA scatter-gather list mapping imbalance
The @nents value that was passed to ib_dma_map_sg() has to be passed
to the matching ib_dma_unmap_sg() call. If ib_dma_map_sg() choses to
concatenate sg entries, it will return a different nents value than
it was passed.

The bug was exposed by recent changes to the AMD IOMMU driver, which
enabled sg entry concatenation.

Looking all the way back to commit 4143f34e01 ("xprtrdma: Port to
new memory registration API") and reviewing other kernel ULPs, it's
not clear that the frwr_map() logic was ever correct for this case.

Reported-by: Andre Tomt <andre@tomt.net>
Suggested-by: Robin Murphy <robin.murphy@arm.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Cc: stable@vger.kernel.org
Reviewed-by: Jason Gunthorpe <jgg@mellanox.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-13 15:35:33 -05:00
Xin Long
a1a7e3a36e xfrm: add the missing verify_sec_ctx_len check in xfrm_add_acquire
Without doing verify_sec_ctx_len() check in xfrm_add_acquire(), it may be
out-of-bounds to access uctx->ctx_str with uctx->ctx_len, as noticed by
syz:

  BUG: KASAN: slab-out-of-bounds in selinux_xfrm_alloc_user+0x237/0x430
  Read of size 768 at addr ffff8880123be9b4 by task syz-executor.1/11650

  Call Trace:
   dump_stack+0xe8/0x16e
   print_address_description.cold.3+0x9/0x23b
   kasan_report.cold.4+0x64/0x95
   memcpy+0x1f/0x50
   selinux_xfrm_alloc_user+0x237/0x430
   security_xfrm_policy_alloc+0x5c/0xb0
   xfrm_policy_construct+0x2b1/0x650
   xfrm_add_acquire+0x21d/0xa10
   xfrm_user_rcv_msg+0x431/0x6f0
   netlink_rcv_skb+0x15a/0x410
   xfrm_netlink_rcv+0x6d/0x90
   netlink_unicast+0x50e/0x6a0
   netlink_sendmsg+0x8ae/0xd40
   sock_sendmsg+0x133/0x170
   ___sys_sendmsg+0x834/0x9a0
   __sys_sendmsg+0x100/0x1e0
   do_syscall_64+0xe5/0x660
   entry_SYSCALL_64_after_hwframe+0x6a/0xdf

So fix it by adding the missing verify_sec_ctx_len check there.

Fixes: 980ebd2579 ("[IPSEC]: Sync series - acquire insert")
Reported-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-02-12 11:06:32 +01:00
Xin Long
171d449a02 xfrm: fix uctx len check in verify_sec_ctx_len
It's not sufficient to do 'uctx->len != (sizeof(struct xfrm_user_sec_ctx) +
uctx->ctx_len)' check only, as uctx->len may be greater than nla_len(rt),
in which case it will cause slab-out-of-bounds when accessing uctx->ctx_str
later.

This patch is to fix it by return -EINVAL when uctx->len > nla_len(rt).

Fixes: df71837d50 ("[LSM-IPSec]: Security association restriction.")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-02-12 11:06:32 +01:00
Eric Dumazet
457fed775c net/smc: fix leak of kernel memory to user space
As nlmsg_put() does not clear the memory that is reserved,
it this the caller responsability to make sure all of this
memory will be written, in order to not reveal prior content.

While we are at it, we can provide the socket cookie even
if clsock is not set.

syzbot reported :

BUG: KMSAN: uninit-value in __arch_swab32 arch/x86/include/uapi/asm/swab.h:10 [inline]
BUG: KMSAN: uninit-value in __fswab32 include/uapi/linux/swab.h:59 [inline]
BUG: KMSAN: uninit-value in __swab32p include/uapi/linux/swab.h:179 [inline]
BUG: KMSAN: uninit-value in __be32_to_cpup include/uapi/linux/byteorder/little_endian.h:82 [inline]
BUG: KMSAN: uninit-value in get_unaligned_be32 include/linux/unaligned/access_ok.h:30 [inline]
BUG: KMSAN: uninit-value in ____bpf_skb_load_helper_32 net/core/filter.c:240 [inline]
BUG: KMSAN: uninit-value in ____bpf_skb_load_helper_32_no_cache net/core/filter.c:255 [inline]
BUG: KMSAN: uninit-value in bpf_skb_load_helper_32_no_cache+0x14a/0x390 net/core/filter.c:252
CPU: 1 PID: 5262 Comm: syz-executor.5 Not tainted 5.5.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x220 lib/dump_stack.c:118
 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118
 __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
 __arch_swab32 arch/x86/include/uapi/asm/swab.h:10 [inline]
 __fswab32 include/uapi/linux/swab.h:59 [inline]
 __swab32p include/uapi/linux/swab.h:179 [inline]
 __be32_to_cpup include/uapi/linux/byteorder/little_endian.h:82 [inline]
 get_unaligned_be32 include/linux/unaligned/access_ok.h:30 [inline]
 ____bpf_skb_load_helper_32 net/core/filter.c:240 [inline]
 ____bpf_skb_load_helper_32_no_cache net/core/filter.c:255 [inline]
 bpf_skb_load_helper_32_no_cache+0x14a/0x390 net/core/filter.c:252

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:144 [inline]
 kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:127
 kmsan_kmalloc_large+0x73/0xc0 mm/kmsan/kmsan_hooks.c:128
 kmalloc_large_node_hook mm/slub.c:1406 [inline]
 kmalloc_large_node+0x282/0x2c0 mm/slub.c:3841
 __kmalloc_node_track_caller+0x44b/0x1200 mm/slub.c:4368
 __kmalloc_reserve net/core/skbuff.c:141 [inline]
 __alloc_skb+0x2fd/0xac0 net/core/skbuff.c:209
 alloc_skb include/linux/skbuff.h:1049 [inline]
 netlink_dump+0x44b/0x1ab0 net/netlink/af_netlink.c:2224
 __netlink_dump_start+0xbb2/0xcf0 net/netlink/af_netlink.c:2352
 netlink_dump_start include/linux/netlink.h:233 [inline]
 smc_diag_handler_dump+0x2ba/0x300 net/smc/smc_diag.c:242
 sock_diag_rcv_msg+0x211/0x610 net/core/sock_diag.c:256
 netlink_rcv_skb+0x451/0x650 net/netlink/af_netlink.c:2477
 sock_diag_rcv+0x63/0x80 net/core/sock_diag.c:275
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0xf9e/0x1100 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x1248/0x14d0 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:639 [inline]
 sock_sendmsg net/socket.c:659 [inline]
 kernel_sendmsg+0x433/0x440 net/socket.c:679
 sock_no_sendpage+0x235/0x300 net/core/sock.c:2740
 kernel_sendpage net/socket.c:3776 [inline]
 sock_sendpage+0x1e1/0x2c0 net/socket.c:937
 pipe_to_sendpage+0x38c/0x4c0 fs/splice.c:458
 splice_from_pipe_feed fs/splice.c:512 [inline]
 __splice_from_pipe+0x539/0xed0 fs/splice.c:636
 splice_from_pipe fs/splice.c:671 [inline]
 generic_splice_sendpage+0x1d5/0x2d0 fs/splice.c:844
 do_splice_from fs/splice.c:863 [inline]
 do_splice fs/splice.c:1170 [inline]
 __do_sys_splice fs/splice.c:1447 [inline]
 __se_sys_splice+0x2380/0x3350 fs/splice.c:1427
 __x64_sys_splice+0x6e/0x90 fs/splice.c:1427
 do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: f16a7dd5cf ("smc: netlink interface for SMC sockets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ursula Braun <ubraun@linux.vnet.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-11 17:04:42 -08:00
Toke Høiland-Jørgensen
ad1e03b2b3 core: Don't skip generic XDP program execution for cloned SKBs
The current generic XDP handler skips execution of XDP programs entirely if
an SKB is marked as cloned. This leads to some surprising behaviour, as
packets can end up being cloned in various ways, which will make an XDP
program not see all the traffic on an interface.

This was discovered by a simple test case where an XDP program that always
returns XDP_DROP is installed on a veth device. When combining this with
the Scapy packet sniffer (which uses an AF_PACKET) socket on the sending
side, SKBs reliably end up in the cloned state, causing them to be passed
through to the receiving interface instead of being dropped. A minimal
reproducer script for this is included below.

This patch fixed the issue by simply triggering the existing linearisation
code for cloned SKBs instead of skipping the XDP program execution. This
behaviour is in line with the behaviour of the native XDP implementation
for the veth driver, which will reallocate and copy the SKB data if the SKB
is marked as shared.

Reproducer Python script (requires BCC and Scapy):

from scapy.all import TCP, IP, Ether, sendp, sniff, AsyncSniffer, Raw, UDP
from bcc import BPF
import time, sys, subprocess, shlex

SKB_MODE = (1 << 1)
DRV_MODE = (1 << 2)
PYTHON=sys.executable

def client():
    time.sleep(2)
    # Sniffing on the sender causes skb_cloned() to be set
    s = AsyncSniffer()
    s.start()

    for p in range(10):
        sendp(Ether(dst="aa:aa:aa:aa:aa:aa", src="cc:cc:cc:cc:cc:cc")/IP()/UDP()/Raw("Test"),
              verbose=False)
        time.sleep(0.1)

    s.stop()
    return 0

def server(mode):
    prog = BPF(text="int dummy_drop(struct xdp_md *ctx) {return XDP_DROP;}")
    func = prog.load_func("dummy_drop", BPF.XDP)
    prog.attach_xdp("a_to_b", func, mode)

    time.sleep(1)

    s = sniff(iface="a_to_b", count=10, timeout=15)
    if len(s):
        print(f"Got {len(s)} packets - should have gotten 0")
        return 1
    else:
        print("Got no packets - as expected")
        return 0

if len(sys.argv) < 2:
    print(f"Usage: {sys.argv[0]} <skb|drv>")
    sys.exit(1)

if sys.argv[1] == "client":
    sys.exit(client())
elif sys.argv[1] == "server":
    mode = SKB_MODE if sys.argv[2] == 'skb' else DRV_MODE
    sys.exit(server(mode))
else:
    try:
        mode = sys.argv[1]
        if mode not in ('skb', 'drv'):
            print(f"Usage: {sys.argv[0]} <skb|drv>")
            sys.exit(1)
        print(f"Running in {mode} mode")

        for cmd in [
                'ip netns add netns_a',
                'ip netns add netns_b',
                'ip -n netns_a link add a_to_b type veth peer name b_to_a netns netns_b',
                # Disable ipv6 to make sure there's no address autoconf traffic
                'ip netns exec netns_a sysctl -qw net.ipv6.conf.a_to_b.disable_ipv6=1',
                'ip netns exec netns_b sysctl -qw net.ipv6.conf.b_to_a.disable_ipv6=1',
                'ip -n netns_a link set dev a_to_b address aa:aa:aa:aa:aa:aa',
                'ip -n netns_b link set dev b_to_a address cc:cc:cc:cc:cc:cc',
                'ip -n netns_a link set dev a_to_b up',
                'ip -n netns_b link set dev b_to_a up']:
            subprocess.check_call(shlex.split(cmd))

        server = subprocess.Popen(shlex.split(f"ip netns exec netns_a {PYTHON} {sys.argv[0]} server {mode}"))
        client = subprocess.Popen(shlex.split(f"ip netns exec netns_b {PYTHON} {sys.argv[0]} client"))

        client.wait()
        server.wait()
        sys.exit(server.returncode)

    finally:
        subprocess.run(shlex.split("ip netns delete netns_a"))
        subprocess.run(shlex.split("ip netns delete netns_b"))

Fixes: d445516966 ("net: xdp: support xdp generic on virtual devices")
Reported-by: Stepan Horacek <shoracek@redhat.com>
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-11 17:01:29 -08:00
Magnus Karlsson
30744a6862 xsk: Publish global consumer pointers when NAPI is finished
The commit 4b638f13ba ("xsk: Eliminate the RX batch size")
introduced a much more lazy way of updating the global consumer
pointers from the kernel side, by only doing so when running out of
entries in the fill or Tx rings (the rings consumed by the
kernel). This can result in a deadlock with the user application if
the kernel requires more than one entry to proceed and the application
cannot put these entries in the fill ring because the kernel has not
updated the global consumer pointer since the ring is not empty.

Fix this by publishing the local kernel side consumer pointer whenever
we have completed Rx or Tx processing in the kernel. This way, user
space will have an up-to-date view of the consumer pointers whenever it
gets to execute in the one core case (application and driver on the
same core), or after a certain number of packets have been processed
in the two core case (application and driver on different cores).

A side effect of this patch is that the one core case gets better
performance, but the two core case gets worse. The reason that the one
core case improves is that updating the global consumer pointer is
relatively cheap since the application by definition is not running
when the kernel is (they are on the same core) and it is beneficial
for the application, once it gets to run, to have pointers that are
as up to date as possible since it then can operate on more packets
and buffers. In the two core case, the most important performance
aspect is to minimize the number of accesses to the global pointers
since they are shared between two cores and bounces between the caches
of those cores. This patch results in more updates to global state,
which means lower performance in the two core case.

Fixes: 4b638f13ba ("xsk: Eliminate the RX batch size")
Reported-by: Ryan Goodfellow <rgoodfel@isi.edu>
Reported-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Link: https://lore.kernel.org/bpf/1581348432-6747-1-git-send-email-magnus.karlsson@intel.com
2020-02-11 15:51:11 +01:00
Florian Westphal
bb89abe52b netfilter: conntrack: split resolve_clash function
Followup patch will need a helper function with the 'clashing entries
refer to the identical tuple in both directions' resolution logic.

This patch will add another resolve_clash helper where loser_ct must
not be added to the dying list because it will be inserted into the
table.

Therefore this also moves the stat counters and dying-list insertion
of the losing ct.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-11 11:45:01 +01:00
Florian Westphal
b1b32552c1 netfilter: conntrack: place confirm-bit setting in a helper
... so it can be re-used from clash resolution in followup patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-11 11:45:01 +01:00
Florian Westphal
3d1e0b406d netfilter: conntrack: remove two args from resolve_clash
ctinfo is whats taken from the skb, i.e.
ct = nf_ct_get(skb, &ctinfo).

We do not pass 'ct' and instead re-fetch it from the skb.
Just do the same for both netns and ctinfo.

Also add a comment on what clash resolution is supposed to do.
While at it, one indent level can be removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-11 11:45:01 +01:00
Tuong Lien
5391a87751 tipc: fix successful connect() but timed out
In commit 9546a0b7ce ("tipc: fix wrong connect() return code"), we
fixed the issue with the 'connect()' that returns zero even though the
connecting has failed by waiting for the connection to be 'ESTABLISHED'
really. However, the approach has one drawback in conjunction with our
'lightweight' connection setup mechanism that the following scenario
can happen:

          (server)                        (client)

   +- accept()|                      |             wait_for_conn()
   |          |                      |connect() -------+
   |          |<-------[SYN]---------|                 > sleeping
   |          |                      *CONNECTING       |
   |--------->*ESTABLISHED           |                 |
              |--------[ACK]-------->*ESTABLISHED      > wakeup()
        send()|--------[DATA]------->|\                > wakeup()
        send()|--------[DATA]------->| |               > wakeup()
          .   .          .           . |-> recvq       .
          .   .          .           . |               .
        send()|--------[DATA]------->|/                > wakeup()
       close()|--------[FIN]-------->*DISCONNECTING    |
              *DISCONNECTING         |                 |
              |                      ~~~~~~~~~~~~~~~~~~> schedule()
                                                       | wait again
                                                       .
                                                       .
                                                       | ETIMEDOUT

Upon the receipt of the server 'ACK', the client becomes 'ESTABLISHED'
and the 'wait_for_conn()' process is woken up but not run. Meanwhile,
the server starts to send a number of data following by a 'close()'
shortly without waiting any response from the client, which then forces
the client socket to be 'DISCONNECTING' immediately. When the wait
process is switched to be running, it continues to wait until the timer
expires because of the unexpected socket state. The client 'connect()'
will finally get ‘-ETIMEDOUT’ and force to release the socket whereas
there remains the messages in its receive queue.

Obviously the issue would not happen if the server had some delay prior
to its 'close()' (or the number of 'DATA' messages is large enough),
but any kind of delay would make the connection setup/shutdown "heavy".
We solve this by simply allowing the 'connect()' returns zero in this
particular case. The socket is already 'DISCONNECTING', so any further
write will get '-EPIPE' but the socket is still able to read the
messages existing in its receive queue.

Note: This solution doesn't break the previous one as it deals with a
different situation that the socket state is 'DISCONNECTING' but has no
error (i.e. sk->sk_err = 0).

Fixes: 9546a0b7ce ("tipc: fix wrong connect() return code")
Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-10 10:23:00 +01:00
Chen Wandun
5609e2bbef mptcp: make the symbol 'mptcp_sk_clone_lock' static
Fix the following sparse warning:
net/mptcp/protocol.c:646:13: warning: symbol 'mptcp_sk_clone_lock' was not declared. Should it be static?

Fixes: b0519de8b3 ("mptcp: fix use-after-free for ipv6")
Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-10 10:23:00 +01:00
Chen Wandun
2437fd7baf tipc: make three functions static
Fix the following sparse warning:

net/tipc/node.c:281:6: warning: symbol 'tipc_node_free' was not declared. Should it be static?
net/tipc/node.c:2801:5: warning: symbol '__tipc_nl_node_set_key' was not declared. Should it be static?
net/tipc/node.c:2878:5: warning: symbol '__tipc_nl_node_flush_key' was not declared. Should it be static?

Fixes: fc1b6d6de2 ("tipc: introduce TIPC encryption & authentication")
Fixes: e1f32190cf ("tipc: add support for AEAD key setting via netlink")

Signed-off-by: Chen Wandun <chenwandun@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-10 10:23:00 +01:00
Linus Torvalds
89a47dd1af Kbuild updates for v5.6 (2nd)
- fix randconfig to generate a sane .config
 
  - rename hostprogs-y / always to hostprogs / always-y, which are
    more natual syntax.
 
  - optimize scripts/kallsyms
 
  - fix yes2modconfig and mod2yesconfig
 
  - make multiple directory targets ('make foo/ bar/') work
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl47NfMVHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGRGwP/3AHO8P0wGEeFKs3ziSMjs2W7/Pj
 lN08Kuxm0u3LnyEEcHVUveoi+xBYqvrw0RsGgYf5S8q0Mpep7MPqbfkDUxV/0Zkj
 QP2CsvOTbjdBjH7q3ojkwLcDl0Pxu9mg3eZMRXZ2WQeNXuMRw6Bicoh7ElvB1Bv/
 HC+j30i2Me3cf/riQGSAsstvlXyIR8RaerR8PfRGESTysiiN76+JcHTatJHhOJL9
 O6XKkzo8/CXMYKKVF4Ae4NP+WFg6E96/pAPx0Rf47RbPX9UG35L9rkzTDnk70Ms6
 OhKiu3hXsRX7mkqApuoTqjge4+iiQcKZxYmMXU1vGlIRzjwg19/4YFP6pDSCcnIu
 kKb8KN4o4N41N7MFS3OLZWwISA8Vw6RbtwDZ3AghDWb7EHb9oNW42mGfcAPr1+wZ
 /KH6RHTzaz+5q2MgyMY1NhADFrhIT9CvDM+UJECgbokblnw7PHAnPmbsuVak9ZOH
 u9ojO1HpTTuIYO6N6v4K5zQBZF1N+RvkmBnhHd8j6SksppsCoC/G62QxgXhF2YK3
 FQMpATCpuyengLxWAmPEjsyyPOlrrdu9UxqNsXVy5ol40+7zpxuHwKcQKCa9urJR
 rcpbIwLaBcLhHU4BmvBxUk5aZxxGV2F0O0gXTOAbT2xhd6BipZSMhUmN49SErhQm
 NC/coUmQX7McxMXh
 =sv4U
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull more Kbuild updates from Masahiro Yamada:

 - fix randconfig to generate a sane .config

 - rename hostprogs-y / always to hostprogs / always-y, which are more
   natual syntax.

 - optimize scripts/kallsyms

 - fix yes2modconfig and mod2yesconfig

 - make multiple directory targets ('make foo/ bar/') work

* tag 'kbuild-v5.6-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild:
  kbuild: make multiple directory targets work
  kconfig: Invalidate all symbols after changing to y or m.
  kallsyms: fix type of kallsyms_token_table[]
  scripts/kallsyms: change table to store (strcut sym_entry *)
  scripts/kallsyms: rename local variables in read_symbol()
  kbuild: rename hostprogs-y/always to hostprogs/always-y
  kbuild: fix the document to use extra-y for vmlinux.lds
  kconfig: fix broken dependency in randconfig-generated .config
2020-02-09 16:05:50 -08:00
Linus Torvalds
291abfea47 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Unbalanced locking in mwifiex_process_country_ie, from Brian Norris.

 2) Fix thermal zone registration in iwlwifi, from Andrei
    Otcheretianski.

 3) Fix double free_irq in sgi ioc3 eth, from Thomas Bogendoerfer.

 4) Use after free in mptcp, from Florian Westphal.

 5) Use after free in wireguard's root_remove_peer_lists, from Eric
    Dumazet.

 6) Properly access packets heads in bonding alb code, from Eric
    Dumazet.

 7) Fix data race in skb_queue_len(), from Qian Cai.

 8) Fix regression in r8169 on some chips, from Heiner Kallweit.

 9) Fix XDP program ref counting in hv_netvsc, from Haiyang Zhang.

10) Certain kinds of set link netlink operations can cause a NULL deref
    in the ipv6 addrconf code. Fix from Eric Dumazet.

11) Don't cancel uninitialized work queue in drop monitor, from Ido
    Schimmel.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (84 commits)
  net: thunderx: use proper interface type for RGMII
  mt76: mt7615: fix max_nss in mt7615_eeprom_parse_hw_cap
  bpf: Improve bucket_log calculation logic
  selftests/bpf: Test freeing sockmap/sockhash with a socket in it
  bpf, sockhash: Synchronize_rcu before free'ing map
  bpf, sockmap: Don't sleep while holding RCU lock on tear-down
  bpftool: Don't crash on missing xlated program instructions
  bpf, sockmap: Check update requirements after locking
  drop_monitor: Do not cancel uninitialized work item
  mlxsw: spectrum_dpipe: Add missing error path
  mlxsw: core: Add validation of hardware device types for MGPIR register
  mlxsw: spectrum_router: Clear offload indication from IPv6 nexthops on abort
  selftests: mlxsw: Add test cases for local table route replacement
  mlxsw: spectrum_router: Prevent incorrect replacement of local table routes
  net: dsa: microchip: enable module autoprobe
  ipv6/addrconf: fix potential NULL deref in inet6_set_link_af()
  dpaa_eth: support all modes with rate adapting PHYs
  net: stmmac: update pci platform data to use phy_interface
  net: stmmac: xgmac: fix missing IFF_MULTICAST checki in dwxgmac2_set_filter
  net: stmmac: fix missing IFF_MULTICAST check in dwmac4_set_filter
  ...
2020-02-08 17:15:08 -08:00
Linus Torvalds
c9d35ee049 Merge branch 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs file system parameter updates from Al Viro:
 "Saner fs_parser.c guts and data structures. The system-wide registry
  of syntax types (string/enum/int32/oct32/.../etc.) is gone and so is
  the horror switch() in fs_parse() that would have to grow another case
  every time something got added to that system-wide registry.

  New syntax types can be added by filesystems easily now, and their
  namespace is that of functions - not of system-wide enum members. IOW,
  they can be shared or kept private and if some turn out to be widely
  useful, we can make them common library helpers, etc., without having
  to do anything whatsoever to fs_parse() itself.

  And we already get that kind of requests - the thing that finally
  pushed me into doing that was "oh, and let's add one for timeouts -
  things like 15s or 2h". If some filesystem really wants that, let them
  do it. Without somebody having to play gatekeeper for the variants
  blessed by direct support in fs_parse(), TYVM.

  Quite a bit of boilerplate is gone. And IMO the data structures make a
  lot more sense now. -200LoC, while we are at it"

* 'merge.nfs-fs_parse.1' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (25 commits)
  tmpfs: switch to use of invalfc()
  cgroup1: switch to use of errorfc() et.al.
  procfs: switch to use of invalfc()
  hugetlbfs: switch to use of invalfc()
  cramfs: switch to use of errofc() et.al.
  gfs2: switch to use of errorfc() et.al.
  fuse: switch to use errorfc() et.al.
  ceph: use errorfc() and friends instead of spelling the prefix out
  prefix-handling analogues of errorf() and friends
  turn fs_param_is_... into functions
  fs_parse: handle optional arguments sanely
  fs_parse: fold fs_parameter_desc/fs_parameter_spec
  fs_parser: remove fs_parameter_description name field
  add prefix to fs_context->log
  ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
  new primitive: __fs_parse()
  switch rbd and libceph to p_log-based primitives
  struct p_log, variants of warnf() et.al. taking that one instead
  teach logfc() to handle prefices, give it saner calling conventions
  get rid of cg_invalf()
  ...
2020-02-08 13:26:41 -08:00
David S. Miller
2696e1146d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-02-07

The following pull-request contains BPF updates for your *net* tree.

We've added 15 non-merge commits during the last 10 day(s) which contain
a total of 12 files changed, 114 insertions(+), 31 deletions(-).

The main changes are:

1) Various BPF sockmap fixes related to RCU handling in the map's tear-
   down code, from Jakub Sitnicki.

2) Fix macro state explosion in BPF sk_storage map when calculating its
   bucket_log on allocation, from Martin KaFai Lau.

3) Fix potential BPF sockmap update race by rechecking socket's established
   state under lock, from Lorenz Bauer.

4) Fix crash in bpftool on missing xlated instructions when kptr_restrict
   sysctl is set, from Toke Høiland-Jørgensen.

5) Fix i40e's XSK wakeup code to return proper error in busy state and
   various misc fixes in xdpsock BPF sample code, from Maciej Fijalkowski.

6) Fix the way modifiers are skipped in BTF in the verifier while walking
   pointers to avoid program rejection, from Alexei Starovoitov.

7) Fix Makefile for runqslower BPF tool to i) rebuild on libbpf changes and
   ii) to fix undefined reference linker errors for older gcc version due to
   order of passed gcc parameters, from Yulia Kartseva and Song Liu.

8) Fix a trampoline_count BPF kselftest warning about missing braces around
   initializer, from Andrii Nakryiko.

9) Fix up redundant "HAVE" prefix from large INSN limit kernel probe in
   bpftool, from Michal Rostecki.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-08 15:01:03 +01:00
Linus Torvalds
08dffcc7d9 Highlights:
- Server-to-server copy code from Olga.  To use it, client and
 	  both servers must have support, the target server must be able
 	  to access the source server over NFSv4.2, and the target
 	  server must have the inter_copy_offload_enable module
 	  parameter set.
 	- Improvements and bugfixes for the new filehandle cache,
 	  especially in the container case, from Trond
 	- Also from Trond, better reporting of write errors.
 	- Y2038 work from Arnd.
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCAAzFiEEYtFWavXG9hZotryuJ5vNeUKO4b4FAl490mAVHGJmaWVsZHNA
 ZmllbGRzZXMub3JnAAoJECebzXlCjuG+HkkP/33CsYXp0wvfNrfxCY3zHRxHpfw+
 T9Ownxxw0RAJc/dRluC/2PIKJ20uVqtLrplU63bMBqJn84WF7OALq9twZ79a3fVF
 mvdmnZbNq9B3ncKJlT7akkEelyJCRap7NgG/oTyubE8MlPl6gKpD8c+G7XdW/uN+
 r0fprQz4rW4CYCBGSHq7HusEKqY4Gw+gbyAfJ6A79TMjF1ei51PG+9c8rkIsI5CO
 1TQ3gY1gSJmGf2DoF86Q9WTVb+DvRTEs+t7QkxY/Vlo+QXY8CZyu+qSxN7i/F20m
 gv2GrSpQMS9DEK/ZaG6cxaH+sM18Db4KLvcl3koL6lONHDR2OafSdKLyy0I60jhO
 WfDSHhfDCrAdASTjNlTPrjBrdK3gafiaJVL9vy901ZJjPaNb3EH0nMQ5bEvOBECq
 TCqPcQUcbku+qUVIcFwzSK1hXQFQHNh8WIuqXvNviZIzFDoipwsHVnQK02Owj89L
 R2tbZue1O8voacg/9xw3tWAT7pI+SaBb0EvJuqRxBshiZEU8kKKtMchOwSECRDcu
 k4lcqC5EFW7e4EzGlr6Wx8sI5lwCapva8ccjmPXX+R/vyM81oxWGB84GqWjjwubH
 3Fcok23F9rW2IQJkqgPlNj/9hAjTn2+vM13UbfMlnchGNsQ2gbkc5CDGC/J6Wwpo
 tHVristV9Gu5bJym
 =FxLY
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.6' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Highlights:

   - Server-to-server copy code from Olga.

     To use it, client and both servers must have support, the target
     server must be able to access the source server over NFSv4.2, and
     the target server must have the inter_copy_offload_enable module
     parameter set.

   - Improvements and bugfixes for the new filehandle cache, especially
     in the container case, from Trond

   - Also from Trond, better reporting of write errors.

   - Y2038 work from Arnd"

* tag 'nfsd-5.6' of git://linux-nfs.org/~bfields/linux: (55 commits)
  sunrpc: expiry_time should be seconds not timeval
  nfsd: make nfsd_filecache_wq variable static
  nfsd4: fix double free in nfsd4_do_async_copy()
  nfsd: convert file cache to use over/underflow safe refcount
  nfsd: Define the file access mode enum for tracing
  nfsd: Fix a perf warning
  nfsd: Ensure sampling of the write verifier is atomic with the write
  nfsd: Ensure sampling of the commit verifier is atomic with the commit
  sunrpc: clean up cache entry add/remove from hashtable
  sunrpc: Fix potential leaks in sunrpc_cache_unhash()
  nfsd: Ensure exclusion between CLONE and WRITE errors
  nfsd: Pass the nfsd_file as arguments to nfsd4_clone_file_range()
  nfsd: Update the boot verifier on stable writes too.
  nfsd: Fix stable writes
  nfsd: Allow nfsd_vfs_write() to take the nfsd_file as an argument
  nfsd: Fix a soft lockup race in nfsd_file_mark_find_or_create()
  nfsd: Reduce the number of calls to nfsd_file_gc()
  nfsd: Schedule the laundrette regularly irrespective of file errors
  nfsd: Remove unused constant NFSD_FILE_LRU_RESCAN
  nfsd: Containerise filecache laundrette
  ...
2020-02-07 17:50:21 -08:00
Linus Torvalds
f43574d0ac NFS Client Updates for Linux 5.6
Stable bugfixes:
 - Fix memory leaks and corruption in readdir # v2.6.37+
 - Directory page cache needs to be locked when read # v2.6.37+
 
 New features:
 - Convert NFS to use the new mount API
 - Add "softreval" mount option to let clients use cache if server goes down
 - Add a config option to compile without UDP support
 - Limit the number of inactive delegations the client can cache at once
 - Improved readdir concurrency using iterate_shared()
 
 Other bugfixes and cleanups:
 - More 64-bit time conversions
 - Add additional diagnostic tracepoints
 - Check for holes in swapfiles, and add dependency on CONFIG_SWAP
 - Various xprtrdma cleanups to prepare for 5.7's changes
 - Several fixes for NFS writeback and commit handling
 - Fix acls over krb5i/krb5p mounts
 - Recover from premature loss of openstateids
 - Fix NFS v3 chacl and chmod bug
 - Compare creds using cred_fscmp()
 - Use kmemdup_nul() in more places
 - Optimize readdir cache page invalidation
 - Lease renewal and recovery fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAl48kMUACgkQ18tUv7Cl
 QOs/bA/+KAHaee+1jWdgRS88CnNDfeokU2sGWuyXWrVTmiKZ+IjnIUIWqmeKhVyg
 RTbaG4PGTIwiLDFibgzdnc3cTOQEgLnVGWWZ50Xh3b7ubock7+/4JHxqZS+/f3vf
 yqwM0dZaXi5Kcx1kEJ+niBxuzkc9mFI+nHh+wLIlin/kaaUdLKu7mP3NXj2cmWxN
 NoRaKc2gEvkPHhPSH4Z1DVXTHxvH2REFvt9APPUgfLfqcUVHV9b7V/wI/roiGWMn
 53h6f38IdqoNQIpzMog/k/va67NLmEvUZOlpCYPyanPOjuxTrmi8iC2S6gLEOjtc
 GGnQnc5skVL31seFR1NbOJiiN3hTLTncnoXza0cKtYxmo7a/FjXApw4jCu3Rkrav
 UXpCI4O6+2AVVG+pEPbjQy3/GEImeoGvp+xr57jBSZBHoDZU9LDwag65qvZ1btIq
 KOBx2gweQz0aB2heXmfee7qzxFdftHmtMWhIMnJASKNuAWGL23Scqem+d97i2T6H
 7y9OJ3aOXiYxFMLYJCsLWjUJxYiaIANNBmHMjf27mZzcdDuxGFms277CMpNPr3SU
 WZk6/oKw9jaRSzHzaKgVDXiULLXQE1/xZ/mvgR/zk1QAusyeXPvVnMdxoRdxFdXb
 QGZHgUqvFvYi8Lufvs+ZLGS4sAp7oD/Q+lNPXn7cniSwfY4uJiw=
 =b6+F
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Puyll NFS client updates from Anna Schumaker:
 "Stable bugfixes:
   - Fix memory leaks and corruption in readdir # v2.6.37+
   - Directory page cache needs to be locked when read # v2.6.37+

  New features:
   - Convert NFS to use the new mount API
   - Add "softreval" mount option to let clients use cache if server goes down
   - Add a config option to compile without UDP support
   - Limit the number of inactive delegations the client can cache at once
   - Improved readdir concurrency using iterate_shared()

  Other bugfixes and cleanups:
   - More 64-bit time conversions
   - Add additional diagnostic tracepoints
   - Check for holes in swapfiles, and add dependency on CONFIG_SWAP
   - Various xprtrdma cleanups to prepare for 5.7's changes
   - Several fixes for NFS writeback and commit handling
   - Fix acls over krb5i/krb5p mounts
   - Recover from premature loss of openstateids
   - Fix NFS v3 chacl and chmod bug
   - Compare creds using cred_fscmp()
   - Use kmemdup_nul() in more places
   - Optimize readdir cache page invalidation
   - Lease renewal and recovery fixes"

* tag 'nfs-for-5.6-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (93 commits)
  NFSv4.0: nfs4_do_fsinfo() should not do implicit lease renewals
  NFSv4: try lease recovery on NFS4ERR_EXPIRED
  NFS: Fix memory leaks
  nfs: optimise readdir cache page invalidation
  NFS: Switch readdir to using iterate_shared()
  NFS: Use kmemdup_nul() in nfs_readdir_make_qstr()
  NFS: Directory page cache pages need to be locked when read
  NFS: Fix memory leaks and corruption in readdir
  SUNRPC: Use kmemdup_nul() in rpc_parse_scope_id()
  NFS: Replace various occurrences of kstrndup() with kmemdup_nul()
  NFSv4: Limit the total number of cached delegations
  NFSv4: Add accounting for the number of active delegations held
  NFSv4: Try to return the delegation immediately when marked for return on close
  NFS: Clear NFS_DELEGATION_RETURN_IF_CLOSED when the delegation is returned
  NFSv4: nfs_inode_evict_delegation() should set NFS_DELEGATION_RETURNING
  NFS: nfs_find_open_context() should use cred_fscmp()
  NFS: nfs_access_get_cached_rcu() should use cred_fscmp()
  NFSv4: pnfs_roc() must use cred_fscmp() to compare creds
  NFS: remove unused macros
  nfs: Return EINVAL rather than ERANGE for mount parse errors
  ...
2020-02-07 17:39:56 -08:00
Martin KaFai Lau
88d6f130e5 bpf: Improve bucket_log calculation logic
It was reported that the max_t, ilog2, and roundup_pow_of_two macros have
exponential effects on the number of states in the sparse checker.

This patch breaks them up by calculating the "nbuckets" first so that the
"bucket_log" only needs to take ilog2().

In addition, Linus mentioned:

  Patch looks good, but I'd like to point out that it's not just sparse.

  You can see it with a simple

    make net/core/bpf_sk_storage.i
    grep 'smap->bucket_log = ' net/core/bpf_sk_storage.i | wc

  and see the end result:

      1  365071 2686974

  That's one line (the assignment line) that is 2,686,974 characters in
  length.

  Now, sparse does happen to react particularly badly to that (I didn't
  look to why, but I suspect it's just that evaluating all the types
  that don't actually ever end up getting used ends up being much more
  expensive than it should be), but I bet it's not good for gcc either.

Fixes: 6ac99e8f23 ("bpf: Introduce bpf sk local storage")
Reported-by: Randy Dunlap <rdunlap@infradead.org>
Reported-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Luc Van Oostenryck <luc.vanoostenryck@gmail.com>
Link: https://lore.kernel.org/bpf/20200207081810.3918919-1-kafai@fb.com
2020-02-07 23:01:41 +01:00
Jakub Sitnicki
0b2dc83906 bpf, sockhash: Synchronize_rcu before free'ing map
We need to have a synchronize_rcu before free'ing the sockhash because any
outstanding psock references will have a pointer to the map and when they
use it, this could trigger a use after free.

This is a sister fix for sockhash, following commit 2bb90e5cc9 ("bpf:
sockmap, synchronize_rcu before free'ing map") which addressed sockmap,
which comes from a manual audit.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200206111652.694507-3-jakub@cloudflare.com
2020-02-07 22:36:26 +01:00
Jakub Sitnicki
db6a5018b6 bpf, sockmap: Don't sleep while holding RCU lock on tear-down
rcu_read_lock is needed to protect access to psock inside sock_map_unref
when tearing down the map. However, we can't afford to sleep in lock_sock
while in RCU read-side critical section. Grab the RCU lock only after we
have locked the socket.

This fixes RCU warnings triggerable on a VM with 1 vCPU when free'ing a
sockmap/sockhash that contains at least one socket:

| =============================
| WARNING: suspicious RCU usage
| 5.5.0-04005-g8fc91b972b73 #450 Not tainted
| -----------------------------
| include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section!
|
| other info that might help us debug this:
|
|
| rcu_scheduler_active = 2, debug_locks = 1
| 4 locks held by kworker/0:1/62:
|  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
|  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
|  #2: ffffffff82065d20 (rcu_read_lock){....}, at: sock_map_free+0x5/0x170
|  #3: ffff8881368c5df8 (&stab->lock){+...}, at: sock_map_free+0x64/0x170
|
| stack backtrace:
| CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04005-g8fc91b972b73 #450
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
| Workqueue: events bpf_map_free_deferred
| Call Trace:
|  dump_stack+0x71/0xa0
|  ___might_sleep+0x105/0x190
|  lock_sock_nested+0x28/0x90
|  sock_map_free+0x95/0x170
|  bpf_map_free_deferred+0x58/0x80
|  process_one_work+0x260/0x5e0
|  worker_thread+0x4d/0x3e0
|  kthread+0x108/0x140
|  ? process_one_work+0x5e0/0x5e0
|  ? kthread_park+0x90/0x90
|  ret_from_fork+0x3a/0x50

| =============================
| WARNING: suspicious RCU usage
| 5.5.0-04005-g8fc91b972b73-dirty #452 Not tainted
| -----------------------------
| include/linux/rcupdate.h:272 Illegal context switch in RCU read-side critical section!
|
| other info that might help us debug this:
|
|
| rcu_scheduler_active = 2, debug_locks = 1
| 4 locks held by kworker/0:1/62:
|  #0: ffff88813b019748 ((wq_completion)events){+.+.}, at: process_one_work+0x1d7/0x5e0
|  #1: ffffc900000abe50 ((work_completion)(&map->work)){+.+.}, at: process_one_work+0x1d7/0x5e0
|  #2: ffffffff82065d20 (rcu_read_lock){....}, at: sock_hash_free+0x5/0x1d0
|  #3: ffff888139966e00 (&htab->buckets[i].lock){+...}, at: sock_hash_free+0x92/0x1d0
|
| stack backtrace:
| CPU: 0 PID: 62 Comm: kworker/0:1 Not tainted 5.5.0-04005-g8fc91b972b73-dirty #452
| Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
| Workqueue: events bpf_map_free_deferred
| Call Trace:
|  dump_stack+0x71/0xa0
|  ___might_sleep+0x105/0x190
|  lock_sock_nested+0x28/0x90
|  sock_hash_free+0xec/0x1d0
|  bpf_map_free_deferred+0x58/0x80
|  process_one_work+0x260/0x5e0
|  worker_thread+0x4d/0x3e0
|  kthread+0x108/0x140
|  ? process_one_work+0x5e0/0x5e0
|  ? kthread_park+0x90/0x90
|  ret_from_fork+0x3a/0x50

Fixes: 7e81a35302 ("bpf: Sockmap, ensure sock lock held during tear down")
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200206111652.694507-2-jakub@cloudflare.com
2020-02-07 22:36:26 +01:00
Lorenz Bauer
85b8ac01a4 bpf, sockmap: Check update requirements after locking
It's currently possible to insert sockets in unexpected states into
a sockmap, due to a TOCTTOU when updating the map from a syscall.
sock_map_update_elem checks that sk->sk_state == TCP_ESTABLISHED,
locks the socket and then calls sock_map_update_common. At this
point, the socket may have transitioned into another state, and
the earlier assumptions don't hold anymore. Crucially, it's
conceivable (though very unlikely) that a socket has become unhashed.
This breaks the sockmap's assumption that it will get a callback
via sk->sk_prot->unhash.

Fix this by checking the (fixed) sk_type and sk_protocol without the
lock, followed by a locked check of sk_state.

Unfortunately it's not possible to push the check down into
sock_(map|hash)_update_common, since BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB
run before the socket has transitioned from TCP_SYN_RECV into
TCP_ESTABLISHED.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Link: https://lore.kernel.org/bpf/20200207103713.28175-1-lmb@cloudflare.com
2020-02-07 22:28:04 +01:00
Al Viro
d7167b1499 fs_parse: fold fs_parameter_desc/fs_parameter_spec
The former contains nothing but a pointer to an array of the latter...

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:37 -05:00
Eric Sandeen
96cafb9ccb fs_parser: remove fs_parameter_description name field
Unused now.

Signed-off-by: Eric Sandeen <sandeen@redhat.com>
Acked-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:36 -05:00
Al Viro
c80c98f0dc ceph_parse_param(), ceph_parse_mon_ips(): switch to passing fc_log
... and now errorf() et.al. are never called with NULL fs_context,
so we can get rid of conditional in those.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:34 -05:00
Al Viro
7f5d38141e new primitive: __fs_parse()
fs_parse() analogue taking p_log instead of fs_context.
fs_parse() turned into a wrapper, callers in ceph_common and rbd
switched to __fs_parse().

As the result, fs_parse() never gets NULL fs_context and neither
do fs_context-based logging primitives

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:34 -05:00
Al Viro
2c3f3dc315 switch rbd and libceph to p_log-based primitives
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 14:48:33 -05:00
Roberto Bergantinos Corpas
3d96208c30 sunrpc: expiry_time should be seconds not timeval
When upcalling gssproxy, cache_head.expiry_time is set as a
timeval, not seconds since boot. As such, RPC cache expiry
logic will not clean expired objects created under
auth.rpcsec.context cache.

This has proven to cause kernel memory leaks on field. Using
64 bit variants of getboottime/timespec

Expiration times have worked this way since 2010's c5b29f885a "sunrpc:
use seconds since boot in expiry cache".  The gssproxy code introduced
in 2012 added gss_proxy_save_rsc and introduced the bug.  That's a while
for this to lurk, but it required a bit of an extreme case to make it
obvious.

Signed-off-by: Roberto Bergantinos Corpas <rbergant@redhat.com>
Cc: stable@vger.kernel.org
Fixes: 030d794bf4 "SUNRPC: Use gssproxy upcall for server..."
Tested-By: Frank Sorenson <sorenson@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-02-07 13:30:41 -05:00
Ido Schimmel
dfa7f70959 drop_monitor: Do not cancel uninitialized work item
Drop monitor uses a work item that takes care of constructing and
sending netlink notifications to user space. In case drop monitor never
started to monitor, then the work item is uninitialized and not
associated with a function.

Therefore, a stop command from user space results in canceling an
uninitialized work item which leads to the following warning [1].

Fix this by not processing a stop command if drop monitor is not
currently monitoring.

[1]
[   31.735402] ------------[ cut here ]------------
[   31.736470] WARNING: CPU: 0 PID: 143 at kernel/workqueue.c:3032 __flush_work+0x89f/0x9f0
...
[   31.738120] CPU: 0 PID: 143 Comm: dwdump Not tainted 5.5.0-custom-09491-g16d4077796b8 #727
[   31.741968] RIP: 0010:__flush_work+0x89f/0x9f0
...
[   31.760526] Call Trace:
[   31.771689]  __cancel_work_timer+0x2a6/0x3b0
[   31.776809]  net_dm_cmd_trace+0x300/0xef0
[   31.777549]  genl_rcv_msg+0x5c6/0xd50
[   31.781005]  netlink_rcv_skb+0x13b/0x3a0
[   31.784114]  genl_rcv+0x29/0x40
[   31.784720]  netlink_unicast+0x49f/0x6a0
[   31.787148]  netlink_sendmsg+0x7cf/0xc80
[   31.790426]  ____sys_sendmsg+0x620/0x770
[   31.793458]  ___sys_sendmsg+0xfd/0x170
[   31.802216]  __sys_sendmsg+0xdf/0x1a0
[   31.806195]  do_syscall_64+0xa0/0x540
[   31.806885]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: 8e94c3bc92 ("drop_monitor: Allow user to start monitoring hardware drops")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-07 18:48:36 +01:00
Eric Dumazet
db3fa27102 ipv6/addrconf: fix potential NULL deref in inet6_set_link_af()
__in6_dev_get(dev) called from inet6_set_link_af() can return NULL.

The needed check has been recently removed, let's add it back.

While do_setlink() does call validate_linkmsg() :
...
err = validate_linkmsg(dev, tb); /* OK at this point */
...

It is possible that the following call happening before the
->set_link_af() removes IPv6 if MTU is less than 1280 :

if (tb[IFLA_MTU]) {
    err = dev_set_mtu_ext(dev, nla_get_u32(tb[IFLA_MTU]), extack);
    if (err < 0)
          goto errout;
    status |= DO_SETLINK_MODIFIED;
}
...

if (tb[IFLA_AF_SPEC]) {
   ...
   err = af_ops->set_link_af(dev, af);
      ->inet6_set_link_af() // CRASH because idev is NULL

Please note that IPv4 is immune to the bug since inet_set_link_af() does :

struct in_device *in_dev = __in_dev_get_rcu(dev);
if (!in_dev)
    return -EAFNOSUPPORT;

This problem has been mentioned in commit cf7afbfeb8 ("rtnl: make
link af-specific updates atomic") changelog :

    This method is not fail proof, while it is currently sufficient
    to make set_link_af() inerrable and thus 100% atomic, the
    validation function method will not be able to detect all error
    scenarios in the future, there will likely always be errors
    depending on states which are f.e. not protected by rtnl_mutex
    and thus may change between validation and setting.

IPv6: ADDRCONF(NETDEV_CHANGE): lo: link becomes ready
general protection fault, probably for non-canonical address 0xdffffc0000000056: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x00000000000002b0-0x00000000000002b7]
CPU: 0 PID: 9698 Comm: syz-executor712 Not tainted 5.5.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:inet6_set_link_af+0x66e/0xae0 net/ipv6/addrconf.c:5733
Code: 38 d0 7f 08 84 c0 0f 85 20 03 00 00 48 8d bb b0 02 00 00 45 0f b6 64 24 04 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 1a 03 00 00 44 89 a3 b0 02 00
RSP: 0018:ffffc90005b06d40 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff86df39a6
RDX: 0000000000000056 RSI: ffffffff86df3e74 RDI: 00000000000002b0
RBP: ffffc90005b06e70 R08: ffff8880a2ac0380 R09: ffffc90005b06db0
R10: fffff52000b60dbe R11: ffffc90005b06df7 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8880a1fcc424 R15: dffffc0000000000
FS:  0000000000c46880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 000055f0494ca0d0 CR3: 000000009e4ac000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 do_setlink+0x2a9f/0x3720 net/core/rtnetlink.c:2754
 rtnl_group_changelink net/core/rtnetlink.c:3103 [inline]
 __rtnl_newlink+0xdd1/0x1790 net/core/rtnetlink.c:3257
 rtnl_newlink+0x69/0xa0 net/core/rtnetlink.c:3377
 rtnetlink_rcv_msg+0x45e/0xaf0 net/core/rtnetlink.c:5438
 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
 rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5456
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0x59e/0x7e0 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:672
 ____sys_sendmsg+0x753/0x880 net/socket.c:2343
 ___sys_sendmsg+0x100/0x170 net/socket.c:2397
 __sys_sendmsg+0x105/0x1d0 net/socket.c:2430
 __do_sys_sendmsg net/socket.c:2439 [inline]
 __se_sys_sendmsg net/socket.c:2437 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2437
 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4402e9
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fffd62fbcf8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004402e9
RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000008 R09: 00000000004002c8
R10: 0000000000000005 R11: 0000000000000246 R12: 0000000000401b70
R13: 0000000000401c00 R14: 0000000000000000 R15: 0000000000000000
Modules linked in:
---[ end trace cfa7664b8fdcdff3 ]---
RIP: 0010:inet6_set_link_af+0x66e/0xae0 net/ipv6/addrconf.c:5733
Code: 38 d0 7f 08 84 c0 0f 85 20 03 00 00 48 8d bb b0 02 00 00 45 0f b6 64 24 04 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 <0f> b6 04 02 84 c0 74 08 3c 03 0f 8e 1a 03 00 00 44 89 a3 b0 02 00
RSP: 0018:ffffc90005b06d40 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff86df39a6
RDX: 0000000000000056 RSI: ffffffff86df3e74 RDI: 00000000000002b0
RBP: ffffc90005b06e70 R08: ffff8880a2ac0380 R09: ffffc90005b06db0
R10: fffff52000b60dbe R11: ffffc90005b06df7 R12: 0000000000000000
R13: 0000000000000000 R14: ffff8880a1fcc424 R15: dffffc0000000000
FS:  0000000000c46880(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000004 CR3: 000000009e4ac000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400

Fixes: 7dc2bccab0 ("Validate required parameters in inet6_validate_link_af")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Bisected-and-reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-07 18:43:23 +01:00
Florian Westphal
a7da92c2c8 netfilter: flowtable: skip offload setup if disabled
nftables test case
tests/shell/testcases/flowtable/0001flowtable_0

results in a crash. After the refactor, if we leave early via
nf_flowtable_hw_offload(), then "struct flow_block_offload" is left
in an uninitialized state, but later users assume its initialised.

Fixes: a7965d58dd ("netfilter: flowtable: add nf_flow_table_offload_cmd()")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-07 15:53:32 +01:00
Cong Wang
8d0015a7ab netfilter: xt_hashlimit: limit the max size of hashtable
The user-specified hashtable size is unbound, this could
easily lead to an OOM or a hung task as we hold the global
mutex while allocating and initializing the new hashtable.

Add a max value to cap both cfg->size and cfg->max, as
suggested by Florian.

Reported-and-tested-by: syzbot+adf6c6c2be1c3a718121@syzkaller.appspotmail.com
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-07 15:53:31 +01:00
Cong Wang
c4a3922d2d netfilter: xt_hashlimit: reduce hashlimit_mutex scope for htable_put()
It is unnecessary to hold hashlimit_mutex for htable_destroy()
as it is already removed from the global hashtable and its
refcount is already zero.

Also, switch hinfo->use to refcount_t so that we don't have
to hold the mutex until it reaches zero in htable_put().

Reported-and-tested-by: syzbot+adf6c6c2be1c3a718121@syzkaller.appspotmail.com
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-02-07 15:53:31 +01:00
Markus Theil
9b125c2799 mac80211: support NL80211_EXT_FEATURE_CONTROL_PORT_OVER_NL80211_MAC_ADDRS
This is now a trivial patch, but for seeing the actual changes
I (Johannes) split it out from the original.

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20200115125522.3755-1-markus.theil@tu-ilmenau.de
[split into separate cfg80211/mac80211 patches]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:59:40 +01:00
Markus Theil
8c3ed7aa2b nl80211: add src and dst addr attributes for control port tx/rx
When using control port over nl80211 in AP mode with
pre-authentication, APs need to forward frames to other
APs defined by their MAC address. Before this patch,
pre-auth frames reaching user space over nl80211 control
port  have no longer any information about the dest attached,
which can be used for forwarding to a controller or injecting
the frame back to a ethernet interface over a AF_PACKET
socket.
Analog problems exist, when forwarding pre-auth frames from
AP -> STA.

This patch therefore adds the NL80211_ATTR_DST_MAC and
NL80211_ATTR_SRC_MAC attributes to provide more context
information when forwarding.
The respective arguments are optional on tx and included on rx.
Therefore unaware existing software is not affected.

Software which wants to detect this feature, can do so
by checking against:
  NL80211_EXT_FEATURE_CONTROL_PORT_OVER_NL80211_MAC_ADDRS

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20200115125522.3755-1-markus.theil@tu-ilmenau.de
[split into separate cfg80211/mac80211 patches]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:58:37 +01:00
Shaul Triebitz
c0058df733 mac80211: parse also the RSNXE IE
Parse also the RSN Extension IE when parsing the rest of the IEs.
It will be used in a later patch.

Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-21-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:55:21 +01:00
Ilan Peer
ff74c51e8f cfg80211/mac80211: Allow user space to register for station Rx authentication
To support Pre Association Security Negotiation (PASN) while already
associated to one AP, allow user space to register to Rx
authentication frames, so that the user space logic would be able to
receive/handle authentication frames from a different AP as part of
PASN.

Note that it is expected that user space would intelligently register
for Rx authentication frames, i.e., only when PASN is used and
configure a match filter only for PASN authentication algorithm, as
otherwise the MLME functionality of mac80211 would be broken.

Additionally, since some versions of the user space daemons wrongly
register to all types of authentication frames (which might result in
unexpected behavior) allow such registration if the request is for a
specific authentication algorithm number.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131114529.894206-1-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:54:09 +01:00
Sergey Matyukevich
bfb7bac3a8 cfg80211: check wiphy driver existence for drvinfo report
When preparing ethtool drvinfo, check if wiphy driver is defined
before dereferencing it. Driver may not exist, e.g. if wiphy is
attached to a virtual platform device.

Signed-off-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20200203105644.28875-1-sergey.matyukevich.os@quantenna.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:53:26 +01:00
Ilan Peer
52b4810bed mac80211: Remove support for changing AP SMPS mode
The SMPS feature is defined in the specification only to be
used by non-AP stations and not by APs, so remove the support
for changing the AP's SMPS mode dynamically.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-20-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:48:29 +01:00
Ilan Peer
c4d800dcc7 mac80211: Handle SMPS mode changes only in AP mode
According to IEEE802.11 specifications the SM power save field
in the HT capability IE and the HE extended capability IE is valid
only in (re)association frames and should be ignored otherwise.
Remove code paths that handled this also for non AP modes.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-17-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:47:30 +01:00
Andrei Otcheretianski
85b27ef734 mac80211: Accept broadcast probe responses on 6GHz band
An AP that operates on 6GHz may respond with a broadcast probe response.
Don't ignore such frames.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-14-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:47:15 +01:00
Tova Mussai
f46209b9ff mac80211: HE: set RX NSS
In case of HE, the RX NSS is taken from the HE capabilities.
If the supported NSS capabilities that are reported by AP for
HE mode in the HE Capabilities element are different from the NSS
capabilities that are reported by AP for the VHT mode in the VHT
Capabilities element, use the lowest supported NSS to not get all
the values confused.

Signed-off-by: Tova Mussai <tova.mussai@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-9-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:46:37 +01:00
Johannes Berg
cf2c9cc398 mac80211: set station bandwidth from HE capability
Set the station bandwidth in HE capability parsing and from
HE capability as the HT/VHT information will not be present
on the 6 GHz band.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-5-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:45:19 +01:00
Johannes Berg
2ff69b0e25 mac80211: remove supported channels element in 6 GHz if ECSA support
We should not include the supported channels element if we have
(advertise) support for extended channel switching. To avoid any
interop issues because we always added it in the past, obey this
restriction only in the (new) 6 GHz band.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131113111.893106-1-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:43:00 +01:00
Daniel Gabay
8cadb20714 mac80211: update condition for HE disablement
Disable HE if the beacon does not contain an HE operation IE.

Signed-off-by: Daniel Gabay <daniel.gabay@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-16-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:41:58 +01:00
Luca Coelho
4a65cc2437 mac80211: make ieee80211_wep_init() return void
This function always returns 0, so there's no point in returning int.
Make it void and remove the impossible error-path when calling it.

Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-11-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:40:34 +01:00
Johannes Berg
07b83d2ecd mac80211: allow changing TX-related netdev features
Set ndev->hw_features as well as ndev->features to allow
changing the TX-related features with ethtool.

We cannot (yet) change RX-related features since that
requires telling the driver about it and we have no API
for that yet.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-10-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:40:20 +01:00
Haim Dreyfuss
b5db1acab1 mac80211: check whether HE connection is allowed by the reg domain
The wireless device might be capable to connect HE
as well as the AP. However, the regulatory domain might
forbid it. Check whether the regulatory domain allows HE connection
when considering if HE IE should be added.
Also, add it when setting our peer capability.

Signed-off-by: Haim Dreyfuss <haim.dreyfuss@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-8-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:39:27 +01:00
Johannes Berg
e4d005b80d mac80211: refactor extended element parsing
This code was really ugly, refactor it a bit to make it more
readable.

While at it, use sizeof() and fix the UORA element length
check bug.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-4-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:38:21 +01:00
Johannes Berg
75e296e9b2 mac80211: simplify and improve HT/VHT/HE disable code
Check early on that a device has support for QoS (at least 4
queues) when it supports HT/VHT/HE, so we don't have to check
this while connecting.

This lets us clean up the code there: move some of it into
channel preparation to clean up a bit more, and then change
the logic to only check the "wmm_used" flag.

Additionally, disable HE consistently when VHT is disabled.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-3-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:37:54 +01:00
Veerendranath Jakkam
d6039a3416 cfg80211: Enhance the AKM advertizement to support per interface.
Commit ab4dfa2053 ("cfg80211: Allow drivers to advertise supported AKM
suites") introduces the support to advertize supported AKMs to userspace.

This needs an enhancement to advertize the AKM support per interface type,
specifically for the cfg80211-based drivers that implement SME and use
different mechanisms to support the AKM's for each interface type (e.g.,
the support for SAE, OWE AKM's take different paths for such drivers on
STA/AP mode).

This commit aims the same and enhances the earlier mechanism of advertizing
the AKMs per wiphy. Add new nl80211 attributes and data structure to
provide supported AKMs per interface type to userspace.

the AKMs advertized in akm_suites are default capabilities if not
advertized for a specific interface type in iftype_akm_suites.

Signed-off-by: Veerendranath Jakkam <vjakkam@codeaurora.org>
Link: https://lore.kernel.org/r/20200126203032.21934-1-vjakkam@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:34:26 +01:00
Haim Dreyfuss
1e61d82cca cfg80211: add no HE indication to the channel flag
The regulatory domain might forbid HE operation.  Certain regulatory
domains may restrict it for specific channels whereas others may do it
for the whole regulatory domain.

Add an option to indicate it in the channel flag.

Signed-off-by: Haim Dreyfuss <haim.dreyfuss@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200121081213.733757-1-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:34:09 +01:00
Johannes Berg
a04564c99b mac80211: consider more elements in parsing CRC
We only use the parsing CRC for checking if a beacon changed,
and elements with an ID > 63 cannot be represented in the
filter. Thus, like we did before with WMM and Cisco vendor
elements, just statically add these forgotten items to the
CRC:
 - WLAN_EID_VHT_OPERATION
 - WLAN_EID_OPMODE_NOTIF

I guess that in most cases when VHT/HE operation change, the HT
operation also changed, and so the change was picked up, but we
did notice that pure operating mode notification changes were
ignored.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-22-luca@coelho.fi
[restrict to VHT for the mac80211 branch]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 11:44:02 +01:00
Sara Sharon
2bf973ff9b mac80211: fix quiet mode activation in action frames
Previously I intended to ignore quiet mode in probe response, however
I ended up ignoring it instead for action frames. As a matter of fact,
this path isn't invoked for probe responses to start with. Just revert
this patch.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Fixes: 7976b1e9e3 ("mac80211: ignore quiet mode in probe")
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-15-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 11:44:02 +01:00
Johannes Berg
f2b18baca9 mac80211: use more bits for ack_frame_id
It turns out that this wasn't a good idea, I hit a test failure in
hwsim due to this. That particular failure was easily worked around,
but it raised questions: if an AP needs to, for example, send action
frames to each connected station, the current limit is nowhere near
enough (especially if those stations are sleeping and the frames are
queued for a while.)

Shuffle around some bits to make more room for ack_frame_id to allow
up to 8192 queued up frames, that's enough for queueing 4 frames to
each connected station, even at the maximum of 2007 stations on a
single AP.

We take the bits from band (which currently only 2 but I leave 3 in
case we add another band) and from the hw_queue, which can only need
4 since it has a limit of 16 queues.

Fixes: 6912daed05 ("mac80211: Shrink the size of ack_frame_id to make room for tx_time_est")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20200115122549.b9a4ef9f4980.Ied52ed90150220b83a280009c590b65d125d087c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 11:32:27 +01:00
John Crispin
3c706b973b mac80211: fix 11w when using encapsulation offloading
The 802.11 encapsulation returned early when setting up the keys in case
offloading was enabled. This causes ieee802.11w to not work anymore. Fix
this by moving the check for offloading into the switch/case construct and
allowing CCMP/GCMP keys. With this patch applied ieee80211w works again
when enabling offloading.

Fixes: 50ff477a86 ("mac80211: add 802.11 encapsulation offloading support")
Reported-by: Maharaja Kennadyrajan <mkenna@qti.qualcomm.com>
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20200203122812.18993-1-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 11:31:36 +01:00
Vinicius Costa Gomes
bfabd41da3 taprio: Fix dropping packets when using taprio + ETF offloading
When using taprio offloading together with ETF offloading, configured
like this, for example:

$ tc qdisc replace dev $IFACE parent root handle 100 taprio \
  	num_tc 4 \
        map 2 2 1 0 3 2 2 2 2 2 2 2 2 2 2 2 \
	queues 1@0 1@1 1@2 1@3 \
	base-time $BASE_TIME \
	sched-entry S 01 1000000 \
	sched-entry S 0e 1000000 \
	flags 0x2

$ tc qdisc replace dev $IFACE parent 100:1 etf \
     	offload delta 300000 clockid CLOCK_TAI

During enqueue, it works out that the verification added for the
"txtime" assisted mode is run when using taprio + ETF offloading, the
only thing missing is initializing the 'next_txtime' of all the cycle
entries. (if we don't set 'next_txtime' all packets from SO_TXTIME
sockets are dropped)

Fixes: 4cfd5779bd ("taprio: Add support for txtime-assist mode")
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-07 11:30:03 +01:00
Vinicius Costa Gomes
7c16680a08 taprio: Use taprio_reset_tc() to reset Traffic Classes configuration
When destroying the current taprio instance, which can happen when the
creation of one fails, we should reset the traffic class configuration
back to the default state.

netdev_reset_tc() is a better way because in addition to setting the
number of traffic classes to zero, it also resets the priority to
traffic classes mapping to the default value.

Fixes: 5a781ccbd1 ("tc: Add support for configuring the taprio scheduler")
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-07 11:30:03 +01:00
Vinicius Costa Gomes
49c684d79c taprio: Add missing policy validation for flags
netlink policy validation for the 'flags' argument was missing.

Fixes: 4cfd5779bd ("taprio: Add support for txtime-assist mode")
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-07 11:30:03 +01:00
Vinicius Costa Gomes
a9d6227436 taprio: Fix still allowing changing the flags during runtime
Because 'q->flags' starts as zero, and zero is a valid value, we
aren't able to detect the transition from zero to something else
during "runtime".

The solution is to initialize 'q->flags' with an invalid value, so we
can detect if 'q->flags' was set by the user or not.

To better solidify the behavior, 'flags' handling is moved to a
separate function. The behavior is:
 - 'flags' if unspecified by the user, is assumed to be zero;
 - 'flags' cannot change during "runtime" (i.e. a change() request
 cannot modify it);

With this new function we can remove taprio_flags, which should reduce
the risk of future accidents.

Allowing flags to be changed was causing the following RCU stall:

[ 1730.558249] rcu: INFO: rcu_preempt detected stalls on CPUs/tasks:
[ 1730.558258] rcu: 	  6-...0: (190 ticks this GP) idle=922/0/0x1 softirq=25580/25582 fqs=16250
[ 1730.558264] 		  (detected by 2, t=65002 jiffies, g=33017, q=81)
[ 1730.558269] Sending NMI from CPU 2 to CPUs 6:
[ 1730.559277] NMI backtrace for cpu 6
[ 1730.559277] CPU: 6 PID: 0 Comm: swapper/6 Tainted: G            E     5.5.0-rc6+ #35
[ 1730.559278] Hardware name: Gigabyte Technology Co., Ltd. Z390 AORUS ULTRA/Z390 AORUS ULTRA-CF, BIOS F7 03/14/2019
[ 1730.559278] RIP: 0010:__hrtimer_run_queues+0xe2/0x440
[ 1730.559278] Code: 48 8b 43 28 4c 89 ff 48 8b 75 c0 48 89 45 c8 e8 f4 bb 7c 00 0f 1f 44 00 00 65 8b 05 40 31 f0 68 89 c0 48 0f a3 05 3e 5c 25 01 <0f> 82 fc 01 00 00 48 8b 45 c8 48 89 df ff d0 89 45 c8 0f 1f 44 00
[ 1730.559279] RSP: 0018:ffff9970802d8f10 EFLAGS: 00000083
[ 1730.559279] RAX: 0000000000000006 RBX: ffff8b31645bff38 RCX: 0000000000000000
[ 1730.559280] RDX: 0000000000000000 RSI: ffffffff9710f2ec RDI: ffffffff978daf0e
[ 1730.559280] RBP: ffff9970802d8f68 R08: 0000000000000000 R09: 0000000000000000
[ 1730.559280] R10: 0000018336d7944e R11: 0000000000000001 R12: ffff8b316e39f9c0
[ 1730.559281] R13: ffff8b316e39f940 R14: ffff8b316e39f998 R15: ffff8b316e39f7c0
[ 1730.559281] FS:  0000000000000000(0000) GS:ffff8b316e380000(0000) knlGS:0000000000000000
[ 1730.559281] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1730.559281] CR2: 00007f1105303760 CR3: 0000000227210005 CR4: 00000000003606e0
[ 1730.559282] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1730.559282] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 1730.559282] Call Trace:
[ 1730.559282]  <IRQ>
[ 1730.559283]  ? taprio_dequeue_soft+0x2d0/0x2d0 [sch_taprio]
[ 1730.559283]  hrtimer_interrupt+0x104/0x220
[ 1730.559283]  ? irqtime_account_irq+0x34/0xa0
[ 1730.559283]  smp_apic_timer_interrupt+0x6d/0x230
[ 1730.559284]  apic_timer_interrupt+0xf/0x20
[ 1730.559284]  </IRQ>
[ 1730.559284] RIP: 0010:cpu_idle_poll+0x35/0x1a0
[ 1730.559285] Code: 88 82 ff 65 44 8b 25 12 7d 73 68 0f 1f 44 00 00 e8 90 c3 89 ff fb 65 48 8b 1c 25 c0 7e 01 00 48 8b 03 a8 08 74 0b eb 1c f3 90 <48> 8b 03 a8 08 75 13 8b 05 be a8 a8 00 85 c0 75 ed e8 75 48 84 ff
[ 1730.559285] RSP: 0018:ffff997080137ea8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
[ 1730.559285] RAX: 0000000000000001 RBX: ffff8b316bc3c580 RCX: 0000000000000000
[ 1730.559286] RDX: 0000000000000001 RSI: 000000002819aad9 RDI: ffffffff978da730
[ 1730.559286] RBP: ffff997080137ec0 R08: 0000018324a6d387 R09: 0000000000000000
[ 1730.559286] R10: 0000000000000400 R11: 0000000000000001 R12: 0000000000000006
[ 1730.559286] R13: ffff8b316bc3c580 R14: 0000000000000000 R15: 0000000000000000
[ 1730.559287]  ? cpu_idle_poll+0x20/0x1a0
[ 1730.559287]  ? cpu_idle_poll+0x20/0x1a0
[ 1730.559287]  do_idle+0x4d/0x1f0
[ 1730.559287]  ? complete+0x44/0x50
[ 1730.559288]  cpu_startup_entry+0x1b/0x20
[ 1730.559288]  start_secondary+0x142/0x180
[ 1730.559288]  secondary_startup_64+0xb6/0xc0
[ 1776.686313] nvme nvme0: I/O 96 QID 1 timeout, completion polled

Fixes: 4cfd5779bd ("taprio: Add support for txtime-assist mode")
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-07 11:30:03 +01:00
Vinicius Costa Gomes
5652e63df3 taprio: Fix enabling offload with wrong number of traffic classes
If the driver implementing taprio offloading depends on the value of
the network device number of traffic classes (dev->num_tc) for
whatever reason, it was going to receive the value zero. The value was
only set after the offloading function is called.

So, moving setting the number of traffic classes to before the
offloading function is called fixes this issue. This is safe because
this only happens when taprio is instantiated (we don't allow this
configuration to be changed without first removing taprio).

Fixes: 9c66d15646 ("taprio: Add support for hardware offloading")
Reported-by: Po Liu <po.liu@nxp.com>
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Acked-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-07 11:30:03 +01:00
David Howells
963485d436 rxrpc: Fix call RCU cleanup using non-bh-safe locks
rxrpc_rcu_destroy_call(), which is called as an RCU callback to clean up a
put call, calls rxrpc_put_connection() which, deep in its bowels, takes a
number of spinlocks in a non-BH-safe way, including rxrpc_conn_id_lock and
local->client_conns_lock.  RCU callbacks, however, are normally called from
softirq context, which can cause lockdep to notice the locking
inconsistency.

To get lockdep to detect this, it's necessary to have the connection
cleaned up on the put at the end of the last of its calls, though normally
the clean up is deferred.  This can be induced, however, by starting a call
on an AF_RXRPC socket and then closing the socket without reading the
reply.

Fix this by having rxrpc_rcu_destroy_call() punt the destruction to a
workqueue if in softirq-mode and defer the destruction to process context.

Note that another way to fix this could be to add a bunch of bh-disable
annotations to the spinlocks concerned - and there might be more than just
those two - but that means spending more time with BHs disabled.

Note also that some of these places were covered by bh-disable spinlocks
belonging to the rxrpc_transport object, but these got removed without the
_bh annotation being retained on the next lock in.

Fixes: 999b69f892 ("rxrpc: Kill the client connection bundle concept")
Reported-by: syzbot+d82f3ac8d87e7ccbb2c9@syzkaller.appspotmail.com
Reported-by: syzbot+3f1fd6b8cbf8702d134e@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Hillf Danton <hdanton@sina.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-07 11:20:57 +01:00
David Howells
b39a934ec7 rxrpc: Fix service call disconnection
The recent patch that substituted a flag on an rxrpc_call for the
connection pointer being NULL as an indication that a call was disconnected
puts the set_bit in the wrong place for service calls.  This is only a
problem if a call is implicitly terminated by a new call coming in on the
same connection channel instead of a terminating ACK packet.

In such a case, rxrpc_input_implicit_end_call() calls
__rxrpc_disconnect_call(), which is now (incorrectly) setting the
disconnection bit, meaning that when rxrpc_release_call() is later called,
it doesn't call rxrpc_disconnect_call() and so the call isn't removed from
the peer's error distribution list and the list gets corrupted.

KASAN finds the issue as an access after release on a call, but the
position at which it occurs is confusing as it appears to be related to a
different call (the call site is where the latter call is being removed
from the error distribution list and either the next or pprev pointer
points to a previously released call).

Fix this by moving the setting of the flag from __rxrpc_disconnect_call()
to rxrpc_disconnect_call() in the same place that the connection pointer
was being cleared.

Fixes: 5273a191dc ("rxrpc: Fix NULL pointer deref due to call->conn being cleared on disconnect")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-07 11:19:38 +01:00
Al Viro
2710c957a8 fs_parse: get rid of ->enums
Don't do a single array; attach them to fsparam_enum() entry
instead.  And don't bother trying to embed the names into those -
it actually loses memory, with no real speedup worth mentioning.

Simplifies validation as well.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-02-07 00:12:50 -05:00
Dan Carpenter
7a02ea6502 net: sched: prevent a use after free
The bug is that we call kfree_skb(skb) and then pass "skb" to
qdisc_pkt_len(skb) on the next line, which is a use after free.
Also Cong Wang points out that it's better to delay the actual
frees until we drop the rtnl lock so we should use rtnl_kfree_skbs()
instead of kfree_skb().

Cc: Cong Wang <xiyou.wangcong@gmail.com>
Fixes: ec97ecf1eb ("net: sched: add Flow Queue PIE packet scheduler")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-06 14:00:12 +01:00
Qian Cai
86b18aaa2b skbuff: fix a data race in skb_queue_len()
sk_buff.qlen can be accessed concurrently as noticed by KCSAN,

 BUG: KCSAN: data-race in __skb_try_recv_from_queue / unix_dgram_sendmsg

 read to 0xffff8a1b1d8a81c0 of 4 bytes by task 5371 on cpu 96:
  unix_dgram_sendmsg+0x9a9/0xb70 include/linux/skbuff.h:1821
				 net/unix/af_unix.c:1761
  ____sys_sendmsg+0x33e/0x370
  ___sys_sendmsg+0xa6/0xf0
  __sys_sendmsg+0x69/0xf0
  __x64_sys_sendmsg+0x51/0x70
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

 write to 0xffff8a1b1d8a81c0 of 4 bytes by task 1 on cpu 99:
  __skb_try_recv_from_queue+0x327/0x410 include/linux/skbuff.h:2029
  __skb_try_recv_datagram+0xbe/0x220
  unix_dgram_recvmsg+0xee/0x850
  ____sys_recvmsg+0x1fb/0x210
  ___sys_recvmsg+0xa2/0xf0
  __sys_recvmsg+0x66/0xf0
  __x64_sys_recvmsg+0x51/0x70
  do_syscall_64+0x91/0xb47
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Since only the read is operating as lockless, it could introduce a logic
bug in unix_recvq_full() due to the load tearing. Fix it by adding
a lockless variant of skb_queue_len() and unix_recvq_full() where
READ_ONCE() is on the read while WRITE_ONCE() is on the write similar to
the commit d7d16a8935 ("net: add skb_queue_empty_lockless()").

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-06 13:59:10 +01:00
Nicolas Dichtel
f1ed10264e vti[6]: fix packet tx through bpf_redirect() in XinY cases
I forgot the 4in6/6in4 cases in my previous patch. Let's fix them.

Fixes: 95224166a9 ("vti[6]: fix packet tx through bpf_redirect()")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-02-06 13:27:30 +01:00
Linus Torvalds
4c46bef2e9 We have:
- a set of patches that fixes various corner cases in mount and umount
   code (Xiubo Li).  This has to do with choosing an MDS, distinguishing
   between laggy and down MDSes and parsing the server path.
 
 - inode initialization fixes (Jeff Layton).  The one included here
   mostly concerns things like open_by_handle() and there is another
   one that will come through Al.
 
 - copy_file_range() now uses the new copy-from2 op (Luis Henriques).
   The existing copy-from op turned out to be infeasible for generic
   filesystem use; we disable the copy offload if OSDs don't support
   copy-from2.
 
 - a patch to link "rbd" and "block" devices together in sysfs (Hannes
   Reinecke)
 
 And a smattering of cleanups from Xiubo, Jeff and Chengguang.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl47PUcTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi6LoCACmVli5N6bgnBE4sTixi/jz6aCCbk32
 ZPlKiSesHnOGkY6KXHJT58JYy0paITBRik5ypdz06J8aCOtWyPLbn3uCemF9CYn2
 g6dId2Lf5vGFrgSm4YSiqp9a86IZmYSDG41LbJD/IJWFDWdMWqNPMDqji6yaIO5O
 NJI5N0tk+VFXdV+JyjV9X/FnP1r1D2ReZzz21ZiqTJXSmE8YIkioLjkq36QTMMG7
 Gm5qdlc1x2r4qfzA1g+OiWgRQCUMgkuYerFzus4mVbW4hrphsavH2DArbOwFmsXF
 46hOq+1uGVVyZILLJfKNiktf1GExBF0icbSREJtmjUHbQvNR8BH0C+fV
 =vvIc
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-client

Pull ceph fixes from Ilya Dryomov:

 - a set of patches that fixes various corner cases in mount and umount
   code (Xiubo Li). This has to do with choosing an MDS, distinguishing
   between laggy and down MDSes and parsing the server path.

 - inode initialization fixes (Jeff Layton). The one included here
   mostly concerns things like open_by_handle() and there is another one
   that will come through Al.

 - copy_file_range() now uses the new copy-from2 op (Luis Henriques).
   The existing copy-from op turned out to be infeasible for generic
   filesystem use; we disable the copy offload if OSDs don't support
   copy-from2.

 - a patch to link "rbd" and "block" devices together in sysfs (Hannes
   Reinecke)

... and a smattering of cleanups from Xiubo, Jeff and Chengguang.

* tag 'ceph-for-5.6-rc1' of https://github.com/ceph/ceph-client: (25 commits)
  rbd: set the 'device' link in sysfs
  ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
  ceph: print name of xattr in __ceph_{get,set}xattr() douts
  ceph: print r_direct_hash in hex in __choose_mds() dout
  ceph: use copy-from2 op in copy_file_range
  ceph: close holes in structs ceph_mds_session and ceph_mds_request
  rbd: work around -Wuninitialized warning
  ceph: allocate the correct amount of extra bytes for the session features
  ceph: rename get_session and switch to use ceph_get_mds_session
  ceph: remove the extra slashes in the server path
  ceph: add possible_max_rank and make the code more readable
  ceph: print dentry offset in hex and fix xattr_version type
  ceph: only touch the caps which have the subset mask requested
  ceph: don't clear I_NEW until inode metadata is fully populated
  ceph: retry the same mds later after the new session is opened
  ceph: check availability of mds cluster on mount after wait timeout
  ceph: keep the session state until it is released
  ceph: add __send_request helper
  ceph: ensure we have a new cap before continuing in fill_inode
  ceph: drop unused ttl_from parameter from fill_inode
  ...
2020-02-06 12:21:01 +00:00
Florian Westphal
b0519de8b3 mptcp: fix use-after-free for ipv6
Turns out that when we accept a new subflow, the newly created
inet_sk(tcp_sk)->pinet6 points at the ipv6_pinfo structure of the
listener socket.

This wasn't caught by the selftest because it closes the accepted fd
before the listening one.

adding a close(listenfd) after accept returns is enough:
 BUG: KASAN: use-after-free in inet6_getname+0x6ba/0x790
 Read of size 1 at addr ffff88810e310866 by task mptcp_connect/2518
 Call Trace:
  inet6_getname+0x6ba/0x790
  __sys_getpeername+0x10b/0x250
  __x64_sys_getpeername+0x6f/0xb0

also alter test program to exercise this.

Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-06 11:25:09 +01:00
Jacob Keller
d5b90e99e1 devlink: report 0 after hitting end in region read
commit fdd41ec21e ("devlink: Return right error code in case of errors
for region read") modified the region read code to report errors
properly in unexpected cases.

In the case where the start_offset and ret_offset match, it unilaterally
converted this into an error. This causes an issue for the "dump"
version of the command. In this case, the devlink region dump will
always report an invalid argument:

000000000000ffd0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
000000000000ffe0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
devlink answers: Invalid argument
000000000000fff0 ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

This occurs because the expected flow for the dump is to return 0 after
there is no further data.

The simplest fix would be to stop converting the error code to -EINVAL
if start_offset == ret_offset. However, avoid unnecessary work by
checking for when start_offset is larger than the region size and
returning 0 upfront.

Fixes: fdd41ec21e ("devlink: Return right error code in case of errors for region read")
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-05 14:23:20 +01:00
Cong Wang
52b5ae501c net_sched: fix a resource leak in tcindex_set_parms()
Jakub noticed there is a potential resource leak in
tcindex_set_parms(): when tcindex_filter_result_init() fails
and it jumps to 'errout1' which doesn't release the memory
and resources allocated by tcindex_alloc_perfect_hash().

We should just jump to 'errout_alloc' which calls
tcindex_free_perfect_hash().

Fixes: b9a24bb76b ("net_sched: properly handle failure case of tcf_exts_init()")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-05 14:11:57 +01:00
Florian Westphal
2c22c06ce4 mptcp: fix use-after-free on tcp fallback
When an mptcp socket connects to a tcp peer or when a middlebox interferes
with tcp options, mptcp needs to fall back to plain tcp.
Problem is that mptcp is trying to be too clever in this case:

It attempts to close the mptcp meta sk and transparently replace it with
the (only) subflow tcp sk.

Unfortunately, this is racy -- the socket is already exposed to userspace.
Any parallel calls to send/recv/setsockopt etc. can cause use-after-free:

BUG: KASAN: use-after-free in atomic_try_cmpxchg include/asm-generic/atomic-instrumented.h:693 [inline]
CPU: 1 PID: 2083 Comm: syz-executor.1 Not tainted 5.5.0 #2
 atomic_try_cmpxchg include/asm-generic/atomic-instrumented.h:693 [inline]
 queued_spin_lock include/asm-generic/qspinlock.h:78 [inline]
 do_raw_spin_lock include/linux/spinlock.h:181 [inline]
 __raw_spin_lock_bh include/linux/spinlock_api_smp.h:136 [inline]
 _raw_spin_lock_bh+0x71/0xd0 kernel/locking/spinlock.c:175
 spin_lock_bh include/linux/spinlock.h:343 [inline]
 __lock_sock+0x105/0x190 net/core/sock.c:2414
 lock_sock_nested+0x10f/0x140 net/core/sock.c:2938
 lock_sock include/net/sock.h:1516 [inline]
 mptcp_setsockopt+0x2f/0x1f0 net/mptcp/protocol.c:800
 __sys_setsockopt+0x152/0x240 net/socket.c:2130
 __do_sys_setsockopt net/socket.c:2146 [inline]
 __se_sys_setsockopt net/socket.c:2143 [inline]
 __x64_sys_setsockopt+0xba/0x150 net/socket.c:2143
 do_syscall_64+0xb7/0x3d0 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

While the use-after-free can be resolved, there is another problem:
sock->ops and sock->sk assignments are not atomic, i.e. we may get calls
into mptcp functions with sock->sk already pointing at the subflow socket,
or calls into tcp functions with a mptcp meta sk.

Remove the fallback code and call the relevant functions for the (only)
subflow in case the mptcp socket is connected to tcp peer.

Reported-by: Christoph Paasch <cpaasch@apple.com>
Diagnosed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-05 14:08:03 +01:00
Alex Shi
2ade42d88f Bluetooth: remove __get_channel/dir and __dir
These 3 macros are never used from first git commit Linux-2.6.12-rc2.
let's remove them.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-05 09:07:30 +01:00
Hillf Danton
2a154903ce Bluetooth: prefetch channel before killing sock
Prefetch channel before killing sock in order to fix UAF like

 BUG: KASAN: use-after-free in l2cap_sock_release+0x24c/0x290 net/bluetooth/l2cap_sock.c:1212
 Read of size 8 at addr ffff8880944904a0 by task syz-fuzzer/9751

Reported-by: syzbot+c3c5bdea7863886115dc@syzkaller.appspotmail.com
Fixes: 6c08fc896b ("Bluetooth: Fix refcount use-after-free issue")
Cc: Manish Mandlik <mmandlik@google.com>
Signed-off-by: Hillf Danton <hdanton@sina.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-05 09:05:04 +01:00
Linus Torvalds
33b40134e5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Use after free in rxrpc_put_local(), from David Howells.

 2) Fix 64-bit division error in mlxsw, from Nathan Chancellor.

 3) Make sure we clear various bits of TCP state in response to
    tcp_disconnect(). From Eric Dumazet.

 4) Fix netlink attribute policy in cls_rsvp, from Eric Dumazet.

 5) txtimer must be deleted in stmmac suspend(), from Nicolin Chen.

 6) Fix TC queue mapping in bnxt_en driver, from Michael Chan.

 7) Various netdevsim fixes from Taehee Yoo (use of uninitialized data,
    snapshot panics, stack out of bounds, etc.)

 8) cls_tcindex changes hash table size after allocating the table, fix
    from Cong Wang.

 9) Fix regression in the enforcement of session ID uniqueness in l2tp.
    We only have to enforce uniqueness for IP based tunnels not UDP
    ones. From Ridge Kennedy.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (46 commits)
  gtp: use __GFP_NOWARN to avoid memalloc warning
  l2tp: Allow duplicate session creation with UDP
  r8152: Add MAC passthrough support to new device
  net_sched: fix an OOB access in cls_tcindex
  qed: Remove set but not used variable 'p_link'
  tc-testing: add missing 'nsPlugin' to basic.json
  tc-testing: fix eBPF tests failure on linux fresh clones
  net: hsr: fix possible NULL deref in hsr_handle_frame()
  netdevsim: remove unused sdev code
  netdevsim: use __GFP_NOWARN to avoid memalloc warning
  netdevsim: use IS_ERR instead of IS_ERR_OR_NULL for debugfs
  netdevsim: fix stack-out-of-bounds in nsim_dev_debugfs_init()
  netdevsim: fix panic in nsim_dev_take_snapshot_write()
  netdevsim: disable devlink reload when resources are being used
  netdevsim: fix using uninitialized resources
  bnxt_en: Fix TC queue mapping.
  bnxt_en: Fix logic that disables Bus Master during firmware reset.
  bnxt_en: Fix RDMA driver failure with SRIOV after firmware reset.
  bnxt_en: Refactor logic to re-enable SRIOV after firmware reset detected.
  net: stmmac: Delete txtimer in suspend()
  ...
2020-02-04 13:32:20 +00:00
Raed Salem
03891f820c xfrm: handle NETDEV_UNREGISTER for xfrm device
This patch to handle the asynchronous unregister
device event so the device IPsec offload resources
could be cleanly released.

Fixes: e4db5b61c5 ("xfrm: policy: remove pcpu policy cache")
Signed-off-by: Raed Salem <raeds@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-02-04 12:43:46 +01:00
Ridge Kennedy
0d0d9a388a l2tp: Allow duplicate session creation with UDP
In the past it was possible to create multiple L2TPv3 sessions with the
same session id as long as the sessions belonged to different tunnels.
The resulting sessions had issues when used with IP encapsulated tunnels,
but worked fine with UDP encapsulated ones. Some applications began to
rely on this behaviour to avoid having to negotiate unique session ids.

Some time ago a change was made to require session ids to be unique across
all tunnels, breaking the applications making use of this "feature".

This change relaxes the duplicate session id check to allow duplicates
if both of the colliding sessions belong to UDP encapsulated tunnels.

Fixes: dbdbc73b44 ("l2tp: fix duplicate session creation")
Signed-off-by: Ridge Kennedy <ridge.kennedy@alliedtelesis.co.nz>
Acked-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-04 12:35:49 +01:00
Cong Wang
599be01ee5 net_sched: fix an OOB access in cls_tcindex
As Eric noticed, tcindex_alloc_perfect_hash() uses cp->hash
to compute the size of memory allocation, but cp->hash is
set again after the allocation, this caused an out-of-bound
access.

So we have to move all cp->hash initialization and computation
before the memory allocation. Move cp->mask and cp->shift together
as cp->hash may need them for computation too.

Reported-and-tested-by: syzbot+35d4dea36c387813ed31@syzkaller.appspotmail.com
Fixes: 331b72922c ("net: sched: RCU cls_tcindex")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-04 11:41:36 +01:00
Eric Dumazet
2b5b8251bc net: hsr: fix possible NULL deref in hsr_handle_frame()
hsr_port_get_rcu() can return NULL, so we need to be careful.

general protection fault, probably for non-canonical address 0xdffffc0000000006: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000030-0x0000000000000037]
CPU: 1 PID: 10249 Comm: syz-executor.5 Not tainted 5.5.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__read_once_size include/linux/compiler.h:199 [inline]
RIP: 0010:hsr_addr_is_self+0x86/0x330 net/hsr/hsr_framereg.c:44
Code: 04 00 f3 f3 f3 65 48 8b 04 25 28 00 00 00 48 89 45 d0 31 c0 e8 6b ff 94 f9 4c 89 f2 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 <80> 3c 02 00 0f 85 75 02 00 00 48 8b 43 30 49 39 c6 49 89 47 c0 0f
RSP: 0018:ffffc90000da8a90 EFLAGS: 00010206
RAX: dffffc0000000000 RBX: 0000000000000000 RCX: ffffffff87e0cc33
RDX: 0000000000000006 RSI: ffffffff87e035d5 RDI: 0000000000000000
RBP: ffffc90000da8b20 R08: ffff88808e7de040 R09: ffffed1015d2707c
R10: ffffed1015d2707b R11: ffff8880ae9383db R12: ffff8880a689bc5e
R13: 1ffff920001b5153 R14: 0000000000000030 R15: ffffc90000da8af8
FS:  00007fd7a42be700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b32338000 CR3: 00000000a928c000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 hsr_handle_frame+0x1c5/0x630 net/hsr/hsr_slave.c:31
 __netif_receive_skb_core+0xfbc/0x30b0 net/core/dev.c:5099
 __netif_receive_skb_one_core+0xa8/0x1a0 net/core/dev.c:5196
 __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5312
 process_backlog+0x206/0x750 net/core/dev.c:6144
 napi_poll net/core/dev.c:6582 [inline]
 net_rx_action+0x508/0x1120 net/core/dev.c:6650
 __do_softirq+0x262/0x98c kernel/softirq.c:292
 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
 </IRQ>

Fixes: c5a7591172 ("net/hsr: Use list_head (and rcu) instead of array for slave devices.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-04 09:27:07 +01:00
Masahiro Yamada
45586c7078 treewide: remove redundant IS_ERR() before error code check
'PTR_ERR(p) == -E*' is a stronger condition than IS_ERR(p).
Hence, IS_ERR(p) is unneeded.

The semantic patch that generates this commit is as follows:

// <smpl>
@@
expression ptr;
constant error_code;
@@
-IS_ERR(ptr) && (PTR_ERR(ptr) == - error_code)
+PTR_ERR(ptr) == - error_code
// </smpl>

Link: http://lkml.kernel.org/r/20200106045833.1725-1-masahiroy@kernel.org
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Cc: Julia Lawall <julia.lawall@lip6.fr>
Acked-by: Stephen Boyd <sboyd@kernel.org> [drivers/clk/clk.c]
Acked-by: Bartosz Golaszewski <bgolaszewski@baylibre.com> [GPIO]
Acked-by: Wolfram Sang <wsa@the-dreams.de> [drivers/i2c]
Acked-by: Rafael J. Wysocki <rafael.j.wysocki@intel.com> [acpi/scan.c]
Acked-by: Rob Herring <robh@kernel.org>
Cc: Eric Biggers <ebiggers@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:27 +00:00
Alexey Dobriyan
97a32539b9 proc: convert everything to "struct proc_ops"
The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
seq_file.h.

Conversion rule is:

	llseek		=> proc_lseek
	unlocked_ioctl	=> proc_ioctl

	xxx		=> proc_xxx

	delete ".owner = THIS_MODULE" line

[akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
[sfr@canb.auug.org.au: fix kernel/sched/psi.c]
  Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-02-04 03:05:26 +00:00
Trond Myklebust
7ccbddbe3e SUNRPC: Use kmemdup_nul() in rpc_parse_scope_id()
Using kmemdup_nul() is more efficient when the length is known.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-02-03 16:35:07 -05:00
Jakub Kicinski
3d80c653f9 RxRPC fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl439W0ACgkQ+7dXa6fL
 C2uT5RAAl+rytJS768Aa6c/S9TPBaNOKzFQmgNaWzAtgC5BAtfZf1Bl0okFOba2V
 QtcFmTogRl7BEnf9S3w3z25qj7Nv5udjIaAbH7cO6j1vuIwRpvtEZ7WXlE3L4hxA
 DscSysqI9L5ISsnzfloUvA/biA9azsH2ckgMxFo++YmLHvagXW0lmkE4yZw+aZ/T
 DstFAMAnFQwEAyu+Et3bo32/382aJh+gBGgV/2wuHBPgzQMhiWdosKdNnEZtZ3V2
 HvgVuAU2V4hOf8OuaPuBnZrJPondTq5e1W+5mQiYLYzTCTOs6rZ1gUGfEXth3d1H
 ABJ0FoyPgN783msb0yL6OOLMWiTc9USiLB8a3U2vJOD+hQTkPSYSGsCObeIGVW/Y
 dbBd5fwudF78LIP9TjYWbau3vHtV73N4Mc1UT219jew0+Hi1ik5VIj8q0wdBIkzm
 vKcVznXnPReOUBVKmqA2scnXs8EA4w6bTCjyd3JYUBK2qLchz366s+0oYyIK8Nq0
 dy8TVISaCSfohI4nAYAv90AUIDa+zdHKuttBsAD90yVM/wrnokqc/RonfWjHs32k
 SerM0RclPidool3Hkc/w0uvPywDQg/S5sKAW6E6lpfTppOwIY0wq/yZc5N2d9FZV
 UrPZVGhQszC7AM/CBKXDT0ZMrPnAfGO8PMPLyFrAcv2p9O7XuWA=
 =kHIG
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-fixes-20200203' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
RxRPC fixes

Here are a number of fixes for AF_RXRPC:

 (1) Fix a potential use after free in rxrpc_put_local() where it was
     accessing the object just put to get tracing information.

 (2) Fix insufficient notifications being generated by the function that
     queues data packets on a call.  This occasionally causes recvmsg() to
     stall indefinitely.

 (3) Fix a number of packet-transmitting work functions to hold an active
     count on the local endpoint so that the UDP socket doesn't get
     destroyed whilst they're calling kernel_sendmsg() on it.

 (4) Fix a NULL pointer deref that stemmed from a call's connection pointer
     being cleared when the call was disconnected.

Changes:

 v2: Removed a couple of BUG() statements that got added.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-02-03 10:26:23 -08:00
Masahiro Yamada
5f2fb52fac kbuild: rename hostprogs-y/always to hostprogs/always-y
In old days, the "host-progs" syntax was used for specifying host
programs. It was renamed to the current "hostprogs-y" in 2004.

It is typically useful in scripts/Makefile because it allows Kbuild to
selectively compile host programs based on the kernel configuration.

This commit renames like follows:

  always       ->  always-y
  hostprogs-y  ->  hostprogs

So, scripts/Makefile will look like this:

  always-$(CONFIG_BUILD_BIN2C) += ...
  always-$(CONFIG_KALLSYMS)    += ...
      ...
  hostprogs := $(always-y) $(always-m)

I think this makes more sense because a host program is always a host
program, irrespective of the kernel configuration. We want to specify
which ones to compile by CONFIG options, so always-y will be handier.

The "always", "hostprogs-y", "hostprogs-m" will be kept for backward
compatibility for a while.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
2020-02-04 01:53:07 +09:00
David Howells
5273a191dc rxrpc: Fix NULL pointer deref due to call->conn being cleared on disconnect
When a call is disconnected, the connection pointer from the call is
cleared to make sure it isn't used again and to prevent further attempted
transmission for the call.  Unfortunately, there might be a daemon trying
to use it at the same time to transmit a packet.

Fix this by keeping call->conn set, but setting a flag on the call to
indicate disconnection instead.

Remove also the bits in the transmission functions where the conn pointer is
checked and a ref taken under spinlock as this is now redundant.

Fixes: 8d94aa381d ("rxrpc: Calls shouldn't hold socket refs")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-02-03 10:25:30 +00:00
SeongJae Park
9603d47bad tcp: Reduce SYN resend delay if a suspicous ACK is received
When closing a connection, the two acks that required to change closing
socket's status to FIN_WAIT_2 and then TIME_WAIT could be processed in
reverse order.  This is possible in RSS disabled environments such as a
connection inside a host.

For example, expected state transitions and required packets for the
disconnection will be similar to below flow.

	 00 (Process A)				(Process B)
	 01 ESTABLISHED				ESTABLISHED
	 02 close()
	 03 FIN_WAIT_1
	 04 		---FIN-->
	 05 					CLOSE_WAIT
	 06 		<--ACK---
	 07 FIN_WAIT_2
	 08 		<--FIN/ACK---
	 09 TIME_WAIT
	 10 		---ACK-->
	 11 					LAST_ACK
	 12 CLOSED				CLOSED

In some cases such as LINGER option applied socket, the FIN and FIN/ACK
will be substituted to RST and RST/ACK, but there is no difference in
the main logic.

The acks in lines 6 and 8 are the acks.  If the line 8 packet is
processed before the line 6 packet, it will be just ignored as it is not
a expected packet, and the later process of the line 6 packet will
change the status of Process A to FIN_WAIT_2, but as it has already
handled line 8 packet, it will not go to TIME_WAIT and thus will not
send the line 10 packet to Process B.  Thus, Process B will left in
CLOSE_WAIT status, as below.

	 00 (Process A)				(Process B)
	 01 ESTABLISHED				ESTABLISHED
	 02 close()
	 03 FIN_WAIT_1
	 04 		---FIN-->
	 05 					CLOSE_WAIT
	 06 				(<--ACK---)
	 07	  			(<--FIN/ACK---)
	 08 				(fired in right order)
	 09 		<--FIN/ACK---
	 10 		<--ACK---
	 11 		(processed in reverse order)
	 12 FIN_WAIT_2

Later, if the Process B sends SYN to Process A for reconnection using
the same port, Process A will responds with an ACK for the last flow,
which has no increased sequence number.  Thus, Process A will send RST,
wait for TIMEOUT_INIT (one second in default), and then try
reconnection.  If reconnections are frequent, the one second latency
spikes can be a big problem.  Below is a tcpdump results of the problem:

    14.436259 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [S], seq 2560603644
    14.436266 IP 127.0.0.1.4242 > 127.0.0.1.45150: Flags [.], ack 5, win 512
    14.436271 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [R], seq 2541101298
    /* ONE SECOND DELAY */
    15.464613 IP 127.0.0.1.45150 > 127.0.0.1.4242: Flags [S], seq 2560603644

This commit mitigates the problem by reducing the delay for the next SYN
if the suspicous ACK is received while in SYN_SENT state.

Following commit will add a selftest, which can be also helpful for
understanding of this issue.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-02-02 13:33:21 -08:00
Jakub Kicinski
b7c3a17c60 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Fix suspicious RCU usage in ipset, from Jozsef Kadlecsik.

2) Use kvcalloc, from Joe Perches.

3) Flush flowtable hardware workqueue after garbage collection run,
   from Paul Blakey.

4) Missing flowtable hardware workqueue flush from nf_flow_table_free(),
   also from Paul.

5) Restore NF_FLOW_HW_DEAD in flow_offload_work_del(), from Paul.

6) Flowtable documentation fixes, from Matteo Croce.
====================

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-02-01 12:38:20 -08:00
Eric Dumazet
cb3c0e6bdf cls_rsvp: fix rsvp_policy
NLA_BINARY can be confusing, since .len value represents
the max size of the blob.

cls_rsvp really wants user space to provide long enough data
for TCA_RSVP_DST and TCA_RSVP_SRC attributes.

BUG: KMSAN: uninit-value in rsvp_get net/sched/cls_rsvp.h:258 [inline]
BUG: KMSAN: uninit-value in gen_handle net/sched/cls_rsvp.h:402 [inline]
BUG: KMSAN: uninit-value in rsvp_change+0x1ae9/0x4220 net/sched/cls_rsvp.h:572
CPU: 1 PID: 13228 Comm: syz-executor.1 Not tainted 5.5.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x220 lib/dump_stack.c:118
 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118
 __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
 rsvp_get net/sched/cls_rsvp.h:258 [inline]
 gen_handle net/sched/cls_rsvp.h:402 [inline]
 rsvp_change+0x1ae9/0x4220 net/sched/cls_rsvp.h:572
 tc_new_tfilter+0x31fe/0x5010 net/sched/cls_api.c:2104
 rtnetlink_rcv_msg+0xcb7/0x1570 net/core/rtnetlink.c:5415
 netlink_rcv_skb+0x451/0x650 net/netlink/af_netlink.c:2477
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5442
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0xf9e/0x1100 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x1248/0x14d0 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:639 [inline]
 sock_sendmsg net/socket.c:659 [inline]
 ____sys_sendmsg+0x12b6/0x1350 net/socket.c:2330
 ___sys_sendmsg net/socket.c:2384 [inline]
 __sys_sendmsg+0x451/0x5f0 net/socket.c:2417
 __do_sys_sendmsg net/socket.c:2426 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2424
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2424
 do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45b349
Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f269d43dc78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f269d43e6d4 RCX: 000000000045b349
RDX: 0000000000000000 RSI: 00000000200001c0 RDI: 0000000000000003
RBP: 000000000075bfc8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 00000000000009c2 R14: 00000000004cb338 R15: 000000000075bfd4

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:144 [inline]
 kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:127
 kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:82
 slab_alloc_node mm/slub.c:2774 [inline]
 __kmalloc_node_track_caller+0xb40/0x1200 mm/slub.c:4382
 __kmalloc_reserve net/core/skbuff.c:141 [inline]
 __alloc_skb+0x2fd/0xac0 net/core/skbuff.c:209
 alloc_skb include/linux/skbuff.h:1049 [inline]
 netlink_alloc_large_skb net/netlink/af_netlink.c:1174 [inline]
 netlink_sendmsg+0x7d3/0x14d0 net/netlink/af_netlink.c:1892
 sock_sendmsg_nosec net/socket.c:639 [inline]
 sock_sendmsg net/socket.c:659 [inline]
 ____sys_sendmsg+0x12b6/0x1350 net/socket.c:2330
 ___sys_sendmsg net/socket.c:2384 [inline]
 __sys_sendmsg+0x451/0x5f0 net/socket.c:2417
 __do_sys_sendmsg net/socket.c:2426 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2424
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2424
 do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 6fa8c0144b ("[NET_SCHED]: Use nla_policy for attribute validation in classifiers")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-02-01 12:25:06 -08:00
Eric Dumazet
784f8344de tcp: clear tp->segs_{in|out} in tcp_disconnect()
tp->segs_in and tp->segs_out need to be cleared in tcp_disconnect().

tcp_disconnect() is rarely used, but it is worth fixing it.

Fixes: 2efd055c53 ("tcp: add tcpi_segs_in and tcpi_segs_out to tcp_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Marcelo Ricardo Leitner <mleitner@redhat.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-01-31 22:12:37 -08:00
Eric Dumazet
db7ffee6f3 tcp: clear tp->data_segs{in|out} in tcp_disconnect()
tp->data_segs_in and tp->data_segs_out need to be cleared
in tcp_disconnect().

tcp_disconnect() is rarely used, but it is worth fixing it.

Fixes: a44d6eacda ("tcp: Add RFC4898 tcpEStatsPerfDataSegsOut/In")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-01-31 22:12:22 -08:00
Eric Dumazet
2fbdd56251 tcp: clear tp->delivered in tcp_disconnect()
tp->delivered needs to be cleared in tcp_disconnect().

tcp_disconnect() is rarely used, but it is worth fixing it.

Fixes: ddf1af6fa0 ("tcp: new delivery accounting")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-01-31 22:12:18 -08:00
Eric Dumazet
c13c48c00a tcp: clear tp->total_retrans in tcp_disconnect()
total_retrans needs to be cleared in tcp_disconnect().

tcp_disconnect() is rarely used, but it is worth fixing it.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: SeongJae Park <sjpark@amazon.de>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-01-31 22:11:35 -08:00
Linus Torvalds
7eec11d3a7 Merge branch 'akpm' (patches from Andrew)
Pull updates from Andrew Morton:
 "Most of -mm and quite a number of other subsystems: hotfixes, scripts,
  ocfs2, misc, lib, binfmt, init, reiserfs, exec, dma-mapping, kcov.

  MM is fairly quiet this time.  Holidays, I assume"

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (118 commits)
  kcov: ignore fault-inject and stacktrace
  include/linux/io-mapping.h-mapping: use PHYS_PFN() macro in io_mapping_map_atomic_wc()
  execve: warn if process starts with executable stack
  reiserfs: prevent NULL pointer dereference in reiserfs_insert_item()
  init/main.c: fix misleading "This architecture does not have kernel memory protection" message
  init/main.c: fix quoted value handling in unknown_bootoption
  init/main.c: remove unnecessary repair_env_string in do_initcall_level
  init/main.c: log arguments and environment passed to init
  fs/binfmt_elf.c: coredump: allow process with empty address space to coredump
  fs/binfmt_elf.c: coredump: delete duplicated overflow check
  fs/binfmt_elf.c: coredump: allocate core ELF header on stack
  fs/binfmt_elf.c: make BAD_ADDR() unlikely
  fs/binfmt_elf.c: better codegen around current->mm
  fs/binfmt_elf.c: don't copy ELF header around
  fs/binfmt_elf.c: fix ->start_code calculation
  fs/binfmt_elf.c: smaller code generation around auxv vector fill
  lib/find_bit.c: uninline helper _find_next_bit()
  lib/find_bit.c: join _find_next_bit{_le}
  uapi: rename ext2_swab() to swab() and share globally in swab.h
  lib/scatterlist.c: adjust indentation in __sg_alloc_table
  ...
2020-01-31 12:16:36 -08:00
Paul Blakey
c22208b7ce netfilter: flowtable: Fix setting forgotten NF_FLOW_HW_DEAD flag
During the refactor this was accidently removed.

Fixes: ae29045018 ("netfilter: flowtable: add nf_flow_offload_tuple() helper")
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-31 19:31:42 +01:00
Paul Blakey
0f34f30a1b netfilter: flowtable: Fix missing flush hardware on table free
If entries exist when freeing a hardware offload enabled table,
we queue work for hardware while running the gc iteration.

Execute it (flush) after queueing.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-31 19:31:41 +01:00
Paul Blakey
91bfaa15a3 netfilter: flowtable: Fix hardware flush order on nf_flow_table_cleanup
On netdev down event, nf_flow_table_cleanup() is called for the relevant
device and it cleans all the tables that are on that device.
If one of those tables has hardware offload flag,
nf_flow_table_iterate_cleanup flushes hardware and then runs the gc.
But the gc can queue more hardware work, which will take time to execute.

Instead first add the work, then flush it, to execute it now.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-31 19:31:40 +01:00
Joe Perches
b9e0102a57 netfilter: Use kvcalloc
Convert the uses of kvmalloc_array with __GFP_ZERO to
the equivalent kvcalloc.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-31 19:30:54 +01:00
John Hubbard
f1f6a7dd9b mm, tree-wide: rename put_user_page*() to unpin_user_page*()
In order to provide a clearer, more symmetric API for pinning and
unpinning DMA pages.  This way, pin_user_pages*() calls match up with
unpin_user_pages*() calls, and the API is a lot closer to being
self-explanatory.

Link: http://lkml.kernel.org/r/20200107224558.2362728-23-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:38 -08:00
John Hubbard
fb48b4746a net/xdp: set FOLL_PIN via pin_user_pages()
Convert net/xdp to use the new pin_longterm_pages() call, which sets
FOLL_PIN.  Setting FOLL_PIN is now required for code that requires
tracking of pinned pages.

In partial anticipation of this work, the net/xdp code was already calling
put_user_page() instead of put_page().  Therefore, in order to convert
from the get_user_pages()/put_page() model, to the
pin_user_pages()/put_user_page() model, the only change required here is
to change get_user_pages() to pin_user_pages().

Link: http://lkml.kernel.org/r/20200107224558.2362728-18-jhubbard@nvidia.com
Signed-off-by: John Hubbard <jhubbard@nvidia.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Christoph Hellwig <hch@lst.de>
Cc: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Dan Williams <dan.j.williams@intel.com>
Cc: Hans Verkuil <hverkuil-cisco@xs4all.nl>
Cc: Ira Weiny <ira.weiny@intel.com>
Cc: Jan Kara <jack@suse.cz>
Cc: Jason Gunthorpe <jgg@mellanox.com>
Cc: Jason Gunthorpe <jgg@ziepe.ca>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: Jerome Glisse <jglisse@redhat.com>
Cc: Jonathan Corbet <corbet@lwn.net>
Cc: Kirill A. Shutemov <kirill@shutemov.name>
Cc: Leon Romanovsky <leonro@mellanox.com>
Cc: Mauro Carvalho Chehab <mchehab@kernel.org>
Cc: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-01-31 10:30:37 -08:00
David Howells
04d36d748f rxrpc: Fix missing active use pinning of rxrpc_local object
The introduction of a split between the reference count on rxrpc_local
objects and the usage count didn't quite go far enough.  A number of kernel
work items need to make use of the socket to perform transmission.  These
also need to get an active count on the local object to prevent the socket
from being closed.

Fix this by getting the active count in those places.

Also split out the raw active count get/put functions as these places tend
to hold refs on the rxrpc_local object already, so getting and putting an
extra object ref is just a waste of time.

The problem can lead to symptoms like:

    BUG: kernel NULL pointer dereference, address: 0000000000000018
    ..
    CPU: 2 PID: 818 Comm: kworker/u9:0 Not tainted 5.5.0-fscache+ #51
    ...
    RIP: 0010:selinux_socket_sendmsg+0x5/0x13
    ...
    Call Trace:
     security_socket_sendmsg+0x2c/0x3e
     sock_sendmsg+0x1a/0x46
     rxrpc_send_keepalive+0x131/0x1ae
     rxrpc_peer_keepalive_worker+0x219/0x34b
     process_one_work+0x18e/0x271
     worker_thread+0x1a3/0x247
     kthread+0xe6/0xeb
     ret_from_fork+0x1f/0x30

Fixes: 730c5fd42c ("rxrpc: Fix local endpoint refcounting")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-30 21:50:41 +00:00
David Howells
f71dbf2fb2 rxrpc: Fix insufficient receive notification generation
In rxrpc_input_data(), rxrpc_notify_socket() is called if the base sequence
number of the packet is immediately following the hard-ack point at the end
of the function.  However, this isn't sufficient, since the recvmsg side
may have been advancing the window and then overrun the position in which
we're adding - at which point rx_hard_ack >= seq0 and no notification is
generated.

Fix this by always generating a notification at the end of the input
function.

Without this, a long call may stall, possibly indefinitely.

Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-30 21:50:41 +00:00
David Howells
fac20b9e73 rxrpc: Fix use-after-free in rxrpc_put_local()
Fix rxrpc_put_local() to not access local->debug_id after calling
atomic_dec_return() as, unless that returned n==0, we no longer have the
right to access the object.

Fixes: 06d9532fa6 ("rxrpc: Fix read-after-free in rxrpc_queue_local()")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-01-30 21:50:41 +00:00
Linus Torvalds
1c715a659a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Various mptcp fixupes from Florian Westphal and Geery Uytterhoeven.

 2) Don't clear the node/port GUIDs after we've assigned the correct
    values to them. From Leon Romanovsky.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net:
  net/core: Do not clear VF index for node/port GUIDs query
  mptcp: Fix undefined mptcp_handle_ipv6_mapped for modular IPV6
  net: drop_monitor: Use kstrdup
  udp: document udp_rcv_segment special case for looped packets
  mptcp: MPTCP_HMAC_TEST should depend on MPTCP
  mptcp: Fix incorrect IPV6 dependency check
  Revert "MAINTAINERS: mptcp@ mailing list is moderated"
  mptcp: handle tcp fallback when using syn cookies
  mptcp: avoid a lockdep splat when mcast group was joined
  mptcp: fix panic on user pointer access
  mptcp: defer freeing of cached ext until last moment
  net: mvneta: fix XDP support if sw bm is used as fallback
  sch_choke: Use kvcalloc
  mptcp: Fix build with PROC_FS disabled.
  MAINTAINERS: mptcp@ mailing list is moderated
2020-01-30 07:42:00 -08:00
Leon Romanovsky
9fbf082f56 net/core: Do not clear VF index for node/port GUIDs query
VF numbers were assigned to node_guid and port_guid, but cleared
right before such query calls were issued. It caused to return
node/port GUIDs of VF index 0 for all VFs.

Fixes: 30aad41721 ("net/core: Add support for getting VF GUIDs")
Reported-by: Adrian Chiris <adrianc@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-30 15:20:26 +01:00
Geert Uytterhoeven
31484d56ca mptcp: Fix undefined mptcp_handle_ipv6_mapped for modular IPV6
If CONFIG_MPTCP=y, CONFIG_MPTCP_IPV6=n, and CONFIG_IPV6=m:

    ERROR: "mptcp_handle_ipv6_mapped" [net/ipv6/ipv6.ko] undefined!

This does not happen if CONFIG_MPTCP_IPV6=y, as CONFIG_MPTCP_IPV6
selects CONFIG_IPV6, and thus forces CONFIG_IPV6 builtin.

As exporting a symbol for an empty function would be a bit wasteful, fix
this by providing a dummy version of mptcp_handle_ipv6_mapped() for the
CONFIG_MPTCP_IPV6=n case.

Rename mptcp_handle_ipv6_mapped() to mptcpv6_handle_mapped(), to make it
clear this is a pure-IPV6 function, just like mptcpv6_init().

Fixes: cec37a6e41 ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-30 10:55:54 +01:00
Joe Perches
72d62c4e42 net: drop_monitor: Use kstrdup
Convert the equivalent but rather odd uses of kmemdup with
__GFP_ZERO to the more common kstrdup and avoid unnecessary
zeroing of copied over memory.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-30 10:03:37 +01:00
Geert Uytterhoeven
389b8fb3c4 mptcp: MPTCP_HMAC_TEST should depend on MPTCP
As the MPTCP HMAC test is integrated into the MPTCP code, it can be
built only when MPTCP is enabled.  Hence when MPTCP is disabled, asking
the user if the test code should be enabled is futile.

Wrap the whole block of MPTCP-specific config options inside a check for
MPTCP.  While at it, drop the "default n" for MPTCP_HMAC_TEST, as that
is the default anyway.

Fixes: 65492c5a6a ("mptcp: move from sha1 (v0) to sha256 (v1)")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-30 09:55:37 +01:00
Geert Uytterhoeven
8e1974a2a0 mptcp: Fix incorrect IPV6 dependency check
If CONFIG_MPTCP=y, CONFIG_MPTCP_IPV6=n, and CONFIG_IPV6=m:

    net/mptcp/protocol.o: In function `__mptcp_tcp_fallback':
    protocol.c:(.text+0x786): undefined reference to `inet6_stream_ops'

Fix this by checking for CONFIG_MPTCP_IPV6 instead of CONFIG_IPV6, like
is done in all other places in the mptcp code.

Fixes: 8ab183deb2 ("mptcp: cope with later TCP fallback")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-30 09:55:37 +01:00
Linus Torvalds
22b17db4ea y2038: core, driver and file system changes
These are updates to device drivers and file systems that for some reason
 or another were not included in the kernel in the previous y2038 series.
 
 I've gone through all users of time_t again to make sure the kernel is
 in a long-term maintainable state, replacing all remaining references
 to time_t with safe alternatives.
 
 Some related parts of the series were picked up into the nfsd, xfs,
 alsa and v4l2 trees. A final set of patches in linux-mm removes the now
 unused time_t/timeval/timespec types and helper functions after all five
 branches are merged for linux-5.6, ensuring that no new users get merged.
 
 As a result, linux-5.6, or my backport of the patches to 5.4 [1], should
 be the first release that can serve as a base for a 32-bit system designed
 to run beyond year 2038, with a few remaining caveats:
 
 - All user space must be compiled with a 64-bit time_t, which will be
   supported in the coming musl-1.2 and glibc-2.32 releases, along with
   installed kernel headers from linux-5.6 or higher.
 
 - Applications that use the system call interfaces directly need to be
   ported to use the time64 syscalls added in linux-5.1 in place of the
   existing system calls. This impacts most users of futex() and seccomp()
   as well as programming languages that have their own runtime environment
   not based on libc.
 
 - Applications that use a private copy of kernel uapi header files or
   their contents may need to update to the linux-5.6 version, in
   particular for sound/asound.h, xfs/xfs_fs.h, linux/input.h,
   linux/elfcore.h, linux/sockios.h, linux/timex.h and linux/can/bcm.h.
 
 - A few remaining interfaces cannot be changed to pass a 64-bit time_t
   in a compatible way, so they must be configured to use CLOCK_MONOTONIC
   times or (with a y2106 problem) unsigned 32-bit timestamps. Most
   importantly this impacts all users of 'struct input_event'.
 
 - All y2038 problems that are present on 64-bit machines also apply to
   32-bit machines. In particular this affects file systems with on-disk
   timestamps using signed 32-bit seconds: ext4 with ext3-style small
   inodes, ext2, xfs (to be fixed soon) and ufs.
 
 Changes since v1 [2]:
 
 - Add Acks I received
 - Rebase to v5.5-rc1, dropping patches that got merged already
 - Add NFS, XFS and the final three patches from another series
 - Rewrite etnaviv patches
 - Add one late revert to avoid an etnaviv regression
 
 [1] https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=y2038-endgame
 [2] https://lore.kernel.org/lkml/20191108213257.3097633-1-arnd@arndb.de/
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
 iQIcBAABCAAGBQJeMYy3AAoJEGCrR//JCVInEGwP/0R+S+ok7vw9OdLVT0lFl07D
 IcVabgOWf24imN7m7L7Mlt3nDfxIT4tMpiAXq7eMO3spcyViG18O2LXdSQ4/7QBp
 +BlhoMjOP9w34Jyd7mnkFr4vqQALvfIqkS8rFObDtDub2Rfj9PC36MRMIu8BPXlv
 RK8bigwJeH/DV38yc5/JeUcD+WuewYLsK9XPWN+4yB4vgGsNU3ZQQ6nnzbR3hMsN
 DN8WZ68Y7IBs0Kyxkf+s2zmRXtCa2RiFg/2TUsk5olVAJVaenvte69hq5RSbg1vW
 vLi6K8cBoPWL59nqCzcNE+TUhSUg3LOj/a/KWyl76yovz7AlJaNjssOf8ZjHw6sL
 MhQqz3hXTxiJDS2Jvbf1yojiYGlzrq/gqcRFGe9jPcZdieMc4/yZCx60G/Exa5Pu
 YdMcqMyDWPFyUAFQNWEF59HPheOdj6tb1KpJ6bwgCo3P7QqhLrU4z9w3Py4/ZfBO
 4sWcWteSsD6MN/ADJ2WQ56nNxzM2AvkeVJKcF6FCkdngXX9T0GExmZz7SqB5Du99
 9lNjIiD5E+LBa/Swo/7n49aYa8x06V1pmHYTZVh9Wkl+CZiO21umezQFrWsfaMTp
 xt3c6pFdMG5xNMGpreTAXOmf2R+T6O8IO2qQq/TYjzqOLH7QC830P7avkmml+cK1
 LjOBE2TfSeO8Ru1dXV4t
 =wx0A
 -----END PGP SIGNATURE-----

Merge tag 'y2038-drivers-for-v5.6-signed' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground

Pull y2038 updates from Arnd Bergmann:
 "Core, driver and file system changes

  These are updates to device drivers and file systems that for some
  reason or another were not included in the kernel in the previous
  y2038 series.

  I've gone through all users of time_t again to make sure the kernel is
  in a long-term maintainable state, replacing all remaining references
  to time_t with safe alternatives.

  Some related parts of the series were picked up into the nfsd, xfs,
  alsa and v4l2 trees. A final set of patches in linux-mm removes the
  now unused time_t/timeval/timespec types and helper functions after
  all five branches are merged for linux-5.6, ensuring that no new users
  get merged.

  As a result, linux-5.6, or my backport of the patches to 5.4 [1],
  should be the first release that can serve as a base for a 32-bit
  system designed to run beyond year 2038, with a few remaining caveats:

   - All user space must be compiled with a 64-bit time_t, which will be
     supported in the coming musl-1.2 and glibc-2.32 releases, along
     with installed kernel headers from linux-5.6 or higher.

   - Applications that use the system call interfaces directly need to
     be ported to use the time64 syscalls added in linux-5.1 in place of
     the existing system calls. This impacts most users of futex() and
     seccomp() as well as programming languages that have their own
     runtime environment not based on libc.

   - Applications that use a private copy of kernel uapi header files or
     their contents may need to update to the linux-5.6 version, in
     particular for sound/asound.h, xfs/xfs_fs.h, linux/input.h,
     linux/elfcore.h, linux/sockios.h, linux/timex.h and
     linux/can/bcm.h.

   - A few remaining interfaces cannot be changed to pass a 64-bit
     time_t in a compatible way, so they must be configured to use
     CLOCK_MONOTONIC times or (with a y2106 problem) unsigned 32-bit
     timestamps. Most importantly this impacts all users of 'struct
     input_event'.

   - All y2038 problems that are present on 64-bit machines also apply
     to 32-bit machines. In particular this affects file systems with
     on-disk timestamps using signed 32-bit seconds: ext4 with
     ext3-style small inodes, ext2, xfs (to be fixed soon) and ufs"

[1] https://git.kernel.org/pub/scm/linux/kernel/git/arnd/playground.git/log/?h=y2038-endgame

* tag 'y2038-drivers-for-v5.6-signed' of git://git.kernel.org:/pub/scm/linux/kernel/git/arnd/playground: (21 commits)
  Revert "drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC"
  y2038: sh: remove timeval/timespec usage from headers
  y2038: sparc: remove use of struct timex
  y2038: rename itimerval to __kernel_old_itimerval
  y2038: remove obsolete jiffies conversion functions
  nfs: fscache: use timespec64 in inode auxdata
  nfs: fix timstamp debug prints
  nfs: use time64_t internally
  sunrpc: convert to time64_t for expiry
  drm/etnaviv: avoid deprecated timespec
  drm/etnaviv: reject timeouts with tv_nsec >= NSEC_PER_SEC
  drm/msm: avoid using 'timespec'
  hfs/hfsplus: use 64-bit inode timestamps
  hostfs: pass 64-bit timestamps to/from user space
  packet: clarify timestamp overflow
  tsacct: add 64-bit btime field
  acct: stop using get_seconds()
  um: ubd: use 64-bit time_t where possible
  xtensa: ISS: avoid struct timeval
  dlm: use SO_SNDTIMEO_NEW instead of SO_SNDTIMEO_OLD
  ...
2020-01-29 14:55:47 -08:00
Davidlohr Bueso
fe66483156 Bluetooth: optimize barrier usage for Rmw atomics
Use smp_mb__before_atomic() instead of smp_mb() and avoid the
unnecessary barrier for non LL/SC architectures, such as x86.

Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-29 19:50:44 +01:00
Kadlecsik József
5038517119 netfilter: ipset: fix suspicious RCU usage in find_set_and_id
find_set_and_id() is called when the NFNL_SUBSYS_IPSET mutex is held.
However, in the error path there can be a follow-up recvmsg() without
the mutex held. Use the start() function of struct netlink_dump_control
instead of dump() to verify and report if the specified set does not
exist.

Thanks to Pablo Neira Ayuso for helping me to understand the subleties
of the netlink protocol.

Reported-by: syzbot+fc69d7cb21258ab4ae4d@syzkaller.appspotmail.com
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-29 18:34:46 +01:00
Florian Westphal
ae2dd71649 mptcp: handle tcp fallback when using syn cookies
We can't deal with syncookie mode yet, the syncookie rx path will create
tcp reqsk, i.e. we get OOB access because we treat tcp reqsk as mptcp reqsk one:

TCP: SYN flooding on port 20002. Sending cookies.
BUG: KASAN: slab-out-of-bounds in subflow_syn_recv_sock+0x451/0x4d0 net/mptcp/subflow.c:191
Read of size 1 at addr ffff8881167bc148 by task syz-executor099/2120
 subflow_syn_recv_sock+0x451/0x4d0 net/mptcp/subflow.c:191
 tcp_get_cookie_sock+0xcf/0x520 net/ipv4/syncookies.c:209
 cookie_v6_check+0x15a5/0x1e90 net/ipv6/syncookies.c:252
 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1123 [inline]
 [..]

Bug can be reproduced via "sysctl net.ipv4.tcp_syncookies=2".

Note that MPTCP should work with syncookies (4th ack would carry needed
state), but it appears better to sort that out in -next so do tcp
fallback for now.

I removed the MPTCP ifdef for tcp_rsk "is_mptcp" member because
if (IS_ENABLED()) is easier to read than "#ifdef IS_ENABLED()/#endif" pair.

Cc: Eric Dumazet <edumazet@google.com>
Fixes: cec37a6e41 ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Reported-by: Christoph Paasch <cpaasch@apple.com>
Tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-29 17:45:20 +01:00
Florian Westphal
b2c5b614ca mptcp: avoid a lockdep splat when mcast group was joined
syzbot triggered following lockdep splat:

ffffffff82d2cd40 (rtnl_mutex){+.+.}, at: ip_mc_drop_socket+0x52/0x180
but task is already holding lock:
ffff8881187a2310 (sk_lock-AF_INET){+.+.}, at: mptcp_close+0x18/0x30

which lock already depends on the new lock.

the existing dependency chain (in reverse order) is:
-> #1 (sk_lock-AF_INET){+.+.}:
       lock_acquire+0xee/0x230
       lock_sock_nested+0x89/0xc0
       do_ip_setsockopt.isra.0+0x335/0x22f0
       ip_setsockopt+0x35/0x60
       tcp_setsockopt+0x5d/0x90
       __sys_setsockopt+0xf3/0x190
       __x64_sys_setsockopt+0x61/0x70
       do_syscall_64+0x72/0x300
       entry_SYSCALL_64_after_hwframe+0x49/0xbe
-> #0 (rtnl_mutex){+.+.}:
       check_prevs_add+0x2b7/0x1210
       __lock_acquire+0x10b6/0x1400
       lock_acquire+0xee/0x230
       __mutex_lock+0x120/0xc70
       ip_mc_drop_socket+0x52/0x180
       inet_release+0x36/0xe0
       __sock_release+0xfd/0x130
       __mptcp_close+0xa8/0x1f0
       inet_release+0x7f/0xe0
       __sock_release+0x69/0x130
       sock_close+0x18/0x20
       __fput+0x179/0x400
       task_work_run+0xd5/0x110
       do_exit+0x685/0x1510
       do_group_exit+0x7e/0x170
       __x64_sys_exit_group+0x28/0x30
       do_syscall_64+0x72/0x300
       entry_SYSCALL_64_after_hwframe+0x49/0xbe

The trigger is:
  socket(AF_INET, SOCK_STREAM, 0x106 /* IPPROTO_MPTCP */) = 4
  setsockopt(4, SOL_IP, MCAST_JOIN_GROUP, {gr_interface=7, gr_group={sa_family=AF_INET, sin_port=htons(20003), sin_addr=inet_addr("224.0.0.2")}}, 136) = 0
  exit(0)

Which results in a call to rtnl_lock while we are holding
the parent mptcp socket lock via
mptcp_close -> lock_sock(msk) -> inet_release -> ip_mc_drop_socket -> rtnl_lock().

>From lockdep point of view we thus have both
'rtnl_lock; lock_sock' and 'lock_sock; rtnl_lock'.

Fix this by stealing the msk conn_list and doing the subflow close
without holding the msk lock.

Fixes: cec37a6e41 ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-29 17:45:20 +01:00
Florian Westphal
50e741bb3b mptcp: fix panic on user pointer access
Its not possible to call the kernel_(s|g)etsockopt functions here,
the address points to user memory:

General protection fault in user access. Non-canonical address?
WARNING: CPU: 1 PID: 5352 at arch/x86/mm/extable.c:77 ex_handler_uaccess+0xba/0xe0 arch/x86/mm/extable.c:77
Kernel panic - not syncing: panic_on_warn set ...
[..]
Call Trace:
 fixup_exception+0x9d/0xcd arch/x86/mm/extable.c:178
 general_protection+0x2d/0x40 arch/x86/entry/entry_64.S:1202
 do_ip_getsockopt+0x1f6/0x1860 net/ipv4/ip_sockglue.c:1323
 ip_getsockopt+0x87/0x1c0 net/ipv4/ip_sockglue.c:1561
 tcp_getsockopt net/ipv4/tcp.c:3691 [inline]
 tcp_getsockopt+0x8c/0xd0 net/ipv4/tcp.c:3685
 kernel_getsockopt+0x121/0x1f0 net/socket.c:3736
 mptcp_getsockopt+0x69/0x90 net/mptcp/protocol.c:830
 __sys_getsockopt+0x13a/0x220 net/socket.c:2175

We can call tcp_get/setsockopt functions instead.  Doing so fixes
crashing, but still leaves rtnl related lockdep splat:

     WARNING: possible circular locking dependency detected
     5.5.0-rc6 #2 Not tainted
     ------------------------------------------------------
     syz-executor.0/16334 is trying to acquire lock:
     ffffffff84f7a080 (rtnl_mutex){+.+.}, at: do_ip_setsockopt.isra.0+0x277/0x3820 net/ipv4/ip_sockglue.c:644
     but task is already holding lock:
     ffff888116503b90 (sk_lock-AF_INET){+.+.}, at: lock_sock include/net/sock.h:1516 [inline]
     ffff888116503b90 (sk_lock-AF_INET){+.+.}, at: mptcp_setsockopt+0x28/0x90 net/mptcp/protocol.c:1284

     which lock already depends on the new lock.
     the existing dependency chain (in reverse order) is:

     -> #1 (sk_lock-AF_INET){+.+.}:
            lock_sock_nested+0xca/0x120 net/core/sock.c:2944
            lock_sock include/net/sock.h:1516 [inline]
            do_ip_setsockopt.isra.0+0x281/0x3820 net/ipv4/ip_sockglue.c:645
            ip_setsockopt+0x44/0xf0 net/ipv4/ip_sockglue.c:1248
            udp_setsockopt+0x5d/0xa0 net/ipv4/udp.c:2639
            __sys_setsockopt+0x152/0x240 net/socket.c:2130
            __do_sys_setsockopt net/socket.c:2146 [inline]
            __se_sys_setsockopt net/socket.c:2143 [inline]
            __x64_sys_setsockopt+0xba/0x150 net/socket.c:2143
            do_syscall_64+0xbd/0x5b0 arch/x86/entry/common.c:294
            entry_SYSCALL_64_after_hwframe+0x49/0xbe

     -> #0 (rtnl_mutex){+.+.}:
            check_prev_add kernel/locking/lockdep.c:2475 [inline]
            check_prevs_add kernel/locking/lockdep.c:2580 [inline]
            validate_chain kernel/locking/lockdep.c:2970 [inline]
            __lock_acquire+0x1fb2/0x4680 kernel/locking/lockdep.c:3954
            lock_acquire+0x127/0x330 kernel/locking/lockdep.c:4484
            __mutex_lock_common kernel/locking/mutex.c:956 [inline]
            __mutex_lock+0x158/0x1340 kernel/locking/mutex.c:1103
            do_ip_setsockopt.isra.0+0x277/0x3820 net/ipv4/ip_sockglue.c:644
            ip_setsockopt+0x44/0xf0 net/ipv4/ip_sockglue.c:1248
            tcp_setsockopt net/ipv4/tcp.c:3159 [inline]
            tcp_setsockopt+0x8c/0xd0 net/ipv4/tcp.c:3153
            kernel_setsockopt+0x121/0x1f0 net/socket.c:3767
            mptcp_setsockopt+0x69/0x90 net/mptcp/protocol.c:1288
            __sys_setsockopt+0x152/0x240 net/socket.c:2130
            __do_sys_setsockopt net/socket.c:2146 [inline]
            __se_sys_setsockopt net/socket.c:2143 [inline]
            __x64_sys_setsockopt+0xba/0x150 net/socket.c:2143
            do_syscall_64+0xbd/0x5b0 arch/x86/entry/common.c:294
            entry_SYSCALL_64_after_hwframe+0x49/0xbe

     other info that might help us debug this:

      Possible unsafe locking scenario:

            CPU0                    CPU1
            ----                    ----
       lock(sk_lock-AF_INET);
                                    lock(rtnl_mutex);
                                    lock(sk_lock-AF_INET);
       lock(rtnl_mutex);

The lockdep complaint is because we hold mptcp socket lock when calling
the sk_prot get/setsockopt handler, and those might need to acquire the
rtnl mutex.  Normally, order is:

rtnl_lock(sk) -> lock_sock

Whereas for mptcp the order is

lock_sock(mptcp_sk) rtnl_lock -> lock_sock(subflow_sk)

We can avoid this by releasing the mptcp socket lock early, but, as Paolo
points out, we need to get/put the subflow socket refcount before doing so
to avoid race with concurrent close().

Fixes: 717e79c867 ("mptcp: Add setsockopt()/getsockopt() socket operations")
Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-29 17:45:19 +01:00
Florian Westphal
c9fd9c5f4b mptcp: defer freeing of cached ext until last moment
access to msk->cached_ext is only legal if the msk is locked or all
concurrent accesses are impossible.

Furthermore, once we start to tear down, we must make sure nothing else
can step in and allocate a new cached ext.

So place this code in the destroy callback where it belongs.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-29 17:45:19 +01:00
Joe Perches
793da4bfba sch_choke: Use kvcalloc
Convert the use of kvmalloc_array with __GFP_ZERO to
the equivalent kvcalloc.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-29 11:58:10 +01:00
David S. Miller
f6f7d8cf55 mptcp: Fix build with PROC_FS disabled.
net/mptcp/subflow.c: In function ‘mptcp_subflow_create_socket’:
net/mptcp/subflow.c:624:25: error: ‘struct netns_core’ has no member named ‘sock_inuse’

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-29 10:39:23 +01:00
Manish Mandlik
6c08fc896b Bluetooth: Fix refcount use-after-free issue
There is no lock preventing both l2cap_sock_release() and
chan->ops->close() from running at the same time.

If we consider Thread A running l2cap_chan_timeout() and Thread B running
l2cap_sock_release(), expected behavior is:
  A::l2cap_chan_timeout()->l2cap_chan_close()->l2cap_sock_teardown_cb()
  A::l2cap_chan_timeout()->l2cap_sock_close_cb()->l2cap_sock_kill()
  B::l2cap_sock_release()->sock_orphan()
  B::l2cap_sock_release()->l2cap_sock_kill()

where,
sock_orphan() clears "sk->sk_socket" and l2cap_sock_teardown_cb() marks
socket as SOCK_ZAPPED.

In l2cap_sock_kill(), there is an "if-statement" that checks if both
sock_orphan() and sock_teardown() has been run i.e. sk->sk_socket is NULL
and socket is marked as SOCK_ZAPPED. Socket is killed if the condition is
satisfied.

In the race condition, following occurs:
  A::l2cap_chan_timeout()->l2cap_chan_close()->l2cap_sock_teardown_cb()
  B::l2cap_sock_release()->sock_orphan()
  B::l2cap_sock_release()->l2cap_sock_kill()
  A::l2cap_chan_timeout()->l2cap_sock_close_cb()->l2cap_sock_kill()

In this scenario, "if-statement" is true in both B::l2cap_sock_kill() and
A::l2cap_sock_kill() and we hit "refcount: underflow; use-after-free" bug.

Similar condition occurs at other places where teardown/sock_kill is
happening:
  l2cap_disconnect_rsp()->l2cap_chan_del()->l2cap_sock_teardown_cb()
  l2cap_disconnect_rsp()->l2cap_sock_close_cb()->l2cap_sock_kill()

  l2cap_conn_del()->l2cap_chan_del()->l2cap_sock_teardown_cb()
  l2cap_conn_del()->l2cap_sock_close_cb()->l2cap_sock_kill()

  l2cap_disconnect_req()->l2cap_chan_del()->l2cap_sock_teardown_cb()
  l2cap_disconnect_req()->l2cap_sock_close_cb()->l2cap_sock_kill()

  l2cap_sock_cleanup_listen()->l2cap_chan_close()->l2cap_sock_teardown_cb()
  l2cap_sock_cleanup_listen()->l2cap_sock_kill()

Protect teardown/sock_kill and orphan/sock_kill by adding hold_lock on
l2cap channel to ensure that the socket is killed only after marked as
zapped and orphan.

Signed-off-by: Manish Mandlik <mmandlik@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-29 04:53:12 +01:00
Linus Torvalds
bd2463ac7d Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Add WireGuard

 2) Add HE and TWT support to ath11k driver, from John Crispin.

 3) Add ESP in TCP encapsulation support, from Sabrina Dubroca.

 4) Add variable window congestion control to TIPC, from Jon Maloy.

 5) Add BCM84881 PHY driver, from Russell King.

 6) Start adding netlink support for ethtool operations, from Michal
    Kubecek.

 7) Add XDP drop and TX action support to ena driver, from Sameeh
    Jubran.

 8) Add new ipv4 route notifications so that mlxsw driver does not have
    to handle identical routes itself. From Ido Schimmel.

 9) Add BPF dynamic program extensions, from Alexei Starovoitov.

10) Support RX and TX timestamping in igc, from Vinicius Costa Gomes.

11) Add support for macsec HW offloading, from Antoine Tenart.

12) Add initial support for MPTCP protocol, from Christoph Paasch,
    Matthieu Baerts, Florian Westphal, Peter Krystad, and many others.

13) Add Octeontx2 PF support, from Sunil Goutham, Geetha sowjanya, Linu
    Cherian, and others.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1469 commits)
  net: phy: add default ARCH_BCM_IPROC for MDIO_BCM_IPROC
  udp: segment looped gso packets correctly
  netem: change mailing list
  qed: FW 8.42.2.0 debug features
  qed: rt init valid initialization changed
  qed: Debug feature: ilt and mdump
  qed: FW 8.42.2.0 Add fw overlay feature
  qed: FW 8.42.2.0 HSI changes
  qed: FW 8.42.2.0 iscsi/fcoe changes
  qed: Add abstraction for different hsi values per chip
  qed: FW 8.42.2.0 Additional ll2 type
  qed: Use dmae to write to widebus registers in fw_funcs
  qed: FW 8.42.2.0 Parser offsets modified
  qed: FW 8.42.2.0 Queue Manager changes
  qed: FW 8.42.2.0 Expose new registers and change windows
  qed: FW 8.42.2.0 Internal ram offsets modifications
  MAINTAINERS: Add entry for Marvell OcteonTX2 Physical Function driver
  Documentation: net: octeontx2: Add RVU HW and drivers overview
  octeontx2-pf: ethtool RSS config support
  octeontx2-pf: Add basic ethtool support
  ...
2020-01-28 16:02:33 -08:00
Linus Torvalds
c677124e63 Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
 "These were the main changes in this cycle:

   - More -rt motivated separation of CONFIG_PREEMPT and
     CONFIG_PREEMPTION.

   - Add more low level scheduling topology sanity checks and warnings
     to filter out nonsensical topologies that break scheduling.

   - Extend uclamp constraints to influence wakeup CPU placement

   - Make the RT scheduler more aware of asymmetric topologies and CPU
     capacities, via uclamp metrics, if CONFIG_UCLAMP_TASK=y

   - Make idle CPU selection more consistent

   - Various fixes, smaller cleanups, updates and enhancements - please
     see the git log for details"

* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
  sched/fair: Define sched_idle_cpu() only for SMP configurations
  sched/topology: Assert non-NUMA topology masks don't (partially) overlap
  idle: fix spelling mistake "iterrupts" -> "interrupts"
  sched/fair: Remove redundant call to cpufreq_update_util()
  sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled
  sched/fair: Fix sgc->{min,max}_capacity calculation for SD_OVERLAP
  sched/fair: calculate delta runnable load only when it's needed
  sched/cputime: move rq parameter in irqtime_account_process_tick
  stop_machine: Make stop_cpus() static
  sched/debug: Reset watchdog on all CPUs while processing sysrq-t
  sched/core: Fix size of rq::uclamp initialization
  sched/uclamp: Fix a bug in propagating uclamp value in new cgroups
  sched/fair: Load balance aggressively for SCHED_IDLE CPUs
  sched/fair : Improve update_sd_pick_busiest for spare capacity case
  watchdog: Remove soft_lockup_hrtimer_cnt and related code
  sched/rt: Make RT capacity-aware
  sched/fair: Make EAS wakeup placement consider uclamp restrictions
  sched/fair: Make task_fits_capacity() consider uclamp restrictions
  sched/uclamp: Rename uclamp_util_with() into uclamp_rq_util_with()
  sched/uclamp: Make uclamp util helpers use and return UL values
  ...
2020-01-28 10:07:09 -08:00
Linus Torvalds
c0e809e244 Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull perf updates from Ingo Molnar:
 "Kernel side changes:

   - Ftrace is one of the last W^X violators (after this only KLP is
     left). These patches move it over to the generic text_poke()
     interface and thereby get rid of this oddity. This requires a
     surprising amount of surgery, by Peter Zijlstra.

   - x86/AMD PMUs: add support for 'Large Increment per Cycle Events' to
     count certain types of events that have a special, quirky hw ABI
     (by Kim Phillips)

   - kprobes fixes by Masami Hiramatsu

  Lots of tooling updates as well, the following subcommands were
  updated: annotate/report/top, c2c, clang, record, report/top TUI,
  sched timehist, tests; plus updates were done to the gtk ui, libperf,
  headers and the parser"

* 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
  perf/x86/amd: Add support for Large Increment per Cycle Events
  perf/x86/amd: Constrain Large Increment per Cycle events
  perf/x86/intel/rapl: Add Comet Lake support
  tracing: Initialize ret in syscall_enter_define_fields()
  perf header: Use last modification time for timestamp
  perf c2c: Fix return type for histogram sorting comparision functions
  perf beauty sockaddr: Fix augmented syscall format warning
  perf/ui/gtk: Fix gtk2 build
  perf ui gtk: Add missing zalloc object
  perf tools: Use %define api.pure full instead of %pure-parser
  libperf: Setup initial evlist::all_cpus value
  perf report: Fix no libunwind compiled warning break s390 issue
  perf tools: Support --prefix/--prefix-strip
  perf report: Clarify in help that --children is default
  tools build: Fix test-clang.cpp with Clang 8+
  perf clang: Fix build with Clang 9
  kprobes: Fix optimize_kprobe()/unoptimize_kprobe() cancellation logic
  tools lib: Fix builds when glibc contains strlcpy()
  perf report/top: Make 'e' visible in the help and make it toggle showing callchains
  perf report/top: Do not offer annotation for symbols without samples
  ...
2020-01-28 09:44:15 -08:00
Linus Torvalds
d99391ec2b Merge branch 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RCU updates from Ingo Molnar:
 "The RCU changes in this cycle were:
   - Expedited grace-period updates
   - kfree_rcu() updates
   - RCU list updates
   - Preemptible RCU updates
   - Torture-test updates
   - Miscellaneous fixes
   - Documentation updates"

* 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (69 commits)
  rcu: Remove unused stop-machine #include
  powerpc: Remove comment about read_barrier_depends()
  .mailmap: Add entries for old paulmck@kernel.org addresses
  srcu: Apply *_ONCE() to ->srcu_last_gp_end
  rcu: Switch force_qs_rnp() to for_each_leaf_node_cpu_mask()
  rcu: Move rcu_{expedited,normal} definitions into rcupdate.h
  rcu: Move gp_state_names[] and gp_state_getname() to tree_stall.h
  rcu: Remove the declaration of call_rcu() in tree.h
  rcu: Fix tracepoint tracking RCU CPU kthread utilization
  rcu: Fix harmless omission of "CONFIG_" from #if condition
  rcu: Avoid tick_dep_set_cpu() misordering
  rcu: Provide wrappers for uses of ->rcu_read_lock_nesting
  rcu: Use READ_ONCE() for ->expmask in rcu_read_unlock_special()
  rcu: Clear ->rcu_read_unlock_special only once
  rcu: Clear .exp_hint only when deferred quiescent state has been reported
  rcu: Rename some instance of CONFIG_PREEMPTION to CONFIG_PREEMPT_RCU
  rcu: Remove kfree_call_rcu_nobatch()
  rcu: Remove kfree_rcu() special casing and lazy-callback handling
  rcu: Add support for debug_objects debugging for kfree_rcu()
  rcu: Add multiple in-flight batches of kfree_rcu() work
  ...
2020-01-28 08:46:13 -08:00
Christophe JAILLET
151129df2f Bluetooth: SMP: Fix SALT value in some comments
Salts are 16 bytes long.
Remove some extra and erroneous '0' in the human readable format used
in comments.

Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-28 08:02:29 +01:00
Jeff Layton
24604f7e2b ceph: move net/ceph/ceph_fs.c to fs/ceph/util.c
All of these functions are only called from CephFS, so move them into
ceph.ko, and drop the exports.

Signed-off-by: Jeff Layton <jlayton@kernel.org>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:40 +01:00
Luis Henriques
78beb0ff2f ceph: use copy-from2 op in copy_file_range
Instead of using the copy-from operation, switch copy_file_range to the
new copy-from2 operation, which allows to send the truncate_seq and
truncate_size parameters.

If an OSD does not support the copy-from2 operation it will return
-EOPNOTSUPP.  In that case, the kernel client will stop trying to do
remote object copies for this fs client and will always use the generic
VFS copy_file_range.

Signed-off-by: Luis Henriques <lhenriques@suse.com>
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2020-01-27 16:53:40 +01:00
David S. Miller
9e0703a265 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-01-27

The following pull-request contains BPF updates for your *net-next* tree.

We've added 20 non-merge commits during the last 5 day(s) which contain
a total of 24 files changed, 433 insertions(+), 104 deletions(-).

The main changes are:

1) Make BPF trampolines and dispatcher aware for the stack unwinder, from Jiri Olsa.

2) Improve handling of failed CO-RE relocations in libbpf, from Andrii Nakryiko.

3) Several fixes to BPF sockmap and reuseport selftests, from Lorenz Bauer.

4) Various cleanups in BPF devmap's XDP flush code, from John Fastabend.

5) Fix BPF flow dissector when used with port ranges, from Yoshiki Komachi.

6) Fix bpffs' map_seq_next callback to always inc position index, from Vasily Averin.

7) Allow overriding LLVM tooling for runqslower utility, from Andrey Ignatov.

8) Silence false-positive lockdep splats in devmap hash lookup, from Amol Grover.

9) Fix fentry/fexit selftests to initialize a variable before use, from John Sperbeck.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 14:31:40 +01:00
David S. Miller
c312840cd7 Revert "pktgen: Allow configuration of IPv6 source address range"
This reverts commit 7786a1af2a.

It causes build failures on 32-bit, for example:

   net/core/pktgen.o: In function `mod_cur_headers':
>> pktgen.c:(.text.mod_cur_headers+0xba0): undefined reference to `__umoddi3'

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 13:49:33 +01:00
Leon Romanovsky
6a7e25c7fb net/core: Replace driver version to be kernel version
In order to stop useless driver version bumps and unify output
presented by ethtool -i, let's set default version string.

As Linus said in [1]: "Things are supposed to be backwards and
forwards compatible, because we don't accept breakage in user
space anyway. So versioning is pointless, and only causes
problems."

They cause problems when users start to see version changes
and expect specific set of features which will be different
for stable@, vanilla and distribution kernels.

Distribution kernels are based on some kernel version with extra
patches on top, for example, in RedHat world this "extra" is a lot
and for them your driver version say nothing. Users who run vanilla
kernels won't use driver version information too, because running
such kernels requires knowledge and understanding.

Another set of problems are related to difference in versioning scheme
and such doesn't allow to write meaningful automation which will work
sanely on all ethtool capable devices.

Before this change:
[leonro@erver ~]$ ethtool -i eth0
driver: virtio_net
version: 1.0.0
After this change and once ->version assignment will be deleted
from virtio_net:
[leonro@server ~]$ ethtool -i eth0
driver: virtio_net
version: 5.5.0-rc6+

Link: https://lore.kernel.org/ksummit-discuss/CA+55aFx9A=5cc0QZ7CySC4F2K7eYaEfzkdYEc9JaNgCcV25=rg@mail.gmail.com/
Link: https://lore.kernel.org/linux-rdma/20200122152627.14903-1-michal.kalderon@marvell.com/T/#md460ff8f976c532a89d6860411c3c50bb811038b
Link: https://lore.kernel.org/linux-rdma/20200127060835.GA570@unicorn.suse.cz
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 13:47:22 +01:00
Michal Kubecek
67bffa7923 ethtool: add WOL_NTF notification
Send ETHTOOL_MSG_WOL_NTF notification whenever wake-on-lan settings of
a device are modified using ETHTOOL_MSG_WOL_SET netlink message or
ETHTOOL_SWOL ioctl request.

As notifications can be received by anyone, do not include SecureOn(tm)
password in notification messages.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:31:36 +01:00
Michal Kubecek
8d425b19b3 ethtool: set wake-on-lan settings with WOL_SET request
Implement WOL_SET netlink request to set wake-on-lan settings. This is
equivalent to ETHTOOL_SWOL ioctl request.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:31:36 +01:00
Michal Kubecek
51ea22b04e ethtool: provide WoL settings with WOL_GET request
Implement WOL_GET request to get wake-on-lan settings for a device,
traditionally available via ETHTOOL_GWOL ioctl request.

As part of the implementation, provide symbolic names for wake-on-line
modes as ETH_SS_WOL_MODES string set.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:31:36 +01:00
Michal Kubecek
0bda7af39d ethtool: add DEBUG_NTF notification
Send ETHTOOL_MSG_DEBUG_NTF notification message whenever debugging message
mask for a device are modified using ETHTOOL_MSG_DEBUG_SET netlink message
or ETHTOOL_SMSGLVL ioctl request.

The notification message has the same format as reply to DEBUG_GET request.
As with other ethtool notifications, netlink requests only trigger the
notification if the mask is actually changed while ioctl request trigger it
whenever the request results in calling the ethtool_ops handler.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:31:36 +01:00
Michal Kubecek
e54d04e3af ethtool: set message mask with DEBUG_SET request
Implement DEBUG_SET netlink request to set debugging settings for a device.
At the moment, only message mask corresponding to message level as set by
ETHTOOL_SMSGLVL ioctl request can be set. (It is called message level in
ioctl interface but almost all drivers interpret it as a bit mask.)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:31:35 +01:00
Michal Kubecek
6a94b8ccf6 ethtool: provide message mask with DEBUG_GET request
Implement DEBUG_GET request to get debugging settings for a device. At the
moment, only message mask corresponding to message level as reported by
ETHTOOL_GMSGLVL ioctl request is provided. (It is called message level in
ioctl interface but almost all drivers interpret it as a bit mask.)

As part of the implementation, provide symbolic names for message mask bits
as ETH_SS_MSG_CLASSES string set.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:31:35 +01:00
Michal Kubecek
d2c4b444fd ethtool: fix kernel-doc descriptions
Fix missing or incorrect function argument and struct member descriptions.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:31:35 +01:00
Yoshiki Komachi
59fb9b62fb flow_dissector: Fix to use new variables for port ranges in bpf hook
This patch applies new flag (FLOW_DISSECTOR_KEY_PORTS_RANGE) and
field (tp_range) to BPF flow dissector to generate appropriate flow
keys when classified by specified port ranges.

Fixes: 8ffb055bea ("cls_flower: Fix the behavior using port ranges with hw-offload")
Signed-off-by: Yoshiki Komachi <komachi.yoshiki@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Petar Penkov <ppenkov@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200117070533.402240-2-komachi.yoshiki@gmail.com
2020-01-27 11:25:07 +01:00
David S. Miller
c4c57b974d Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2020-01-26

Here's (probably) the last bluetooth-next pull request for the 5.6 kernel.

 - Initial pieces of Bluetooth 5.2 Isochronous Channels support
 - mgmt: Various cleanups and a new Set Blocked Keys command
 - btusb: Added support for 04ca:3021 QCA_ROME device
 - hci_qca: Multiple fixes & cleanups
 - hci_bcm: Fixes & improved device tree support
 - Fixed attempts to create duplicate debugfs entries

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:24:46 +01:00
Vladimir Oltean
6dc43cd3aa net: dsa: Fix use-after-free in probing of DSA switch tree
DSA sets up a switch tree little by little. Every switch of the N
members of the tree calls dsa_register_switch, and (N - 1) will just
touch the dst->ports list with their ports and quickly exit. Only the
last switch that calls dsa_register_switch will find all DSA links
complete in dsa_tree_setup_routing_table, and not return zero as a
result but instead go ahead and set up the entire DSA switch tree
(practically on behalf of the other switches too).

The trouble is that the (N - 1) switches don't clean up after themselves
after they get an error such as EPROBE_DEFER. Their footprint left in
dst->ports by dsa_switch_touch_ports is still there. And switch N, the
one responsible with actually setting up the tree, is going to work with
those stale dp, dp->ds and dp->ds->dev pointers. In particular ds and
ds->dev might get freed by the device driver.

Be there a 2-switch tree and the following calling order:
- Switch 1 calls dsa_register_switch
  - Calls dsa_switch_touch_ports, populates dst->ports
  - Calls dsa_port_parse_cpu, gets -EPROBE_DEFER, exits.
- Switch 2 calls dsa_register_switch
  - Calls dsa_switch_touch_ports, populates dst->ports
  - Probe doesn't get deferred, so it goes ahead.
  - Calls dsa_tree_setup_routing_table, which returns "complete == true"
    due to Switch 1 having called dsa_switch_touch_ports before.
  - Because the DSA links are complete, it calls dsa_tree_setup_switches
    now.
  - dsa_tree_setup_switches iterates through dst->ports, initializing
    the Switch 1 ds structure (invalid) and the Switch 2 ds structure
    (valid).
  - Undefined behavior (use after free, sometimes NULL pointers, etc).

Real example below (debugging prints added by me, as well as guards
against NULL pointers):

[    5.477947] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803df0b980 (dev ffffff803f775c00)
[    6.313002] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803df0b980 (dev ffffff803f775c00)
[    6.319932] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803df0b980 (dev ffffff803f775c00)
[    6.329693] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803df0b980 (dev ffffff803f775c00)
[    6.339458] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803df0b980 (dev ffffff803f775c00)
[    6.349226] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803df0b980 (dev ffffff803f775c00)
[    6.358991] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803df0b980 (dev ffffff803f775c00)
[    6.368758] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803df0b980 (dev ffffff803f775c00)
[    6.378524] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803df0b980 (dev ffffff803f775c00)
[    6.388291] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803df0b980 (dev ffffff803f775c00)
[    6.398057] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803df0b980 (dev ffffff803f775c00)
[    6.407912] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803da02f80 (dev 0000000000000000)
[    6.417682] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803da02f80 (dev 0000000000000000)
[    6.427446] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803da02f80 (dev 0000000000000000)
[    6.437212] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803da02f80 (dev 0000000000000000)
[    6.446979] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803da02f80 (dev 0000000000000000)
[    6.456744] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803da02f80 (dev 0000000000000000)
[    6.466512] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803da02f80 (dev 0000000000000000)
[    6.476277] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803da02f80 (dev 0000000000000000)
[    6.486043] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803da02f80 (dev 0000000000000000)
[    6.495810] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803da02f80 (dev 0000000000000000)
[    6.505577] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803da02f80 (dev 0000000000000000)
[    6.515433] dsa_tree_setup_switches: Setting up port 0 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[    7.354120] dsa_tree_setup_switches: Setting up port 1 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[    7.361045] dsa_tree_setup_switches: Setting up port 2 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[    7.370805] dsa_tree_setup_switches: Setting up port 3 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[    7.380571] dsa_tree_setup_switches: Setting up port 4 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[    7.390337] dsa_tree_setup_switches: Setting up port 5 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[    7.400104] dsa_tree_setup_switches: Setting up port 6 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[    7.409872] dsa_tree_setup_switches: Setting up port 7 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[    7.419637] dsa_tree_setup_switches: Setting up port 8 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[    7.429403] dsa_tree_setup_switches: Setting up port 9 of switch ffffff803db15b80 (dev ffffff803d8e4800)
[    7.439169] dsa_tree_setup_switches: Setting up port 10 of switch ffffff803db15b80 (dev ffffff803d8e4800)

The solution is to recognize that the functions that call
dsa_switch_touch_ports (dsa_switch_parse_of, dsa_switch_parse) have side
effects, and therefore one should clean up their side effects on error
path. The cleanup of dst->ports was taken from dsa_switch_remove and
moved into a dedicated dsa_switch_release_ports function, which should
really be per-switch (free only the members of dst->ports that are also
members of ds, instead of all switch ports).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:12:46 +01:00
Heiner Kallweit
a85dd3a517 net: remove eth_change_mtu
All usage of this function was removed three years ago, and the
function was marked as deprecated:
a52ad514fd ("net: deprecate eth_change_mtu, remove usage")
So I think we can remove it now.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:09:31 +01:00
Jiri Pirko
93642e14bd net: introduce dev_net notifier register/unregister variants
Introduce dev_net variants of netdev notifier register/unregister functions
and allow per-net notifier to follow the netdevice into the namespace it is
moved to.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:03:44 +01:00
Jiri Pirko
1f637703d8 net: push code from net notifier reg/unreg into helpers
Push the code which is done under rtnl lock in net notifier register and
unregister function into separate helpers.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:03:44 +01:00
Jiri Pirko
48b3a1379f net: call call_netdevice_unregister_net_notifiers from unregister
The function does the same thing as the existing code, so rather call
call_netdevice_unregister_net_notifiers() instead of code duplication.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:03:44 +01:00
Kuniyuki Iwashima
cd94ef0639 soreuseport: Cleanup duplicate initialization of more_reuse->max_socks.
reuseport_grow() does not need to initialize the more_reuse->max_socks
again. It is already initialized in __reuseport_alloc().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:01:16 +01:00
Steffen Klassert
9fd1ff5d2a udp: Support UDP fraglist GRO/GSO.
This patch extends UDP GRO to support fraglist GRO/GSO
by using the previously introduced infrastructure.
If the feature is enabled, all UDP packets are going to
fraglist GRO (local input and forward).

After validating the csum,  we mark ip_summed as
CHECKSUM_UNNECESSARY for fraglist GRO packets to
make sure that the csum is not touched.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:00:21 +01:00
Steffen Klassert
3a1296a38d net: Support GRO/GSO fraglist chaining.
This patch adds the core functions to chain/unchain
GSO skbs at the frag_list pointer. This also adds
a new GSO type SKB_GSO_FRAGLIST and a is_flist
flag to napi_gro_cb which indicates that this
flow will be GROed by fraglist chaining.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:00:21 +01:00
Steffen Klassert
1a3c998f3a net: Add a netdev software feature set that defaults to off.
The previous patch added the NETIF_F_GRO_FRAGLIST feature.
This is a software feature that should default to off.
Current software features default to on, so add a new
feature set that defaults to off.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:00:21 +01:00
Steffen Klassert
3b33583265 net: Add fraglist GRO/GSO feature flags
This adds new Fraglist GRO/GSO feature flags. They will be used
to configure fraglist GRO/GSO what will be implemented with some
followup paches.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:00:21 +01:00
David Howells
122d74fac8 rxrpc: Fix use-after-free in rxrpc_receive_data()
The subpacket scanning loop in rxrpc_receive_data() references the
subpacket count in the private data part of the sk_buff in the loop
termination condition.  However, when the final subpacket is pasted into
the ring buffer, the function is no longer has a ref on the sk_buff and
should not be looking at sp->* any more.  This point is actually marked in
the code when skb is cleared (but sp is not - which is an error).

Fix this by caching sp->nr_subpackets in a local variable and using that
instead.

Also clear 'sp' to catch accesses after that point.

This can show up as an oops in rxrpc_get_skb() if sp->nr_subpackets gets
trashed by the sk_buff getting freed and reused in the meantime.

Fixes: e2de6c4048 ("rxrpc: Use info in skbuff instead of reparsing a jumbo packet")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 10:56:30 +01:00
Eric Dumazet
55cd9f67f1 net_sched: ematch: reject invalid TCF_EM_SIMPLE
It is possible for malicious userspace to set TCF_EM_SIMPLE bit
even for matches that should not have this bit set.

This can fool two places using tcf_em_is_simple()

1) tcf_em_tree_destroy() -> memory leak of em->data
   if ops->destroy() is NULL

2) tcf_em_tree_dump() wrongly report/leak 4 low-order bytes
   of a kernel pointer.

BUG: memory leak
unreferenced object 0xffff888121850a40 (size 32):
  comm "syz-executor927", pid 7193, jiffies 4294941655 (age 19.840s)
  hex dump (first 32 bytes):
    00 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<00000000f67036ea>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
    [<00000000f67036ea>] slab_post_alloc_hook mm/slab.h:586 [inline]
    [<00000000f67036ea>] slab_alloc mm/slab.c:3320 [inline]
    [<00000000f67036ea>] __do_kmalloc mm/slab.c:3654 [inline]
    [<00000000f67036ea>] __kmalloc_track_caller+0x165/0x300 mm/slab.c:3671
    [<00000000fab0cc8e>] kmemdup+0x27/0x60 mm/util.c:127
    [<00000000d9992e0a>] kmemdup include/linux/string.h:453 [inline]
    [<00000000d9992e0a>] em_nbyte_change+0x5b/0x90 net/sched/em_nbyte.c:32
    [<000000007e04f711>] tcf_em_validate net/sched/ematch.c:241 [inline]
    [<000000007e04f711>] tcf_em_tree_validate net/sched/ematch.c:359 [inline]
    [<000000007e04f711>] tcf_em_tree_validate+0x332/0x46f net/sched/ematch.c:300
    [<000000007a769204>] basic_set_parms net/sched/cls_basic.c:157 [inline]
    [<000000007a769204>] basic_change+0x1d7/0x5f0 net/sched/cls_basic.c:219
    [<00000000e57a5997>] tc_new_tfilter+0x566/0xf70 net/sched/cls_api.c:2104
    [<0000000074b68559>] rtnetlink_rcv_msg+0x3b2/0x4b0 net/core/rtnetlink.c:5415
    [<00000000b7fe53fb>] netlink_rcv_skb+0x61/0x170 net/netlink/af_netlink.c:2477
    [<00000000e83a40d0>] rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5442
    [<00000000d62ba933>] netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
    [<00000000d62ba933>] netlink_unicast+0x223/0x310 net/netlink/af_netlink.c:1328
    [<0000000088070f72>] netlink_sendmsg+0x2c0/0x570 net/netlink/af_netlink.c:1917
    [<00000000f70b15ea>] sock_sendmsg_nosec net/socket.c:639 [inline]
    [<00000000f70b15ea>] sock_sendmsg+0x54/0x70 net/socket.c:659
    [<00000000ef95a9be>] ____sys_sendmsg+0x2d0/0x300 net/socket.c:2330
    [<00000000b650f1ab>] ___sys_sendmsg+0x8a/0xd0 net/socket.c:2384
    [<0000000055bfa74a>] __sys_sendmsg+0x80/0xf0 net/socket.c:2417
    [<000000002abac183>] __do_sys_sendmsg net/socket.c:2426 [inline]
    [<000000002abac183>] __se_sys_sendmsg net/socket.c:2424 [inline]
    [<000000002abac183>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2424

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot+03c4738ed29d5d366ddf@syzkaller.appspotmail.com
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 10:55:26 +01:00
Stephen Worley
f9e9555575 net: include struct nhmsg size in nh nlmsg size
Include the size of struct nhmsg size when calculating
how much of a payload to allocate in a new netlink nexthop
notification message.

Without this, we will fail to fill the skbuff at certain nexthop
group sizes.

You can reproduce the failure with the following iproute2 commands:

ip link add dummy1 type dummy
ip link add dummy2 type dummy
ip link add dummy3 type dummy
ip link add dummy4 type dummy
ip link add dummy5 type dummy
ip link add dummy6 type dummy
ip link add dummy7 type dummy
ip link add dummy8 type dummy
ip link add dummy9 type dummy
ip link add dummy10 type dummy
ip link add dummy11 type dummy
ip link add dummy12 type dummy
ip link add dummy13 type dummy
ip link add dummy14 type dummy
ip link add dummy15 type dummy
ip link add dummy16 type dummy
ip link add dummy17 type dummy
ip link add dummy18 type dummy
ip link add dummy19 type dummy

ip ro add 1.1.1.1/32 dev dummy1
ip ro add 1.1.1.2/32 dev dummy2
ip ro add 1.1.1.3/32 dev dummy3
ip ro add 1.1.1.4/32 dev dummy4
ip ro add 1.1.1.5/32 dev dummy5
ip ro add 1.1.1.6/32 dev dummy6
ip ro add 1.1.1.7/32 dev dummy7
ip ro add 1.1.1.8/32 dev dummy8
ip ro add 1.1.1.9/32 dev dummy9
ip ro add 1.1.1.10/32 dev dummy10
ip ro add 1.1.1.11/32 dev dummy11
ip ro add 1.1.1.12/32 dev dummy12
ip ro add 1.1.1.13/32 dev dummy13
ip ro add 1.1.1.14/32 dev dummy14
ip ro add 1.1.1.15/32 dev dummy15
ip ro add 1.1.1.16/32 dev dummy16
ip ro add 1.1.1.17/32 dev dummy17
ip ro add 1.1.1.18/32 dev dummy18
ip ro add 1.1.1.19/32 dev dummy19

ip next add id 1 via 1.1.1.1 dev dummy1
ip next add id 2 via 1.1.1.2 dev dummy2
ip next add id 3 via 1.1.1.3 dev dummy3
ip next add id 4 via 1.1.1.4 dev dummy4
ip next add id 5 via 1.1.1.5 dev dummy5
ip next add id 6 via 1.1.1.6 dev dummy6
ip next add id 7 via 1.1.1.7 dev dummy7
ip next add id 8 via 1.1.1.8 dev dummy8
ip next add id 9 via 1.1.1.9 dev dummy9
ip next add id 10 via 1.1.1.10 dev dummy10
ip next add id 11 via 1.1.1.11 dev dummy11
ip next add id 12 via 1.1.1.12 dev dummy12
ip next add id 13 via 1.1.1.13 dev dummy13
ip next add id 14 via 1.1.1.14 dev dummy14
ip next add id 15 via 1.1.1.15 dev dummy15
ip next add id 16 via 1.1.1.16 dev dummy16
ip next add id 17 via 1.1.1.17 dev dummy17
ip next add id 18 via 1.1.1.18 dev dummy18
ip next add id 19 via 1.1.1.19 dev dummy19

ip next add id 1111 group 1/2/3/4/5/6/7/8/9/10/11/12/13/14/15/16/17/18/19
ip next del id 1111

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Stephen Worley <sworley@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 10:54:08 +01:00
Cong Wang
760d228e32 net_sched: walk through all child classes in tc_bind_tclass()
In a complex TC class hierarchy like this:

tc qdisc add dev eth0 root handle 1:0 cbq bandwidth 100Mbit         \
  avpkt 1000 cell 8
tc class add dev eth0 parent 1:0 classid 1:1 cbq bandwidth 100Mbit  \
  rate 6Mbit weight 0.6Mbit prio 8 allot 1514 cell 8 maxburst 20      \
  avpkt 1000 bounded

tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip \
  sport 80 0xffff flowid 1:3
tc filter add dev eth0 parent 1:0 protocol ip prio 1 u32 match ip \
  sport 25 0xffff flowid 1:4

tc class add dev eth0 parent 1:1 classid 1:3 cbq bandwidth 100Mbit  \
  rate 5Mbit weight 0.5Mbit prio 5 allot 1514 cell 8 maxburst 20      \
  avpkt 1000
tc class add dev eth0 parent 1:1 classid 1:4 cbq bandwidth 100Mbit  \
  rate 3Mbit weight 0.3Mbit prio 5 allot 1514 cell 8 maxburst 20      \
  avpkt 1000

where filters are installed on qdisc 1:0, so we can't merely
search from class 1:1 when creating class 1:3 and class 1:4. We have
to walk through all the child classes of the direct parent qdisc.
Otherwise we would miss filters those need reverse binding.

Fixes: 07d79fc7d9 ("net_sched: add reverse binding for tc class")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 10:52:46 +01:00
Cong Wang
2e24cd7555 net_sched: fix ops->bind_class() implementations
The current implementations of ops->bind_class() are merely
searching for classid and updating class in the struct tcf_result,
without invoking either of cl_ops->bind_tcf() or
cl_ops->unbind_tcf(). This breaks the design of them as qdisc's
like cbq use them to count filters too. This is why syzbot triggered
the warning in cbq_destroy_class().

In order to fix this, we have to call cl_ops->bind_tcf() and
cl_ops->unbind_tcf() like the filter binding path. This patch does
so by refactoring out two helper functions __tcf_bind_filter()
and __tcf_unbind_filter(), which are lockless and accept a Qdisc
pointer, then teaching each implementation to call them correctly.

Note, we merely pass the Qdisc pointer as an opaque pointer to
each filter, they only need to pass it down to the helper
functions without understanding it at all.

Fixes: 07d79fc7d9 ("net_sched: add reverse binding for tc class")
Reported-and-tested-by: syzbot+0a0596220218fcb603a8@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+63bdb6006961d8c917c6@syzkaller.appspotmail.com
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 10:51:43 +01:00
Stefano Brivio
3c4287f620 nf_tables: Add set type for arbitrary concatenation of ranges
This new set type allows for intervals in concatenated fields,
which are expressed in the usual way, that is, simple byte
concatenation with padding to 32 bits for single fields, and
given as ranges by specifying start and end elements containing,
each, the full concatenation of start and end values for the
single fields.

Ranges are expanded to composing netmasks, for each field: these
are inserted as rules in per-field lookup tables. Bits to be
classified are divided in 4-bit groups, and for each group, the
lookup table contains 4^2 buckets, representing all the possible
values of a bit group. This approach was inspired by the Grouper
algorithm:
	http://www.cse.usf.edu/~ligatti/projects/grouper/

Matching is performed by a sequence of AND operations between
bucket values, with buckets selected according to the value of
packet bits, for each group. The result of this sequence tells
us which rules matched for a given field.

In order to concatenate several ranged fields, per-field rules
are mapped using mapping arrays, one per field, that specify
which rules should be considered while matching the next field.
The mapping array for the last field contains a reference to
the element originally inserted.

The notes in nft_set_pipapo.c cover the algorithm in deeper
detail.

A pure hash-based approach is of no use here, as ranges need
to be classified. An implementation based on "proxying" the
existing red-black tree set type, creating a tree for each
field, was considered, but deemed impractical due to the fact
that elements would need to be shared between trees, at least
as long as we want to keep UAPI changes to a minimum.

A stand-alone implementation of this algorithm is available at:
	https://pipapo.lameexcu.se
together with notes about possible future optimisations
(in pipapo.c).

This algorithm was designed with data locality in mind, and can
be highly optimised for SIMD instruction sets, as the bulk of
the matching work is done with repetitive, simple bitwise
operations.

At this point, without further optimisations, nft_concat_range.sh
reports, for one AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB
L2$):

TEST: performance
  net,port                                                      [ OK ]
    baseline (drop from netdev hook):              10190076pps
    baseline hash (non-ranged entries):             6179564pps
    baseline rbtree (match on first field only):    2950341pps
    set with  1000 full, ranged entries:            2304165pps
  port,net                                                      [ OK ]
    baseline (drop from netdev hook):              10143615pps
    baseline hash (non-ranged entries):             6135776pps
    baseline rbtree (match on first field only):    4311934pps
    set with   100 full, ranged entries:            4131471pps
  net6,port                                                     [ OK ]
    baseline (drop from netdev hook):               9730404pps
    baseline hash (non-ranged entries):             4809557pps
    baseline rbtree (match on first field only):    1501699pps
    set with  1000 full, ranged entries:            1092557pps
  port,proto                                                    [ OK ]
    baseline (drop from netdev hook):              10812426pps
    baseline hash (non-ranged entries):             6929353pps
    baseline rbtree (match on first field only):    3027105pps
    set with 30000 full, ranged entries:             284147pps
  net6,port,mac                                                 [ OK ]
    baseline (drop from netdev hook):               9660114pps
    baseline hash (non-ranged entries):             3778877pps
    baseline rbtree (match on first field only):    3179379pps
    set with    10 full, ranged entries:            2082880pps
  net6,port,mac,proto                                           [ OK ]
    baseline (drop from netdev hook):               9718324pps
    baseline hash (non-ranged entries):             3799021pps
    baseline rbtree (match on first field only):    1506689pps
    set with  1000 full, ranged entries:             783810pps
  net,mac                                                       [ OK ]
    baseline (drop from netdev hook):              10190029pps
    baseline hash (non-ranged entries):             5172218pps
    baseline rbtree (match on first field only):    2946863pps
    set with  1000 full, ranged entries:            1279122pps

v4:
 - fix build for 32-bit architectures: 64-bit division needs
   div_u64() (kbuild test robot <lkp@intel.com>)
v3:
 - rework interface for field length specification,
   NFT_SET_SUBKEY disappears and information is stored in
   description
 - remove scratch area to store closing element of ranges,
   as elements now come with an actual attribute to specify
   the upper range limit (Pablo Neira Ayuso)
 - also remove pointer to 'start' element from mapping table,
   closing key is now accessible via extension data
 - use bytes right away instead of bits for field lengths,
   this way we can also double the inner loop of the lookup
   function to take care of upper and lower bits in a single
   iteration (minor performance improvement)
 - make it clearer that set operations are actually atomic
   API-wise, but we can't e.g. implement flush() as one-shot
   action
 - fix type for 'dup' in nft_pipapo_insert(), check for
   duplicates only in the next generation, and in general take
   care of differentiating generation mask cases depending on
   the operation (Pablo Neira Ayuso)
 - report C implementation matching rate in commit message, so
   that AVX2 implementation can be compared (Pablo Neira Ayuso)
v2:
 - protect access to scratch maps in nft_pipapo_lookup() with
   local_bh_disable/enable() (Florian Westphal)
 - drop rcu_read_lock/unlock() from nft_pipapo_lookup(), it's
   already implied (Florian Westphal)
 - explain why partial allocation failures don't need handling
   in pipapo_realloc_scratch(), rename 'm' to clone and update
   related kerneldoc to make it clear we're not operating on
   the live copy (Florian Westphal)
 - add expicit check for priv->start_elem in
   nft_pipapo_insert() to avoid ending up in nft_pipapo_walk()
   with a NULL start element, and also zero it out in every
   operation that might make it invalid, so that insertion
   doesn't proceed with an invalid element (Florian Westphal)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27 08:54:30 +01:00
Stefano Brivio
f3a2181e16 netfilter: nf_tables: Support for sets with multiple ranged fields
Introduce a new nested netlink attribute, NFTA_SET_DESC_CONCAT, used
to specify the length of each field in a set concatenation.

This allows set implementations to support concatenation of multiple
ranged items, as they can divide the input key into matching data for
every single field. Such set implementations would be selected as
they specify support for NFT_SET_INTERVAL and allow desc->field_count
to be greater than one. Explicitly disallow this for nft_set_rbtree.

In order to specify the interval for a set entry, userspace would
include in NFTA_SET_DESC_CONCAT attributes field lengths, and pass
range endpoints as two separate keys, represented by attributes
NFTA_SET_ELEM_KEY and NFTA_SET_ELEM_KEY_END.

While at it, export the number of 32-bit registers available for
packet matching, as nftables will need this to know the maximum
number of field lengths that can be specified.

For example, "packets with an IPv4 address between 192.0.2.0 and
192.0.2.42, with destination port between 22 and 25", can be
expressed as two concatenated elements:

  NFTA_SET_ELEM_KEY:            192.0.2.0 . 22
  NFTA_SET_ELEM_KEY_END:        192.0.2.42 . 25

and NFTA_SET_DESC_CONCAT attribute would contain:

  NFTA_LIST_ELEM
    NFTA_SET_FIELD_LEN:		4
  NFTA_LIST_ELEM
    NFTA_SET_FIELD_LEN:		2

v4: No changes
v3: Complete rework, NFTA_SET_DESC_CONCAT instead of NFTA_SET_SUBKEY
v2: No changes

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27 08:54:30 +01:00
Pablo Neira Ayuso
7b225d0b5c netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attribute
Add NFTA_SET_ELEM_KEY_END attribute to convey the closing element of the
interval between kernel and userspace.

This patch also adds the NFT_SET_EXT_KEY_END extension to store the
closing element value in this interval.

v4: No changes
v3: New patch

[sbrivio: refactor error paths and labels; add corresponding
  nft_set_ext_type for new key; rebase]
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27 08:54:30 +01:00
Pablo Neira Ayuso
20a1452c35 netfilter: nf_tables: add nft_setelem_parse_key()
Add helper function to parse the set element key netlink attribute.

v4: No changes
v3: New patch

[sbrivio: refactor error paths and labels; use NFT_DATA_VALUE_MAXLEN
  instead of sizeof(*key) in helper, value can be longer than that;
  rebase]
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27 08:54:30 +01:00
Abdul Kabbani
32efcc06d2 tcp: export count for rehash attempts
Using IPv6 flow-label to swiftly route around avoid congested or
disconnected network path can greatly improve TCP reliability.

This patch adds SNMP counters and a OPT_STATS counter to track both
host-level and connection-level statistics. Network administrators
can use these counters to evaluate the impact of this new ability better.

Export count for rehash attempts to
1) two SNMP counters: TcpTimeoutRehash (rehash due to timeouts),
   and TcpDuplicateDataRehash (rehash due to receiving duplicate
   packets)
2) Timestamping API SOF_TIMESTAMPING_OPT_STATS.

Signed-off-by: Abdul Kabbani <akabbani@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26 15:28:47 +01:00
David S. Miller
4d8773b68e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor conflict in mlx5 because changes happened to code that has
moved meanwhile.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26 10:40:21 +01:00
Dan Carpenter
11eb85ec42 Bluetooth: Fix race condition in hci_release_sock()
Syzbot managed to trigger a use after free "KASAN: use-after-free Write
in hci_sock_bind".  I have reviewed the code manually and one possibly
cause I have found is that we are not holding lock_sock(sk) when we do
the hci_dev_put(hdev) in hci_sock_release().  My theory is that the bind
and the release are racing against each other which results in this use
after free.

Reported-by: syzbot+eba992608adf3d796bcc@syzkaller.appspotmail.com
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-01-26 10:34:17 +02:00
David S. Miller
6badad1c1d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Missing netlink attribute sanity check for NFTA_OSF_DREG,
   from Florian Westphal.

2) Use bitmap infrastructure in ipset to fix KASAN slab-out-of-bounds
   reads, from Jozsef Kadlecsik.

3) Missing initial CLOSED state in new sctp connection through
   ctnetlink events, from Jiri Wiesner.

4) Missing check for NFT_CHAIN_HW_OFFLOAD in nf_tables offload
   indirect block infrastructure, from wenxu.

5) Add __nft_chain_type_get() to sanity check family and chain type.

6) Autoload modules from the nf_tables abort path to fix races
   reported by syzbot.

7) Remove unnecessary skb->csum update on inet_proto_csum_replace16(),
   from Praveen Chaudhary.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 21:40:39 +01:00
Marcel Holtmann
18f81241b7 Bluetooth: Move {min,max}_key_size debugfs into hci_debugfs_create_le
The debugfs entries for {min,max}_key_size are created during SMP
registration and thus it might lead to multiple attempts to create the
same entries. Avoid this by moving them to the LE controller init
section.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-01-25 16:33:52 +02:00
Marcel Holtmann
cc97400361 Bluetooth: Add missing checks for HCI_ISODATA_PKT packet type
The checks for HCI_ISODATA_PKT packet type are required in a few
additional locations to allow sending/receiving of this new packet type.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-01-25 16:33:46 +02:00
Petr Machata
ef6aadcc76 net: sched: Make TBF Qdisc offloadable
Invoke ndo_setup_tc as appropriate to signal init / replacement, destroying
and dumping of TBF Qdisc.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:56:31 +01:00
Petr Machata
c207015274 net: sched: sch_tbf: Don't overwrite backlog before dumping
In 2011, in commit b0460e4484 ("sch_tbf: report backlog information"),
TBF started copying backlog depth from the child Qdisc before dumping, with
the motivation that the backlog was otherwise not visible in "tc -s qdisc
show".

Later, in 2016, in commit 8d5958f424 ("sch_tbf: update backlog as well"),
TBF got a full-blown backlog tracking. However it kept copying the child's
backlog over before dumping.

That line is now unnecessary, so remove it.

As shown in the following example, backlog is still reported correctly:

    # tc -s qdisc show dev veth0 invisible
    qdisc tbf 1: root refcnt 2 rate 1Mbit burst 128Kb lat 82.8s
     Sent 505475370 bytes 406985 pkt (dropped 0, overlimits 812544 requeues 0)
     backlog 81972b 66p requeues 0
    qdisc bfifo 0: parent 1:1 limit 10Mb
     Sent 505475370 bytes 406985 pkt (dropped 0, overlimits 0 requeues 0)
     backlog 81972b 66p requeues 0

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:56:30 +01:00
Moshe Shemesh
6ec8b6cd79 devlink: Add health recover notifications on devlink flows
Devlink health recover notifications were added only on driver direct
updates of health_state through devlink_health_reporter_state_update().
Add notifications on updates of health_state by devlink flows of report
and recover.

Moved functions devlink_nl_health_reporter_fill() and
devlink_recover_notify() to avoid forward declaration.

Fixes: 97ff3bd37f ("devlink: add devink notification when reporter update health state")
Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:34:42 +01:00
Andy Shevchenko
79ac522402 net: atm: use %*ph to print small buffer
Use %*ph format to print small buffer as hex string.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:14:40 +01:00
Ingo Molnar
f8a4bb6bfa Merge branch 'for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu into core/rcu
Pull RCU updates from Paul E. McKenney:

 - Expedited grace-period updates
 - kfree_rcu() updates
 - RCU list updates
 - Preemptible RCU updates
 - Torture-test updates
 - Miscellaneous fixes
 - Documentation updates

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-25 10:05:23 +01:00
Mat Martineau
edc7e4898d mptcp: Fix code formatting
checkpatch.pl had a few complaints in the last set of MPTCP patches:

ERROR: code indent should use tabs where possible
+^I         subflow, sk->sk_family, icsk->icsk_af_ops, target, mapped);$

CHECK: Comparison to NULL could be written "!new_ctx"
+	if (new_ctx == NULL) {

ERROR: "foo * bar" should be "foo *bar"
+static const struct proto_ops * tcp_proto_ops(struct sock *sk)

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 08:14:39 +01:00
Florian Westphal
e42f1ac626 mptcp: do not inherit inet proto ops
We need to initialise the struct ourselves, else we expose tcp-specific
callbacks such as tcp_splice_read which will then trigger splat because
the socket is an mptcp one:

BUG: KASAN: slab-out-of-bounds in tcp_mstamp_refresh+0x80/0xa0 net/ipv4/tcp_output.c:57
Write of size 8 at addr ffff888116aa21d0 by task syz-executor.0/5478

CPU: 1 PID: 5478 Comm: syz-executor.0 Not tainted 5.5.0-rc6 #3
Call Trace:
 tcp_mstamp_refresh+0x80/0xa0 net/ipv4/tcp_output.c:57
 tcp_rcv_space_adjust+0x72/0x7f0 net/ipv4/tcp_input.c:612
 tcp_read_sock+0x622/0x990 net/ipv4/tcp.c:1674
 tcp_splice_read+0x20b/0xb40 net/ipv4/tcp.c:791
 do_splice+0x1259/0x1560 fs/splice.c:1205

To prevent build error with ipv6, add the recv/sendmsg function
declaration to ipv6.h.  The functions are already accessible "thanks"
to retpoline related work, but they are currently only made visible
by socket.c specific INDIRECT_CALLABLE macros.

Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 08:14:39 +01:00
Praveen Chaudhary
189c9b1e94 net: Fix skb->csum update in inet_proto_csum_replace16().
skb->csum is updated incorrectly, when manipulation for
NF_NAT_MANIP_SRC\DST is done on IPV6 packet.

Fix:
There is no need to update skb->csum in inet_proto_csum_replace16(),
because update in two fields a.) IPv6 src/dst address and b.) L4 header
checksum cancels each other for skb->csum calculation. Whereas
inet_proto_csum_replace4 function needs to update skb->csum, because
update in 3 fields a.) IPv4 src/dst address, b.) IPv4 Header checksum
and c.) L4 header checksum results in same diff as L4 Header checksum
for skb->csum calculation.

[ pablo@netfilter.org: a few comestic documentation edits ]
Signed-off-by: Praveen Chaudhary <pchaudhary@linkedin.com>
Signed-off-by: Zhenggen Xu <zxu@linkedin.com>
Signed-off-by: Andy Stracner <astracner@linkedin.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24 20:54:30 +01:00
Pablo Neira Ayuso
eb014de4fd netfilter: nf_tables: autoload modules from the abort path
This patch introduces a list of pending module requests. This new module
list is composed of nft_module_request objects that contain the module
name and one status field that tells if the module has been already
loaded (the 'done' field).

In the first pass, from the preparation phase, the netlink command finds
that a module is missing on this list. Then, a module request is
allocated and added to this list and nft_request_module() returns
-EAGAIN. This triggers the abort path with the autoload parameter set on
from nfnetlink, request_module() is called and the module request enters
the 'done' state. Since the mutex is released when loading modules from
the abort phase, the module list is zapped so this is iteration occurs
over a local list. Therefore, the request_module() calls happen when
object lists are in consistent state (after fulling aborting the
transaction) and the commit list is empty.

On the second pass, the netlink command will find that it already tried
to load the module, so it does not request it again and
nft_request_module() returns 0. Then, there is a look up to find the
object that the command was missing. If the module was successfully
loaded, the command proceeds normally since it finds the missing object
in place, otherwise -ENOENT is reported to userspace.

This patch also updates nfnetlink to include the reason to enter the
abort phase, which is required for this new autoload module rationale.

Fixes: ec7470b834 ("netfilter: nf_tables: store transaction list locally while requesting module")
Reported-by: syzbot+29125d208b3dae9a7019@syzkaller.appspotmail.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24 20:54:29 +01:00
Pablo Neira Ayuso
826035498e netfilter: nf_tables: add __nft_chain_type_get()
This new helper function validates that unknown family and chain type
coming from userspace do not trigger an out-of-bound array access. Bail
out in case __nft_chain_type_get() returns NULL from
nft_chain_parse_hook().

Fixes: 9370761c56 ("netfilter: nf_tables: convert built-in tables/chains to chain types")
Reported-by: syzbot+156a04714799b1d480bc@syzkaller.appspotmail.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24 20:54:28 +01:00
wenxu
c83de17dd6 netfilter: nf_tables_offload: fix check the chain offload flag
In the nft_indr_block_cb the chain should check the flag with
NFT_CHAIN_HW_OFFLOAD.

Fixes: 9a32669fec ("netfilter: nf_tables_offload: support indr block call")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24 20:54:11 +01:00
Jiri Wiesner
ab658b9fa7 netfilter: conntrack: sctp: use distinct states for new SCTP connections
The netlink notifications triggered by the INIT and INIT_ACK chunks
for a tracked SCTP association do not include protocol information
for the corresponding connection - SCTP state and verification tags
for the original and reply direction are missing. Since the connection
tracking implementation allows user space programs to receive
notifications about a connection and then create a new connection
based on the values received in a notification, it makes sense that
INIT and INIT_ACK notifications should contain the SCTP state
and verification tags available at the time when a notification
is sent. The missing verification tags cause a newly created
netfilter connection to fail to verify the tags of SCTP packets
when this connection has been created from the values previously
received in an INIT or INIT_ACK notification.

A PROTOINFO event is cached in sctp_packet() when the state
of a connection changes. The CLOSED and COOKIE_WAIT state will
be used for connections that have seen an INIT and INIT_ACK chunk,
respectively. The distinct states will cause a connection state
change in sctp_packet().

Signed-off-by: Jiri Wiesner <jwiesner@suse.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24 18:26:53 +01:00
Paolo Abeni
8ab183deb2 mptcp: cope with later TCP fallback
With MPTCP v1, passive connections can fallback to TCP after the
subflow becomes established:

syn + MP_CAPABLE ->
               <- syn, ack + MP_CAPABLE

ack, seq = 3    ->
        // OoO packet is accepted because in-sequence
        // passive socket is created, is in ESTABLISHED
	// status and tentatively as MP_CAPABLE

ack, seq = 2     ->
        // no MP_CAPABLE opt, subflow should fallback to TCP

We can't use the 'subflow' socket fallback, as we don't have
it available for passive connection.

Instead, when the fallback is detected, replace the mptcp
socket with the underlying TCP subflow. Beyond covering
the above scenario, it makes a TCP fallback socket as efficient
as plain TCP ones.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Christoph Paasch
d22f4988ff mptcp: process MP_CAPABLE data option
This patch implements the handling of MP_CAPABLE + data option, as per
RFC 6824 bis / RFC 8684: MPTCP v1.

On the server side we can receive the remote key after that the connection
is established. We need to explicitly track the 'missing remote key'
status and avoid emitting a mptcp ack until we get such info.

When a late/retransmitted/OoO pkt carrying MP_CAPABLE[+data] option
is received, we have to propagate the mptcp seq number info to
the msk socket. To avoid ABBA locking issue, explicitly check for
that in recvmsg(), where we own msk and subflow sock locks.

The above also means that an established mp_capable subflow - still
waiting for the remote key - can be 'downgraded' to plain TCP.

Such change could potentially block a reader waiting for new data
forever - as they hook to msk, while later wake-up after the downgrade
will be on subflow only.

The above issue is not handled here, we likely have to get rid of
msk->fallback to handle that cleanly.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Christoph Paasch
cc7972ea19 mptcp: parse and emit MP_CAPABLE option according to v1 spec
This implements MP_CAPABLE options parsing and writing according
to RFC 6824 bis / RFC 8684: MPTCP v1.

Local key is sent on syn/ack, and both keys are sent on 3rd ack.
MP_CAPABLE messages len are updated accordingly. We need the skbuff to
correctly emit the above, so we push the skbuff struct as an argument
all the way from tcp code to the relevant mptcp callbacks.

When processing incoming MP_CAPABLE + data, build a full blown DSS-like
map info, to simplify later processing.  On child socket creation, we
need to record the remote key, if available.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Paolo Abeni
65492c5a6a mptcp: move from sha1 (v0) to sha256 (v1)
For simplicity's sake use directly sha256 primitives (and pull them
as a required build dep).
Add optional, boot-time self-tests for the hmac function.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Matthieu Baerts
784325e9f0 mptcp: new sysctl to control the activation per NS
New MPTCP sockets will return -ENOPROTOOPT if MPTCP support is disabled
for the current net namespace.

We are providing here a way to control access to the feature for those
that need to turn it on or off.

The value of this new sysctl can be different per namespace. We can then
restrict the usage of MPTCP to the selected NS. In case of serious
issues with MPTCP, administrators can now easily turn MPTCP off.

Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Paolo Abeni
57040755a3 mptcp: allow collapsing consecutive sendpages on the same substream
If the current sendmsg() lands on the same subflow we used last, we
can try to collapse the data.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Paolo Abeni
7a6a6cbc3e mptcp: recvmsg() can drain data from multiple subflows
With the previous patch in place, the msk can detect which subflow
has the current map with a simple walk, let's update the main
loop to always select the 'current' subflow. The exit conditions now
closely mirror tcp_recvmsg() to get expected timeout and signal
behavior.

Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Florian Westphal
1891c4a076 mptcp: add subflow write space signalling and mptcp_poll
Add new SEND_SPACE flag to indicate that a subflow has enough space to
accept more data for transmission.

It gets cleared at the end of mptcp_sendmsg() in case ssk has run
below the free watermark.

It is (re-set) from the wspace callback.

This allows us to use msk->flags to determine the poll mask.

Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Mat Martineau
648ef4b886 mptcp: Implement MPTCP receive path
Parses incoming DSS options and populates outgoing MPTCP ACK
fields. MPTCP fields are parsed from the TCP option header and placed in
an skb extension, allowing the upper MPTCP layer to access MPTCP
options after the skb has gone through the TCP stack.

The subflow implements its own data_ready() ops, which ensures that the
pending data is in sequence - according to MPTCP seq number - dropping
out-of-seq skbs. The DATA_READY bit flag is set if this is the case.
This allows the MPTCP socket layer to determine if more data is
available without having to consult the individual subflows.

It additionally validates the current mapping and propagates EoF events
to the connection socket.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Mat Martineau
6d0060f600 mptcp: Write MPTCP DSS headers to outgoing data packets
Per-packet metadata required to write the MPTCP DSS option is written to
the skb_ext area. One write to the socket may contain more than one
packet of data, which is copied to page fragments and mapped in to MPTCP
DSS segments with size determined by the available page fragments and
the maximum mapping length allowed by the MPTCP specification. If
do_tcp_sendpages() splits a DSS segment in to multiple skbs, that's ok -
the later skbs can either have duplicated DSS mapping information or
none at all, and the receiver can handle that.

The current implementation uses the subflow frag cache and tcp
sendpages to avoid excessive code duplication. More work is required to
ensure that it works correctly under memory pressure and to support
MPTCP-level retransmissions.

The MPTCP DSS checksum is not yet implemented.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Peter Krystad
717e79c867 mptcp: Add setsockopt()/getsockopt() socket operations
set/getsockopt behaviour with multiple subflows is undefined.
Therefore, for now, we return -EOPNOTSUPP unless we're in fallback mode.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Peter Krystad
214984901a mptcp: Add shutdown() socket operation
Call shutdown on all subflows in use on the given socket, or on the
fallback socket.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Peter Krystad
79c0949e9a mptcp: Add key generation and token tree
Generate the local keys, IDSN, and token when creating a new socket.
Introduce the token tree to track all tokens in use using a radix tree
with the MPTCP token itself as the index.

Override the rebuild_header callback in inet_connection_sock_af_ops for
creating the local key on a new outgoing connection.

Override the init_req callback of tcp_request_sock_ops for creating the
local key on a new incoming connection.

Will be used to obtain the MPTCP parent socket to handle incoming joins.

Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Peter Krystad
cf7da0d66c mptcp: Create SUBFLOW socket for incoming connections
Add subflow_request_sock type that extends tcp_request_sock
and add an is_mptcp flag to tcp_request_sock distinguish them.

Override the listen() and accept() methods of the MPTCP
socket proto_ops so they may act on the subflow socket.

Override the conn_request() and syn_recv_sock() handlers
in the inet_connection_sock to handle incoming MPTCP
SYNs and the ACK to the response SYN.

Add handling in tcp_output.c to add MP_CAPABLE to an outgoing
SYN-ACK response for a subflow_request_sock.

Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Peter Krystad
cec37a6e41 mptcp: Handle MP_CAPABLE options for outgoing connections
Add hooks to tcp_output.c to add MP_CAPABLE to an outgoing SYN request,
to capture the MP_CAPABLE in the received SYN-ACK, to add MP_CAPABLE to
the final ACK of the three-way handshake.

Use the .sk_rx_dst_set() handler in the subflow proto to capture when the
responding SYN-ACK is received and notify the MPTCP connection layer.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Peter Krystad
2303f994b3 mptcp: Associate MPTCP context with TCP socket
Use ULP to associate a subflow_context structure with each TCP subflow
socket. Creating these sockets requires new bind and connect functions
to make sure ULP is set up immediately when the subflow sockets are
created.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Peter Krystad
eda7acddf8 mptcp: Handle MPTCP TCP options
Add hooks to parse and format the MP_CAPABLE option.

This option is handled according to MPTCP version 0 (RFC6824).
MPTCP version 1 MP_CAPABLE (RFC6824bis/RFC8684) will be added later in
coordination with related code changes.

Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Mat Martineau
f870fa0b57 mptcp: Add MPTCP socket stubs
Implements the infrastructure for MPTCP sockets.

MPTCP sockets open one in-kernel TCP socket per subflow. These subflow
sockets are only managed by the MPTCP socket that owns them and are not
visible from userspace. This commit allows a userspace program to open
an MPTCP socket with:

  sock = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

The resulting socket is simply a wrapper around a single regular TCP
socket, without any of the MPTCP protocol implemented over the wire.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Nikolay Aleksandrov
a580c76d53 net: bridge: vlan: add per-vlan state
The first per-vlan option added is state, it is needed for EVPN and for
per-vlan STP. The state allows to control the forwarding on per-vlan
basis. The vlan state is considered only if the port state is forwarding
in order to avoid conflicts and be consistent. br_allowed_egress is
called only when the state is forwarding, but the ingress case is a bit
more complicated due to the fact that we may have the transition between
port:BR_STATE_FORWARDING -> vlan:BR_STATE_LEARNING which should still
allow the bridge to learn from the packet after vlan filtering and it will
be dropped after that. Also to optimize the pvid state check we keep a
copy in the vlan group to avoid one lookup. The state members are
modified with *_ONCE() to annotate the lockless access.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 12:58:14 +01:00
Nikolay Aleksandrov
a5d29ae226 net: bridge: vlan: add basic option setting support
This patch adds support for option modification of single vlans and
ranges. It allows to only modify options, i.e. skip create/delete by
using the BRIDGE_VLAN_INFO_ONLY_OPTS flag. When working with a range
option changes we try to pack the notifications as much as possible.

v2: do full port (all vlans) notification only when creating/deleting
    vlans for compatibility, rework the range detection when changing
    options, add more verbose extack errors and check if a vlan should
    be used (br_vlan_should_use checks)

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 12:58:14 +01:00
Nikolay Aleksandrov
7a53e718c5 net: bridge: vlan: add basic option dumping support
We'll be dumping the options for the whole range if they're equal. The
first range vlan will be used to extract the options. The commit doesn't
change anything yet it just adds the skeleton for the support. The dump
will happen when the first option is added.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 12:58:14 +01:00
Nikolay Aleksandrov
ac0e932d0e net: bridge: check port state before br_allowed_egress
If we make sure that br_allowed_egress is called only when we have
BR_STATE_FORWARDING state then we can avoid a test later when we add
per-vlan state.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 12:58:14 +01:00
Vasily Averin
4fc427e051 ipv6_route_seq_next should increase position index
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.

https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 11:42:18 +01:00
Vasily Averin
a3ea86739f rt_cpu_seq_next should increase position index
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.

https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 11:42:18 +01:00
Vasily Averin
1e3f9f073c neigh_stat_seq_next() should increase position index
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.

https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 11:42:18 +01:00
Vasily Averin
8bf7092021 vcc_seq_next should increase position index
if seq_file .next fuction does not change position index,
read after some lseek can generate unexpected output.

https://bugzilla.kernel.org/show_bug.cgi?id=206283
Signed-off-by: Vasily Averin <vvs@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 11:42:18 +01:00
Eric Dumazet
2bec445f9b tcp: do not leave dangling pointers in tp->highest_sack
Latest commit 853697504d ("tcp: Fix highest_sack and highest_sack_seq")
apparently allowed syzbot to trigger various crashes in TCP stack [1]

I believe this commit only made things easier for syzbot to find
its way into triggering use-after-frees. But really the bugs
could lead to bad TCP behavior or even plain crashes even for
non malicious peers.

I have audited all calls to tcp_rtx_queue_unlink() and
tcp_rtx_queue_unlink_and_free() and made sure tp->highest_sack would be updated
if we are removing from rtx queue the skb that tp->highest_sack points to.

These updates were missing in three locations :

1) tcp_clean_rtx_queue() [This one seems quite serious,
                          I have no idea why this was not caught earlier]

2) tcp_rtx_queue_purge() [Probably not a big deal for normal operations]

3) tcp_send_synack()     [Probably not a big deal for normal operations]

[1]
BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
BUG: KASAN: use-after-free in tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
BUG: KASAN: use-after-free in tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
Read of size 4 at addr ffff8880a488d068 by task ksoftirqd/1/16

CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.5.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x197/0x210 lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
 __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
 kasan_report+0x12/0x20 mm/kasan/common.c:639
 __asan_report_load4_noabort+0x14/0x20 mm/kasan/generic_report.c:134
 tcp_highest_sack_seq include/net/tcp.h:1864 [inline]
 tcp_highest_sack_seq include/net/tcp.h:1856 [inline]
 tcp_check_sack_reordering+0x33c/0x3a0 net/ipv4/tcp_input.c:891
 tcp_try_undo_partial net/ipv4/tcp_input.c:2730 [inline]
 tcp_fastretrans_alert+0xf74/0x23f0 net/ipv4/tcp_input.c:2847
 tcp_ack+0x2577/0x5bf0 net/ipv4/tcp_input.c:3710
 tcp_rcv_established+0x6dd/0x1e90 net/ipv4/tcp_input.c:5706
 tcp_v4_do_rcv+0x619/0x8d0 net/ipv4/tcp_ipv4.c:1619
 tcp_v4_rcv+0x307f/0x3b40 net/ipv4/tcp_ipv4.c:2001
 ip_protocol_deliver_rcu+0x5a/0x880 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x23b/0x380 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:307 [inline]
 NF_HOOK include/linux/netfilter.h:301 [inline]
 ip_local_deliver+0x1e9/0x520 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:442 [inline]
 ip_rcv_finish+0x1db/0x2f0 net/ipv4/ip_input.c:428
 NF_HOOK include/linux/netfilter.h:307 [inline]
 NF_HOOK include/linux/netfilter.h:301 [inline]
 ip_rcv+0xe8/0x3f0 net/ipv4/ip_input.c:538
 __netif_receive_skb_one_core+0x113/0x1a0 net/core/dev.c:5148
 __netif_receive_skb+0x2c/0x1d0 net/core/dev.c:5262
 process_backlog+0x206/0x750 net/core/dev.c:6093
 napi_poll net/core/dev.c:6530 [inline]
 net_rx_action+0x508/0x1120 net/core/dev.c:6598
 __do_softirq+0x262/0x98c kernel/softirq.c:292
 run_ksoftirqd kernel/softirq.c:603 [inline]
 run_ksoftirqd+0x8e/0x110 kernel/softirq.c:595
 smpboot_thread_fn+0x6a3/0xa40 kernel/smpboot.c:165
 kthread+0x361/0x430 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

Allocated by task 10091:
 save_stack+0x23/0x90 mm/kasan/common.c:72
 set_track mm/kasan/common.c:80 [inline]
 __kasan_kmalloc mm/kasan/common.c:513 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:486
 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:521
 slab_post_alloc_hook mm/slab.h:584 [inline]
 slab_alloc_node mm/slab.c:3263 [inline]
 kmem_cache_alloc_node+0x138/0x740 mm/slab.c:3575
 __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:198
 alloc_skb_fclone include/linux/skbuff.h:1099 [inline]
 sk_stream_alloc_skb net/ipv4/tcp.c:875 [inline]
 sk_stream_alloc_skb+0x113/0xc90 net/ipv4/tcp.c:852
 tcp_sendmsg_locked+0xcf9/0x3470 net/ipv4/tcp.c:1282
 tcp_sendmsg+0x30/0x50 net/ipv4/tcp.c:1432
 inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:672
 __sys_sendto+0x262/0x380 net/socket.c:1998
 __do_sys_sendto net/socket.c:2010 [inline]
 __se_sys_sendto net/socket.c:2006 [inline]
 __x64_sys_sendto+0xe1/0x1a0 net/socket.c:2006
 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 10095:
 save_stack+0x23/0x90 mm/kasan/common.c:72
 set_track mm/kasan/common.c:80 [inline]
 kasan_set_free_info mm/kasan/common.c:335 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:474
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:483
 __cache_free mm/slab.c:3426 [inline]
 kmem_cache_free+0x86/0x320 mm/slab.c:3694
 kfree_skbmem+0x178/0x1c0 net/core/skbuff.c:645
 __kfree_skb+0x1e/0x30 net/core/skbuff.c:681
 sk_eat_skb include/net/sock.h:2453 [inline]
 tcp_recvmsg+0x1252/0x2930 net/ipv4/tcp.c:2166
 inet_recvmsg+0x136/0x610 net/ipv4/af_inet.c:838
 sock_recvmsg_nosec net/socket.c:886 [inline]
 sock_recvmsg net/socket.c:904 [inline]
 sock_recvmsg+0xce/0x110 net/socket.c:900
 __sys_recvfrom+0x1ff/0x350 net/socket.c:2055
 __do_sys_recvfrom net/socket.c:2073 [inline]
 __se_sys_recvfrom net/socket.c:2069 [inline]
 __x64_sys_recvfrom+0xe1/0x1a0 net/socket.c:2069
 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff8880a488d040
 which belongs to the cache skbuff_fclone_cache of size 456
The buggy address is located 40 bytes inside of
 456-byte region [ffff8880a488d040, ffff8880a488d208)
The buggy address belongs to the page:
page:ffffea0002922340 refcount:1 mapcount:0 mapping:ffff88821b057000 index:0x0
raw: 00fffe0000000200 ffffea00022a5788 ffffea0002624a48 ffff88821b057000
raw: 0000000000000000 ffff8880a488d040 0000000100000006 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8880a488cf00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 ffff8880a488cf80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff8880a488d000: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                          ^
 ffff8880a488d080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8880a488d100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 853697504d ("tcp: Fix highest_sack and highest_sack_seq")
Fixes: 50895b9de1 ("tcp: highest_sack fix")
Fixes: 737ff31456 ("tcp: use sequence distance to detect reordering")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Cambda Zhu <cambda@linux.alibaba.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 09:06:48 +01:00
Colin Ian King
4d299f1833 net/rose: fix spelling mistake "to" -> "too"
There is a spelling mistake in a printk message. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 08:12:06 +01:00
Colin Ian King
43d88774d1 caif_usb: fix spelling mistake "to" -> "too"
There is a spelling mistake in a pr_warn message. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 08:12:06 +01:00
Colin Ian King
971485a0d6 ipvs: fix spelling mistake "to" -> "too"
There is a spelling mistake in a IP_VS_ERR_RL message. Fix it.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 08:12:06 +01:00
Cong Wang
61678d28d4 net_sched: fix datalen for ematch
syzbot reported an out-of-bound access in em_nbyte. As initially
analyzed by Eric, this is because em_nbyte sets its own em->datalen
in em_nbyte_change() other than the one specified by user, but this
value gets overwritten later by its caller tcf_em_validate().
We should leave em->datalen untouched to respect their choices.

I audit all the in-tree ematch users, all of those implement
->change() set em->datalen, so we can just avoid setting it twice
in this case.

Reported-and-tested-by: syzbot+5af9a90dad568aa9f611@syzkaller.appspotmail.com
Reported-by: syzbot+2f07903a5b05e7f36410@syzkaller.appspotmail.com
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 21:34:42 +01:00
Kristian Evensen
bb48eb9b12 fou: Fix IPv6 netlink policy
When submitting v2 of "fou: Support binding FoU socket" (1713cb37bf),
I accidentally sent the wrong version of the patch and one fix was
missing. In the initial version of the patch, as well as the version 2
that I submitted, I incorrectly used ".type" for the two V6-attributes.
The correct is to use ".len".

Reported-by: Dmitry Vyukov <dvyukov@google.com>
Fixes: 1713cb37bf ("fou: Support binding FoU socket")
Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 14:32:52 +01:00
Mohit P. Tahiliani
ec97ecf1eb net: sched: add Flow Queue PIE packet scheduler
Principles:
  - Packets are classified on flows.
  - This is a Stochastic model (as we use a hash, several flows might
                                be hashed to the same slot)
  - Each flow has a PIE managed queue.
  - Flows are linked onto two (Round Robin) lists,
    so that new flows have priority on old ones.
  - For a given flow, packets are not reordered.
  - Drops during enqueue only.
  - ECN capability is off by default.
  - ECN threshold (if ECN is enabled) is at 10% by default.
  - Uses timestamps to calculate queue delay by default.

Usage:
tc qdisc ... fq_pie [ limit PACKETS ] [ flows NUMBER ]
                    [ target TIME ] [ tupdate TIME ]
                    [ alpha NUMBER ] [ beta NUMBER ]
                    [ quantum BYTES ] [ memory_limit BYTES ]
                    [ ecnprob PERCENTAGE ] [ [no]ecn ]
                    [ [no]bytemode ] [ [no_]dq_rate_estimator ]

defaults:
  limit: 10240 packets, flows: 1024
  target: 15 ms, tupdate: 15 ms (in jiffies)
  alpha: 1/8, beta : 5/4
  quantum: device MTU, memory_limit: 32 Mb
  ecnprob: 10%, ecn: off
  bytemode: off, dq_rate_estimator: off

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Sachin D. Patil <sdp.sachin@gmail.com>
Signed-off-by: V. Saicharan <vsaicharan1998@gmail.com>
Signed-off-by: Mohit Bhasi <mohitbhasi1998@gmail.com>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 11:38:31 +01:00
Mohit P. Tahiliani
5205ea00cd net: sched: pie: export symbols to be reused by FQ-PIE
This patch makes the drop_early(), calculate_probability() and
pie_process_dequeue() functions generic enough to be used by
both PIE and FQ-PIE (to be added in a future commit). The major
change here is in the way the functions take in arguments. This
patch exports these functions and makes FQ-PIE dependent on
sch_pie.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 11:38:31 +01:00
Mohit P. Tahiliani
00ea2fb727 net: sched: pie: fix alignment in struct instances
Make the alignment in the initialization of the struct instances
consistent in the file.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 11:38:31 +01:00
Mohit P. Tahiliani
55f780c4a6 net: sched: pie: fix commenting
Fix punctuation and logical mistakes in the comments. The
logical mistake was that "dequeue_rate" is no longer the default
way to calculate queuing delay and is not needed. The default
way to calculate queue delay was changed in commit cec2975f2b
("net: sched: pie: enable timestamp based delay calculation").

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 11:38:31 +01:00
Mohit P. Tahiliani
2dfb1952a9 pie: rearrange structure members and their initializations
Rearrange the members of the structure such that closely
referenced members appear together and/or fit in the same
cacheline. Also, change the order of their initializations to
match the order in which they appear in the structure.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 11:38:31 +01:00
Mohit P. Tahiliani
84bf557fb0 net: sched: pie: move common code to pie.h
This patch moves macros, structures and small functions common
to PIE and FQ-PIE (to be added in a future commit) from the file
net/sched/sch_pie.c to the header file include/net/pie.h.
All the moved functions are made inline.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 11:38:30 +01:00
Eric Dumazet
d836f5c69d net: rtnetlink: validate IFLA_MTU attribute in rtnl_create_link()
rtnl_create_link() needs to apply dev->min_mtu and dev->max_mtu
checks that we apply in do_setlink()

Otherwise malicious users can crash the kernel, for example after
an integer overflow :

BUG: KASAN: use-after-free in memset include/linux/string.h:365 [inline]
BUG: KASAN: use-after-free in __alloc_skb+0x37b/0x5e0 net/core/skbuff.c:238
Write of size 32 at addr ffff88819f20b9c0 by task swapper/0/0

CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.5.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x197/0x210 lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0xd4/0x30b mm/kasan/report.c:374
 __kasan_report.cold+0x1b/0x41 mm/kasan/report.c:506
 kasan_report+0x12/0x20 mm/kasan/common.c:639
 check_memory_region_inline mm/kasan/generic.c:185 [inline]
 check_memory_region+0x134/0x1a0 mm/kasan/generic.c:192
 memset+0x24/0x40 mm/kasan/common.c:108
 memset include/linux/string.h:365 [inline]
 __alloc_skb+0x37b/0x5e0 net/core/skbuff.c:238
 alloc_skb include/linux/skbuff.h:1049 [inline]
 alloc_skb_with_frags+0x93/0x590 net/core/skbuff.c:5664
 sock_alloc_send_pskb+0x7ad/0x920 net/core/sock.c:2242
 sock_alloc_send_skb+0x32/0x40 net/core/sock.c:2259
 mld_newpack+0x1d7/0x7f0 net/ipv6/mcast.c:1609
 add_grhead.isra.0+0x299/0x370 net/ipv6/mcast.c:1713
 add_grec+0x7db/0x10b0 net/ipv6/mcast.c:1844
 mld_send_cr net/ipv6/mcast.c:1970 [inline]
 mld_ifc_timer_expire+0x3d3/0x950 net/ipv6/mcast.c:2477
 call_timer_fn+0x1ac/0x780 kernel/time/timer.c:1404
 expire_timers kernel/time/timer.c:1449 [inline]
 __run_timers kernel/time/timer.c:1773 [inline]
 __run_timers kernel/time/timer.c:1740 [inline]
 run_timer_softirq+0x6c3/0x1790 kernel/time/timer.c:1786
 __do_softirq+0x262/0x98c kernel/softirq.c:292
 invoke_softirq kernel/softirq.c:373 [inline]
 irq_exit+0x19b/0x1e0 kernel/softirq.c:413
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 smp_apic_timer_interrupt+0x1a3/0x610 arch/x86/kernel/apic/apic.c:1137
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:829
 </IRQ>
RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61
Code: 98 6b ea f9 eb 8a cc cc cc cc cc cc e9 07 00 00 00 0f 00 2d 44 1c 60 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d 34 1c 60 00 fb f4 <c3> cc 55 48 89 e5 41 57 41 56 41 55 41 54 53 e8 4e 5d 9a f9 e8 79
RSP: 0018:ffffffff89807ce8 EFLAGS: 00000286 ORIG_RAX: ffffffffffffff13
RAX: 1ffffffff13266ae RBX: ffffffff8987a1c0 RCX: 0000000000000000
RDX: dffffc0000000000 RSI: 0000000000000006 RDI: ffffffff8987aa54
RBP: ffffffff89807d18 R08: ffffffff8987a1c0 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: dffffc0000000000
R13: ffffffff8a799980 R14: 0000000000000000 R15: 0000000000000000
 arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:690
 default_idle_call+0x84/0xb0 kernel/sched/idle.c:94
 cpuidle_idle_call kernel/sched/idle.c:154 [inline]
 do_idle+0x3c8/0x6e0 kernel/sched/idle.c:269
 cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:361
 rest_init+0x23b/0x371 init/main.c:451
 arch_call_rest_init+0xe/0x1b
 start_kernel+0x904/0x943 init/main.c:784
 x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:490
 x86_64_start_kernel+0x77/0x7b arch/x86/kernel/head64.c:471
 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:242

The buggy address belongs to the page:
page:ffffea00067c82c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0
raw: 057ffe0000000000 ffffea00067c82c8 ffffea00067c82c8 0000000000000000
raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88819f20b880: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff88819f20b900: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
>ffff88819f20b980: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
                                           ^
 ffff88819f20ba00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff
 ffff88819f20ba80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff

Fixes: 61e84623ac ("net: centralize net_device min/max MTU checking")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 11:14:21 +01:00
David S. Miller
954b3c4397 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-01-22

The following pull-request contains BPF updates for your *net-next* tree.

We've added 92 non-merge commits during the last 16 day(s) which contain
a total of 320 files changed, 7532 insertions(+), 1448 deletions(-).

The main changes are:

1) function by function verification and program extensions from Alexei.

2) massive cleanup of selftests/bpf from Toke and Andrii.

3) batched bpf map operations from Brian and Yonghong.

4) tcp congestion control in bpf from Martin.

5) bulking for non-map xdp_redirect form Toke.

6) bpf_send_signal_thread helper from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 08:10:16 +01:00
Martin KaFai Lau
5576b991e9 bpf: Add BPF_FUNC_jiffies64
This patch adds a helper to read the 64bit jiffies.  It will be used
in a later patch to implement the bpf_cubic.c.

The helper is inlined for jit_requested and 64 BITS_PER_LONG
as the map_gen_lookup().  Other cases could be considered together
with map_gen_lookup() if needed.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200122233646.903260-1-kafai@fb.com
2020-01-22 16:30:10 -08:00
Trond Myklebust
809fe3c533 sunrpc: clean up cache entry add/remove from hashtable
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-01-22 16:25:41 -05:00
Trond Myklebust
1d82163714 sunrpc: Fix potential leaks in sunrpc_cache_unhash()
When we unhash the cache entry, we need to handle any pending upcalls
by calling cache_fresh_unlocked().

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2020-01-22 16:25:41 -05:00
Alain Michaud
6613babaf6 Bluetooth: fix appearance typo in mgmt.c
This change addresses a typo in the set_appearance handler.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-22 21:23:16 +01:00
William Dauchy
5311a69aac net, ip6_tunnel: fix namespaces move
in the same manner as commit d0f4185160 ("net, ip_tunnel: fix
namespaces move"), fix namespace moving as it was broken since commit
8d79266bc4 ("ip6_tunnel: add collect_md mode to IPv6 tunnel"), but for
ipv6 this time; there is no reason to keep it for ip6_tunnel.

Fixes: 8d79266bc4 ("ip6_tunnel: add collect_md mode to IPv6 tunnel")
Signed-off-by: William Dauchy <w.dauchy@criteo.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-22 21:13:40 +01:00
Eric Dumazet
36d79af7fb net_sched: use validated TCA_KIND attribute in tc_new_tfilter()
sysbot found another issue in tc_new_tfilter().
We probably should use @name which contains the sanitized
version of TCA_KIND.

BUG: KMSAN: uninit-value in string_nocheck lib/vsprintf.c:608 [inline]
BUG: KMSAN: uninit-value in string+0x522/0x690 lib/vsprintf.c:689
CPU: 1 PID: 10753 Comm: syz-executor.1 Not tainted 5.5.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x220 lib/dump_stack.c:118
 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:118
 __msan_warning+0x58/0xa0 mm/kmsan/kmsan_instr.c:215
 string_nocheck lib/vsprintf.c:608 [inline]
 string+0x522/0x690 lib/vsprintf.c:689
 vsnprintf+0x207d/0x31b0 lib/vsprintf.c:2574
 __request_module+0x2ad/0x11c0 kernel/kmod.c:143
 tcf_proto_lookup_ops+0x241/0x720 net/sched/cls_api.c:139
 tcf_proto_create net/sched/cls_api.c:262 [inline]
 tc_new_tfilter+0x2a4e/0x5010 net/sched/cls_api.c:2058
 rtnetlink_rcv_msg+0xcb7/0x1570 net/core/rtnetlink.c:5415
 netlink_rcv_skb+0x451/0x650 net/netlink/af_netlink.c:2477
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5442
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0xf9e/0x1100 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x1248/0x14d0 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:639 [inline]
 sock_sendmsg net/socket.c:659 [inline]
 ____sys_sendmsg+0x12b6/0x1350 net/socket.c:2330
 ___sys_sendmsg net/socket.c:2384 [inline]
 __sys_sendmsg+0x451/0x5f0 net/socket.c:2417
 __do_sys_sendmsg net/socket.c:2426 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2424
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2424
 do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45b349
Code: ad b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b6 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f88b3948c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00007f88b39496d4 RCX: 000000000045b349
RDX: 0000000000000000 RSI: 00000000200001c0 RDI: 0000000000000003
RBP: 000000000075bfc8 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000ffffffff
R13: 000000000000099f R14: 00000000004cb163 R15: 000000000075bfd4

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:144 [inline]
 kmsan_internal_poison_shadow+0x66/0xd0 mm/kmsan/kmsan.c:127
 kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:82
 slab_alloc_node mm/slub.c:2774 [inline]
 __kmalloc_node_track_caller+0xb40/0x1200 mm/slub.c:4382
 __kmalloc_reserve net/core/skbuff.c:141 [inline]
 __alloc_skb+0x2fd/0xac0 net/core/skbuff.c:209
 alloc_skb include/linux/skbuff.h:1049 [inline]
 netlink_alloc_large_skb net/netlink/af_netlink.c:1174 [inline]
 netlink_sendmsg+0x7d3/0x14d0 net/netlink/af_netlink.c:1892
 sock_sendmsg_nosec net/socket.c:639 [inline]
 sock_sendmsg net/socket.c:659 [inline]
 ____sys_sendmsg+0x12b6/0x1350 net/socket.c:2330
 ___sys_sendmsg net/socket.c:2384 [inline]
 __sys_sendmsg+0x451/0x5f0 net/socket.c:2417
 __do_sys_sendmsg net/socket.c:2426 [inline]
 __se_sys_sendmsg+0x97/0xb0 net/socket.c:2424
 __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2424
 do_syscall_64+0xb8/0x160 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 6f96c3c690 ("net_sched: fix backward compatibility for TCA_KIND")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-22 21:11:50 +01:00
Paolo Abeni
d39ca2590d Revert "udp: do rmem bulk free even if the rx sk queue is empty"
This reverts commit 0d4a6608f6.

Williem reported that after commit 0d4a6608f6 ("udp: do rmem bulk
free even if the rx sk queue is empty") the memory allocated by
an almost idle system with many UDP sockets can grow a lot.

For stable kernel keep the solution as simple as possible and revert
the offending commit.

Reported-by: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Diagnosed-by: Eric Dumazet <eric.dumazet@gmail.com>
Fixes: 0d4a6608f6 ("udp: do rmem bulk free even if the rx sk queue is empty")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-22 21:09:03 +01:00
Maxim Mikityanskiy
c80794323e net: Fix packet reordering caused by GRO and listified RX cooperation
Commit 323ebb61e3 ("net: use listified RX for handling GRO_NORMAL
skbs") introduces batching of GRO_NORMAL packets in napi_frags_finish,
and commit 6570bc79c0 ("net: core: use listified Rx for GRO_NORMAL in
napi_gro_receive()") adds the same to napi_skb_finish. However,
dev_gro_receive (that is called just before napi_{frags,skb}_finish) can
also pass skbs to the networking stack: e.g., when the GRO session is
flushed, napi_gro_complete is called, which passes pp directly to
netif_receive_skb_internal, skipping napi->rx_list. It means that the
packet stored in pp will be handled by the stack earlier than the
packets that arrived before, but are still waiting in napi->rx_list. It
leads to TCP reorderings that can be observed in the TCPOFOQueue counter
in netstat.

This commit fixes the reordering issue by making napi_gro_complete also
use napi->rx_list, so that all packets going through GRO will keep their
order. In order to keep napi_gro_flush working properly, gro_normal_list
calls are moved after the flush to clear napi->rx_list.

iwlwifi calls napi_gro_flush directly and does the same thing that is
done by gro_normal_list, so the same change is applied there:
napi_gro_flush is moved to be before the flush of napi->rx_list.

A few other drivers also use napi_gro_flush (brocade/bna/bnad.c,
cortina/gemini.c, hisilicon/hns3/hns3_enet.c). The first two also use
napi_complete_done afterwards, which performs the gro_normal_list flush,
so they are fine. The latter calls napi_gro_receive right after
napi_gro_flush, so it can end up with non-empty napi->rx_list anyway.

Fixes: 323ebb61e3 ("net: use listified RX for handling GRO_NORMAL skbs")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Cc: Alexander Lobakin <alobakin@dlink.ru>
Cc: Edward Cree <ecree@solarflare.com>
Acked-by: Alexander Lobakin <alobakin@dlink.ru>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-22 20:36:37 +01:00
Jakub Sitnicki
58c8db929d net, sk_msg: Don't check if sock is locked when tearing down psock
As John Fastabend reports [0], psock state tear-down can happen on receive
path *after* unlocking the socket, if the only other psock user, that is
sockmap or sockhash, releases its psock reference before tcp_bpf_recvmsg
does so:

 tcp_bpf_recvmsg()
  psock = sk_psock_get(sk)                         <- refcnt 2
  lock_sock(sk);
  ...
                                  sock_map_free()  <- refcnt 1
  release_sock(sk)
  sk_psock_put()                                   <- refcnt 0

Remove the lockdep check for socket lock in psock tear-down that got
introduced in 7e81a35302 ("bpf: Sockmap, ensure sock lock held during
tear down").

[0] https://lore.kernel.org/netdev/5e25dc995d7d_74082aaee6e465b441@john-XPS-13-9370.notmuch/

Fixes: 7e81a35302 ("bpf: Sockmap, ensure sock lock held during tear down")
Reported-by: syzbot+d73682fcf7fee6982fe3@syzkaller.appspotmail.com
Suggested-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-22 20:30:20 +01:00
Alain Michaud
bdf2aca703 Bluetooth: adding missing const decoration to mgmt_status_table
This change simply adds a missing const decoration to the
mtmt_status_table definition.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-22 17:37:25 +01:00
Björn Töpel
43a825afc9 xsk, net: Make sock_def_readable() have external linkage
XDP sockets use the default implementation of struct sock's
sk_data_ready callback, which is sock_def_readable(). This function
is called in the XDP socket fast-path, and involves a retpoline. By
letting sock_def_readable() have external linkage, and being called
directly, the retpoline can be avoided.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200120092917.13949-1-bjorn.topel@gmail.com
2020-01-22 00:08:52 +01:00
William Dauchy
d0f4185160 net, ip_tunnel: fix namespaces move
in the same manner as commit 690afc165b ("net: ip6_gre: fix moving
ip6gre between namespaces"), fix namespace moving as it was broken since
commit 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.").
Indeed, the ip6_gre commit removed the local flag for collect_md
condition, so there is no reason to keep it for ip_gre/ip_tunnel.

this patch will fix both ip_tunnel and ip_gre modules.

Fixes: 2e15ea390e ("ip_gre: Add support to collect tunnel metadata.")
Signed-off-by: William Dauchy <w.dauchy@criteo.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21 16:05:21 +01:00
David S. Miller
4f2c17e0f3 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2020-01-21

1) Add support for TCP encapsulation of IKE and ESP messages,
   as defined by RFC 8229. Patchset from Sabrina Dubroca.

Please note that there is a merge conflict in:

net/unix/af_unix.c

between commit:

3c32da19a8 ("unix: Show number of pending scm files of receive queue in fdinfo")

from the net-next tree and commit:

b50b0580d2 ("net: add queue argument to __skb_wait_for_more_packets and __skb_{,try_}recv_datagram")

from the ipsec-next tree.

The conflict can be solved as done in linux-next.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21 12:18:20 +01:00
Alex Shi
aeaec7bceb tcp/ipv4: remove AF_INET_FAMILY
After commit 079096f103 ("tcp/dccp: install syn_recv requests into ehash table")
the macro isn't used anymore. remove it.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21 12:03:51 +01:00
Alex Shi
339821f7d2 net/hsr: remove seq_nr_after_or_eq
It's never used after introduced. So maybe better to remove.

Signed-off-by: Alex Shi <alex.shi@linux.alibaba.com>
Cc: Arvid Brodin <arvid.brodin@alten.se>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21 12:03:21 +01:00
Hans Wippel
29237d22bc net/smc: allow unprivileged users to read pnet table
The current flags of the SMC_PNET_GET command only allow privileged
users to retrieve entries from the pnet table via netlink. The content
of the pnet table may be useful for all users though, e.g., for
debugging smc connection problems.

This patch removes the GENL_ADMIN_PERM flag so that unprivileged users
can read the pnet table.

Signed-off-by: Hans Wippel <ndev@hwipl.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21 11:39:56 +01:00
Theodore Dubois
bfe02b9f94 tcp: remove redundant assigment to snd_cwnd
Not sure how this got in here. git blame says the second assignment was
added in 3a9a57f6, but that commit also removed the first assignment.

Signed-off-by: Theodore Dubois <tblodt@icloud.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21 10:52:29 +01:00
Wen Yang
5b2f1f3070 tcp_bbr: improve arithmetic division in bbr_update_bw()
do_div() does a 64-by-32 division. Use div64_long() instead of it
if the divisor is long, to avoid truncation to 32-bit.
And as a nice side effect also cleans up the function a bit.

Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21 10:45:49 +01:00
Jouni Hogander
cb626bf566 net-sysfs: Fix reference count leak
Netdev_register_kobject is calling device_initialize. In case of error
reference taken by device_initialize is not given up.

Drivers are supposed to call free_netdev in case of error. In non-error
case the last reference is given up there and device release sequence
is triggered. In error case this reference is kept and the release
sequence is never started.

Fix this by setting reg_state as NETREG_UNREGISTERED if registering
fails.

This is the rootcause for couple of memory leaks reported by Syzkaller:

BUG: memory leak unreferenced object 0xffff8880675ca008 (size 256):
  comm "netdev_register", pid 281, jiffies 4294696663 (age 6.808s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
  backtrace:
    [<0000000058ca4711>] kmem_cache_alloc_trace+0x167/0x280
    [<000000002340019b>] device_add+0x882/0x1750
    [<000000001d588c3a>] netdev_register_kobject+0x128/0x380
    [<0000000011ef5535>] register_netdevice+0xa1b/0xf00
    [<000000007fcf1c99>] __tun_chr_ioctl+0x20d5/0x3dd0
    [<000000006a5b7b2b>] tun_chr_ioctl+0x2f/0x40
    [<00000000f30f834a>] do_vfs_ioctl+0x1c7/0x1510
    [<00000000fba062ea>] ksys_ioctl+0x99/0xb0
    [<00000000b1c1b8d2>] __x64_sys_ioctl+0x78/0xb0
    [<00000000984cabb9>] do_syscall_64+0x16f/0x580
    [<000000000bde033d>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [<00000000e6ca2d9f>] 0xffffffffffffffff

BUG: memory leak
unreferenced object 0xffff8880668ba588 (size 8):
  comm "kobject_set_nam", pid 286, jiffies 4294725297 (age 9.871s)
  hex dump (first 8 bytes):
    6e 72 30 00 cc be df 2b                          nr0....+
  backtrace:
    [<00000000a322332a>] __kmalloc_track_caller+0x16e/0x290
    [<00000000236fd26b>] kstrdup+0x3e/0x70
    [<00000000dd4a2815>] kstrdup_const+0x3e/0x50
    [<0000000049a377fc>] kvasprintf_const+0x10e/0x160
    [<00000000627fc711>] kobject_set_name_vargs+0x5b/0x140
    [<0000000019eeab06>] dev_set_name+0xc0/0xf0
    [<0000000069cb12bc>] netdev_register_kobject+0xc8/0x320
    [<00000000f2e83732>] register_netdevice+0xa1b/0xf00
    [<000000009e1f57cc>] __tun_chr_ioctl+0x20d5/0x3dd0
    [<000000009c560784>] tun_chr_ioctl+0x2f/0x40
    [<000000000d759e02>] do_vfs_ioctl+0x1c7/0x1510
    [<00000000351d7c31>] ksys_ioctl+0x99/0xb0
    [<000000008390040a>] __x64_sys_ioctl+0x78/0xb0
    [<0000000052d196b7>] do_syscall_64+0x16f/0x580
    [<0000000019af9236>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [<00000000bc384531>] 0xffffffffffffffff

v3 -> v4:
  Set reg_state to NETREG_UNREGISTERED if registering fails

v2 -> v3:
* Replaced BUG_ON with WARN_ON in free_netdev and netdev_release

v1 -> v2:
* Relying on driver calling free_netdev rather than calling
  put_device directly in error path

Reported-by: syzbot+ad8ca40ecd77896d51e2@syzkaller.appspotmail.com
Cc: David Miller <davem@davemloft.net>
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Jouni Hogander <jouni.hogander@unikie.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21 10:43:06 +01:00
David S. Miller
ad063075d4 Use ODP MRs for kernel ULPs
The following series extends MR creation routines to allow creation of
 user MRs through kernel ULPs as a proxy. The immediate use case is to
 allow RDS to work over FS-DAX, which requires ODP (on-demand-paging)
 MRs to be created and such MRs were not possible to create prior this
 series.
 
 The first part of this patchset extends RDMA to have special verb
 ib_reg_user_mr(). The common use case that uses this function is a
 userspace application that allocates memory for HCA access but the
 responsibility to register the memory at the HCA is on an kernel ULP.
 This ULP acts as an agent for the userspace application.
 
 The second part provides advise MR functionality for ULPs. This is
 integral part of ODP flows and used to trigger pagefaults in advance
 to prepare memory before running working set.
 
 The third part is actual user of those in-kernel APIs.
 -----BEGIN PGP SIGNATURE-----
 
 iHUEABYIAB0WIQT1m3YD37UfMCUQBNwp8NhrnBAZsQUCXiVO8AAKCRAp8NhrnBAZ
 scTrAP9gb0d3qv0IOtHw5aGI1DAgjTUn/SzUOnsjDEn7DIoh9gEA2+ZmaEyLXKrl
 +UcZb31auy5P8ueJYokRLhLAyRcOIAg=
 =yaHb
 -----END PGP SIGNATURE-----

Merge tag 'rds-odp-for-5.5' of https://git.kernel.org/pub/scm/linux/kernel/git/leon/linux-rdma

Leon Romanovsky says:

====================
Use ODP MRs for kernel ULPs

The following series extends MR creation routines to allow creation of
user MRs through kernel ULPs as a proxy. The immediate use case is to
allow RDS to work over FS-DAX, which requires ODP (on-demand-paging)
MRs to be created and such MRs were not possible to create prior this
series.

The first part of this patchset extends RDMA to have special verb
ib_reg_user_mr(). The common use case that uses this function is a
userspace application that allocates memory for HCA access but the
responsibility to register the memory at the HCA is on an kernel ULP.
This ULP acts as an agent for the userspace application.

The second part provides advise MR functionality for ULPs. This is
integral part of ODP flows and used to trigger pagefaults in advance
to prepare memory before running working set.

The third part is actual user of those in-kernel APIs.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21 10:22:51 +01:00
Yuki Taguchi
62ebaeaede ipv6: sr: remove SKB_GSO_IPXIP6 on End.D* actions
After LRO/GRO is applied, SRv6 encapsulated packets have
SKB_GSO_IPXIP6 feature flag, and this flag must be removed right after
decapulation procedure.

Currently, SKB_GSO_IPXIP6 flag is not removed on End.D* actions, which
creates inconsistent packet state, that is, a normal TCP/IP packets
have the SKB_GSO_IPXIP6 flag. This behavior can cause unexpected
fallback to GSO on routing to netdevices that do not support
SKB_GSO_IPXIP6. For example, on inter-VRF forwarding, decapsulated
packets separated into small packets by GSO because VRF devices do not
support TSO for packets with SKB_GSO_IPXIP6 flag, and this degrades
forwarding performance.

This patch removes encapsulation related GSO flags from the skb right
after the End.D* action is applied.

Fixes: d7a669dd2f ("ipv6: sr: add helper functions for seg6local")
Signed-off-by: Yuki Taguchi <tagyounit@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21 10:17:07 +01:00
David S. Miller
9c5ed2f831 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2020-01-21

1) Fix packet tx through bpf_redirect() for xfrm and vti
   interfaces. From Nicolas Dichtel.

2) Do not confirm neighbor when do pmtu update on a virtual
   xfrm interface. From Xu Wang.

3) Support output_mark for offload ESP packets, this was
   forgotten when the output_mark was added initially.
   From Ulrich Weber.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21 09:25:58 +01:00
Kadlecsik József
32c72165db netfilter: ipset: use bitmap infrastructure completely
The bitmap allocation did not use full unsigned long sizes
when calculating the required size and that was triggered by KASAN
as slab-out-of-bounds read in several places. The patch fixes all
of them.

Reported-by: syzbot+fabca5cbf5e54f3fe2de@syzkaller.appspotmail.com
Reported-by: syzbot+827ced406c9a1d9570ed@syzkaller.appspotmail.com
Reported-by: syzbot+190d63957b22ef673ea5@syzkaller.appspotmail.com
Reported-by: syzbot+dfccdb2bdb4a12ad425e@syzkaller.appspotmail.com
Reported-by: syzbot+df0d0f5895ef1f41a65b@syzkaller.appspotmail.com
Reported-by: syzbot+b08bd19bb37513357fd4@syzkaller.appspotmail.com
Reported-by: syzbot+53cdd0ec0bbabd53370a@syzkaller.appspotmail.com
Signed-off-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-20 17:41:45 +01:00
xiaofeng.yan
80892772c4 hsr: Fix a compilation error
A compliation error happen when building branch 5.5-rc7

In file included from net/hsr/hsr_main.c:12:0:
net/hsr/hsr_main.h:194:20: error: two or more data types in declaration specifiers
 static inline void void hsr_debugfs_rename(struct net_device *dev)

So Removed one void.

Fixes: 4c2d5e33dc ("hsr: rename debugfs file when interface name is changed")
Signed-off-by: xiaofeng.yan <yanxiaofeng7@jd.com>
Acked-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-20 13:18:40 +01:00
Niko Kortstrom
690afc165b net: ip6_gre: fix moving ip6gre between namespaces
Support for moving IPv4 GRE tunnels between namespaces was added in
commit b57708add3 ("gre: add x-netns support"). The respective change
for IPv6 tunnels, commit 22f08069e8 ("ip6gre: add x-netns support")
did not drop NETIF_F_NETNS_LOCAL flag so moving them from one netns to
another is still denied in IPv6 case. Drop NETIF_F_NETNS_LOCAL flag from
ip6gre tunnels to allow moving ip6gre tunnel endpoints between network
namespaces.

Signed-off-by: Niko Kortstrom <niko.kortstrom@nokia.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-20 09:59:24 +01:00
Ingo Molnar
cb6c82df68 Linux 5.5-rc7
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAl4k7i8eHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGvk0IAKRenVOdiudY77SQ
 VZjsteyrYTTQtPPv494ToIRjR0XQ+gYp8vyWzXTUC5Nm9Y9U3VzDqUPUjWszrSXE
 6mU+tzcMc9qwuUxnIFn8zfg64ygw+37sn/w3xqeH4QmF9Z5Wl3EX3SdXTs7jp3RS
 VxiztkUNI5ZBV2GDtla5K/9qLPqCQnUYXIiyi5lAtBtiitZDVXFp7dy7hMgEiaEO
 +78K5Kh3xlt5ndDsBFOlwIb2Oof3KL7bBXntdbSBc/bjol6IRvAgln48HWCv59G2
 jzAp2tj2KobX9GRAEPj+v4TQZEW0SXDNDi8MgQsM+3DYVCTmANsv57CBKRuf01+F
 nB1kAys=
 =zSnJ
 -----END PGP SIGNATURE-----

Merge tag 'v5.5-rc7' into perf/core, to pick up fixes

Signed-off-by: Ingo Molnar <mingo@kernel.org>
2020-01-20 08:43:44 +01:00
David S. Miller
b3f7e3f23a Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2020-01-19 22:10:04 +01:00
Linus Torvalds
11a8272947 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Fix non-blocking connect() in x25, from Martin Schiller.

 2) Fix spurious decryption errors in kTLS, from Jakub Kicinski.

 3) Netfilter use-after-free in mtype_destroy(), from Cong Wang.

 4) Limit size of TSO packets properly in lan78xx driver, from Eric
    Dumazet.

 5) r8152 probe needs an endpoint sanity check, from Johan Hovold.

 6) Prevent looping in tcp_bpf_unhash() during sockmap/tls free, from
    John Fastabend.

 7) hns3 needs short frames padded on transmit, from Yunsheng Lin.

 8) Fix netfilter ICMP header corruption, from Eyal Birger.

 9) Fix soft lockup when low on memory in hns3, from Yonglong Liu.

10) Fix NTUPLE firmware command failures in bnxt_en, from Michael Chan.

11) Fix memory leak in act_ctinfo, from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (91 commits)
  cxgb4: reject overlapped queues in TC-MQPRIO offload
  cxgb4: fix Tx multi channel port rate limit
  net: sched: act_ctinfo: fix memory leak
  bnxt_en: Do not treat DSN (Digital Serial Number) read failure as fatal.
  bnxt_en: Fix ipv6 RFS filter matching logic.
  bnxt_en: Fix NTUPLE firmware command failures.
  net: systemport: Fixed queue mapping in internal ring map
  net: dsa: bcm_sf2: Configure IMP port for 2Gb/sec
  net: dsa: sja1105: Don't error out on disabled ports with no phy-mode
  net: phy: dp83867: Set FORCE_LINK_GOOD to default after reset
  net: hns: fix soft lockup when there is not enough memory
  net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()
  net/sched: act_ife: initalize ife->metalist earlier
  netfilter: nat: fix ICMP header corruption on ICMP errors
  net: wan: lapbether.c: Use built-in RCU list checking
  netfilter: nf_tables: fix flowtable list del corruption
  netfilter: nf_tables: fix memory leak in nf_tables_parse_netdev_hooks()
  netfilter: nf_tables: remove WARN and add NLA_STRING upper limits
  netfilter: nft_tunnel: ERSPAN_VERSION must not be null
  netfilter: nft_tunnel: fix null-attribute check
  ...
2020-01-19 12:03:53 -08:00
Amit Cohen
c3cae4916e devlink: Add overlay source MAC is multicast trap
Add packet trap that can report NVE packets that the device decided to
drop because their overlay source MAC is multicast.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-19 16:23:52 +01:00
Amit Cohen
13c056ec7d devlink: Add tunnel generic packet traps
Add packet traps that can report packets that were dropped during tunnel
decapsulation.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-19 16:23:52 +01:00
Amit Cohen
95f0ead8f0 devlink: Add non-routable packet trap
Add packet trap that can report packets that reached the router, but are
non-routable. For example, IGMP queries can be flooded by the device in
layer 2 and reach the router. Such packets should not be routed and
instead dropped.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-19 16:23:52 +01:00
Eric Dumazet
09d4f10a5e net: sched: act_ctinfo: fix memory leak
Implement a cleanup method to properly free ci->params

BUG: memory leak
unreferenced object 0xffff88811746e2c0 (size 64):
  comm "syz-executor617", pid 7106, jiffies 4294943055 (age 14.250s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
    c0 34 60 84 ff ff ff ff 00 00 00 00 00 00 00 00  .4`.............
  backtrace:
    [<0000000015aa236f>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
    [<0000000015aa236f>] slab_post_alloc_hook mm/slab.h:586 [inline]
    [<0000000015aa236f>] slab_alloc mm/slab.c:3320 [inline]
    [<0000000015aa236f>] kmem_cache_alloc_trace+0x145/0x2c0 mm/slab.c:3549
    [<000000002c946bd1>] kmalloc include/linux/slab.h:556 [inline]
    [<000000002c946bd1>] kzalloc include/linux/slab.h:670 [inline]
    [<000000002c946bd1>] tcf_ctinfo_init+0x21a/0x530 net/sched/act_ctinfo.c:236
    [<0000000086952cca>] tcf_action_init_1+0x400/0x5b0 net/sched/act_api.c:944
    [<000000005ab29bf8>] tcf_action_init+0x135/0x1c0 net/sched/act_api.c:1000
    [<00000000392f56f9>] tcf_action_add+0x9a/0x200 net/sched/act_api.c:1410
    [<0000000088f3c5dd>] tc_ctl_action+0x14d/0x1bb net/sched/act_api.c:1465
    [<000000006b39d986>] rtnetlink_rcv_msg+0x178/0x4b0 net/core/rtnetlink.c:5424
    [<00000000fd6ecace>] netlink_rcv_skb+0x61/0x170 net/netlink/af_netlink.c:2477
    [<0000000047493d02>] rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5442
    [<00000000bdcf8286>] netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
    [<00000000bdcf8286>] netlink_unicast+0x223/0x310 net/netlink/af_netlink.c:1328
    [<00000000fc5b92d9>] netlink_sendmsg+0x2c0/0x570 net/netlink/af_netlink.c:1917
    [<00000000da84d076>] sock_sendmsg_nosec net/socket.c:639 [inline]
    [<00000000da84d076>] sock_sendmsg+0x54/0x70 net/socket.c:659
    [<0000000042fb2eee>] ____sys_sendmsg+0x2d0/0x300 net/socket.c:2330
    [<000000008f23f67e>] ___sys_sendmsg+0x8a/0xd0 net/socket.c:2384
    [<00000000d838e4f6>] __sys_sendmsg+0x80/0xf0 net/socket.c:2417
    [<00000000289a9cb1>] __do_sys_sendmsg net/socket.c:2426 [inline]
    [<00000000289a9cb1>] __se_sys_sendmsg net/socket.c:2424 [inline]
    [<00000000289a9cb1>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2424

Fixes: 24ec483cec ("net: sched: Introduce act_ctinfo action")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Kevin 'ldir' Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Kevin 'ldir' Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-19 16:02:15 +01:00
David S. Miller
7f013edeba Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next, they are:

1) Incorrect uapi header comment in bitwise, from Jeremy Sowden.

2) Fetch flow statistics if flow is still active.

3) Restrict flow matching on hardware based on input device.

4) Add nf_flow_offload_work_alloc() helper function.

5) Remove the last client of the FLOW_OFFLOAD_DYING flag, use teardown
   instead.

6) Use atomic bitwise operation to operate with flow flags.

7) Add nf_flowtable_hw_offload() helper function to check for the
   NF_FLOWTABLE_HW_OFFLOAD flag.

8) Add NF_FLOW_HW_REFRESH to retry hardware offload from the flowtable
   software datapath.

9) Remove indirect calls in xt_hashlimit, from Florian Westphal.

10) Add nf_flow_offload_tuple() helper to consolidate code.

11) Add nf_flow_table_offload_cmd() helper function.

12) A few whitespace cleanups in nf_tables in bitwise and the bitmap/hash
    set types, from Jeremy Sowden.

13) Cleanup netlink attribute checks in bitwise, from Jeremy Sowden.

14) Replace goto by return in error path of nft_bitwise_dump(), from
    Jeremy Sowden.

15) Add bitwise operation netlink attribute, also from Jeremy.

16) Add nft_bitwise_init_bool(), from Jeremy Sowden.

17) Add nft_bitwise_eval_bool(), also from Jeremy.

18) Add nft_bitwise_dump_bool(), from Jeremy Sowden.

19) Disallow hardware offload for other that NFT_BITWISE_BOOL,
    from Jeremy Sowden.

20) Add NFTA_BITWISE_DATA netlink attribute, again from Jeremy.

21) Add support for bitwise shift operation, from Jeremy Sowden.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-19 10:29:05 +01:00
Florian Westphal
7eaecf7963 netfilter: nft_osf: add missing check for DREG attribute
syzbot reports just another NULL deref crash because of missing test
for presence of the attribute.

Reported-by: syzbot+cf23983d697c26c34f60@syzkaller.appspotmail.com
Fixes:  b96af92d6e ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-18 21:18:41 +01:00
Hans Westgaard Ry
b2dfc6765e net/rds: Use prefetch for On-Demand-Paging MR
Try prefetching pages when using On-Demand-Paging MR using
ib_advise_mr.

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2020-01-18 11:48:19 +02:00
Hans Westgaard Ry
2eafa1746f net/rds: Handle ODP mr registration/unregistration
On-Demand-Paging MRs are registered using ib_reg_user_mr and
unregistered with ib_dereg_mr.

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2020-01-18 11:48:19 +02:00
Guillaume Nault
56f200c78c netns: Constify exported functions
Mark function parameters as 'const' where possible.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-17 13:25:24 +01:00
Cong Wang
53d374979e net: avoid updating qdisc_xmit_lock_key in netdev_update_lockdep_key()
syzbot reported some bogus lockdep warnings, for example bad unlock
balance in sch_direct_xmit(). They are due to a race condition between
slow path and fast path, that is qdisc_xmit_lock_key gets re-registered
in netdev_update_lockdep_key() on slow path, while we could still
acquire the queue->_xmit_lock on fast path in this small window:

CPU A						CPU B
						__netif_tx_lock();
lockdep_unregister_key(qdisc_xmit_lock_key);
						__netif_tx_unlock();
lockdep_register_key(qdisc_xmit_lock_key);

In fact, unlike the addr_list_lock which has to be reordered when
the master/slave device relationship changes, queue->_xmit_lock is
only acquired on fast path and only when NETIF_F_LLTX is not set,
so there is likely no nested locking for it.

Therefore, we can just get rid of re-registration of
qdisc_xmit_lock_key.

Reported-by: syzbot+4ec99438ed7450da6272@syzkaller.appspotmail.com
Fixes: ab92d68fc2 ("net: core: add generic lockdep keys")
Cc: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-17 11:02:44 +01:00
Eric Dumazet
44c23d7159 net/sched: act_ife: initalize ife->metalist earlier
It seems better to init ife->metalist earlier in tcf_ife_init()
to avoid the following crash :

kasan: CONFIG_KASAN_INLINE enabled
kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
CPU: 0 PID: 10483 Comm: syz-executor216 Not tainted 5.5.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:_tcf_ife_cleanup net/sched/act_ife.c:412 [inline]
RIP: 0010:tcf_ife_cleanup+0x6e/0x400 net/sched/act_ife.c:431
Code: 48 c1 ea 03 80 3c 02 00 0f 85 94 03 00 00 49 8b bd f8 00 00 00 48 b8 00 00 00 00 00 fc ff df 4c 8d 67 e8 48 89 fa 48 c1 ea 03 <80> 3c 02 00 0f 85 5c 03 00 00 48 bb 00 00 00 00 00 fc ff df 48 8b
RSP: 0018:ffffc90001dc6d00 EFLAGS: 00010246
RAX: dffffc0000000000 RBX: ffffffff864619c0 RCX: ffffffff815bfa09
RDX: 0000000000000000 RSI: 0000000000000004 RDI: 0000000000000000
RBP: ffffc90001dc6d50 R08: 0000000000000004 R09: fffff520003b8d8e
R10: fffff520003b8d8d R11: 0000000000000003 R12: ffffffffffffffe8
R13: ffff8880a79fc000 R14: ffff88809aba0e00 R15: 0000000000000000
FS:  0000000001b51880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000563f52cce140 CR3: 0000000093541000 CR4: 00000000001406f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 tcf_action_cleanup+0x62/0x1b0 net/sched/act_api.c:119
 __tcf_action_put+0xfa/0x130 net/sched/act_api.c:135
 __tcf_idr_release net/sched/act_api.c:165 [inline]
 __tcf_idr_release+0x59/0xf0 net/sched/act_api.c:145
 tcf_idr_release include/net/act_api.h:171 [inline]
 tcf_ife_init+0x97c/0x1870 net/sched/act_ife.c:616
 tcf_action_init_1+0x6b6/0xa40 net/sched/act_api.c:944
 tcf_action_init+0x21a/0x330 net/sched/act_api.c:1000
 tcf_action_add+0xf5/0x3b0 net/sched/act_api.c:1410
 tc_ctl_action+0x390/0x488 net/sched/act_api.c:1465
 rtnetlink_rcv_msg+0x45e/0xaf0 net/core/rtnetlink.c:5424
 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
 rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5442
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0x58c/0x7d0 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x91c/0xea0 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:639 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:659
 ____sys_sendmsg+0x753/0x880 net/socket.c:2330
 ___sys_sendmsg+0x100/0x170 net/socket.c:2384
 __sys_sendmsg+0x105/0x1d0 net/socket.c:2417
 __do_sys_sendmsg net/socket.c:2426 [inline]
 __se_sys_sendmsg net/socket.c:2424 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2424
 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: 11a94d7fd8 ("net/sched: act_ife: validate the control action inside init()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-17 10:58:15 +01:00
David S. Miller
a72b6a1ee4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter updates for net

The following patchset contains Netfilter fixes for net:

1) Fix use-after-free in ipset bitmap destroy path, from Cong Wang.

2) Missing init netns in entry cleanup path of arp_tables,
   from Florian Westphal.

3) Fix WARN_ON in set destroy path due to missing cleanup on
   transaction error.

4) Incorrect netlink sanity check in tunnel, from Florian Westphal.

5) Missing sanity check for erspan version netlink attribute, also
   from Florian.

6) Remove WARN in nft_request_module() that can be triggered from
   userspace, from Florian Westphal.

7) Memleak in NFTA_HOOK_DEVS netlink parser, from Dan Carpenter.

8) List poison from commit path for flowtables that are added and
   deleted in the same batch, from Florian Westphal.

9) Fix NAT ICMP packet corruption, from Eyal Birger.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-17 10:36:07 +01:00
Toke Høiland-Jørgensen
1d233886dd xdp: Use bulking for non-map XDP_REDIRECT and consolidate code paths
Since the bulk queue used by XDP_REDIRECT now lives in struct net_device,
we can re-use the bulking for the non-map version of the bpf_redirect()
helper. This is a simple matter of having xdp_do_redirect_slow() queue the
frame on the bulk queue instead of sending it out with __bpf_tx_xdp().

Unfortunately we can't make the bpf_redirect() helper return an error if
the ifindex doesn't exit (as bpf_redirect_map() does), because we don't
have a reference to the network namespace of the ingress device at the time
the helper is called. So we have to leave it as-is and keep the device
lookup in xdp_do_redirect_slow().

Since this leaves less reason to have the non-map redirect code in a
separate function, so we get rid of the xdp_do_redirect_slow() function
entirely. This does lose us the tracepoint disambiguation, but fortunately
the xdp_redirect and xdp_redirect_map tracepoints use the same tracepoint
entry structures. This means both can contain a map index, so we can just
amend the tracepoint definitions so we always emit the xdp_redirect(_err)
tracepoints, but with the map ID only populated if a map is present. This
means we retire the xdp_redirect_map(_err) tracepoints entirely, but keep
the definitions around in case someone is still listening for them.

With this change, the performance of the xdp_redirect sample program goes
from 5Mpps to 8.4Mpps (a 68% increase).

Since the flush functions are no longer map-specific, rename the flush()
functions to drop _map from their names. One of the renamed functions is
the xdp_do_flush_map() callback used in all the xdp-enabled drivers. To
keep from having to update all drivers, use a #define to keep the old name
working, and only update the virtual drivers in this patch.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/157918768505.1458396.17518057312953572912.stgit@toke.dk
2020-01-16 20:03:34 -08:00
Toke Høiland-Jørgensen
75ccae62cb xdp: Move devmap bulk queue into struct net_device
Commit 96360004b8 ("xdp: Make devmap flush_list common for all map
instances"), changed devmap flushing to be a global operation instead of a
per-map operation. However, the queue structure used for bulking was still
allocated as part of the containing map.

This patch moves the devmap bulk queue into struct net_device. The
motivation for this is reusing it for the non-map variant of XDP_REDIRECT,
which will be changed in a subsequent commit.  To avoid other fields of
struct net_device moving to different cache lines, we also move a couple of
other members around.

We defer the actual allocation of the bulk queue structure until the
NETDEV_REGISTER notification devmap.c. This makes it possible to check for
ndo_xdp_xmit support before allocating the structure, which is not possible
at the time struct net_device is allocated. However, we keep the freeing in
free_netdev() to avoid adding another RCU callback on NETDEV_UNREGISTER.

Because of this change, we lose the reference back to the map that
originated the redirect, so change the tracepoint to always return 0 as the
map ID and index. Otherwise no functional change is intended with this
patch.

After this patch, the relevant part of struct net_device looks like this,
according to pahole:

	/* --- cacheline 14 boundary (896 bytes) --- */
	struct netdev_queue *      _tx __attribute__((__aligned__(64))); /*   896     8 */
	unsigned int               num_tx_queues;        /*   904     4 */
	unsigned int               real_num_tx_queues;   /*   908     4 */
	struct Qdisc *             qdisc;                /*   912     8 */
	unsigned int               tx_queue_len;         /*   920     4 */
	spinlock_t                 tx_global_lock;       /*   924     4 */
	struct xdp_dev_bulk_queue * xdp_bulkq;           /*   928     8 */
	struct xps_dev_maps *      xps_cpus_map;         /*   936     8 */
	struct xps_dev_maps *      xps_rxqs_map;         /*   944     8 */
	struct mini_Qdisc *        miniq_egress;         /*   952     8 */
	/* --- cacheline 15 boundary (960 bytes) --- */
	struct hlist_head  qdisc_hash[16];               /*   960   128 */
	/* --- cacheline 17 boundary (1088 bytes) --- */
	struct timer_list  watchdog_timer;               /*  1088    40 */

	/* XXX last struct has 4 bytes of padding */

	int                        watchdog_timeo;       /*  1128     4 */

	/* XXX 4 bytes hole, try to pack */

	struct list_head   todo_list;                    /*  1136    16 */
	/* --- cacheline 18 boundary (1152 bytes) --- */

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/157918768397.1458396.12673224324627072349.stgit@toke.dk
2020-01-16 20:03:34 -08:00
Jeremy Sowden
567d746b55 netfilter: bitwise: add support for shifts.
Hitherto nft_bitwise has only supported boolean operations: NOT, AND, OR
and XOR.  Extend it to do shifts as well.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:52:02 +01:00
Jeremy Sowden
779f725e14 netfilter: bitwise: add NFTA_BITWISE_DATA attribute.
Add a new bitwise netlink attribute that will be used by shift
operations to store the size of the shift.  It is not used by boolean
operations.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:52:02 +01:00
Jeremy Sowden
ed991d4363 netfilter: bitwise: only offload boolean operations.
Only boolean operations supports offloading, so check the type of the
operation and return an error for other types.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:52:01 +01:00
Jeremy Sowden
4d57ca2be1 netfilter: bitwise: add helper for dumping boolean operations.
Split the code specific to dumping bitwise boolean operations out into a
separate function.  A similar function will be added later for shift
operations.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:52:00 +01:00
Jeremy Sowden
71d6ded3ac netfilter: bitwise: add helper for evaluating boolean operations.
Split the code specific to evaluating bitwise boolean operations out
into a separate function.  Similar functions will be added later for
shift operations.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:59 +01:00
Jeremy Sowden
3f8d9eb032 netfilter: bitwise: add helper for initializing boolean operations.
Split the code specific to initializing bitwise boolean operations out
into a separate function.  A similar function will be added later for
shift operations.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:58 +01:00
Jeremy Sowden
9d1f979986 netfilter: bitwise: add NFTA_BITWISE_OP netlink attribute.
Add a new bitwise netlink attribute, NFTA_BITWISE_OP, which is set to a
value of a new enum, nft_bitwise_ops.  It describes the type of
operation an expression contains.  Currently, it only has one value:
NFT_BITWISE_BOOL.  More values will be added later to implement shifts.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:57 +01:00
Jeremy Sowden
577c734a81 netfilter: bitwise: replace gotos with returns.
When dumping a bitwise expression, if any of the puts fails, we use goto
to jump to a label.  However, no clean-up is required and the only
statement at the label is a return.  Drop the goto's and return
immediately instead.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:57 +01:00
Jeremy Sowden
265ec7b0e8 netfilter: bitwise: remove NULL comparisons from attribute checks.
In later patches, we will be adding more checks.  In order to be
consistent and prevent complaints from checkpatch.pl, replace the
existing comparisons with NULL with logical NOT operators.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:56 +01:00
Jeremy Sowden
fbf19ddf39 netfilter: nf_tables: white-space fixes.
Indentation fixes for the parameters of a few nft functions.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:55 +01:00
Pablo Neira Ayuso
a7965d58dd netfilter: flowtable: add nf_flow_table_offload_cmd()
Split nf_flow_table_offload_setup() in two functions to make it more
maintainable.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:54 +01:00
Pablo Neira Ayuso
ae29045018 netfilter: flowtable: add nf_flow_offload_tuple() helper
Consolidate code to configure the flow_cls_offload structure into one
helper function.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:53 +01:00
Florian Westphal
28b3a4270c netfilter: hashlimit: do not use indirect calls during gc
no need, just use a simple boolean to indicate we want to reap all
entries.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:52 +01:00
Pablo Neira Ayuso
f698fe4082 netfilter: flowtable: refresh flow if hardware offload fails
If nf_flow_offload_add() fails to add the flow to hardware, then the
NF_FLOW_HW_REFRESH flag bit is set and the flow remains in the flowtable
software path.

If flowtable hardware offload is enabled, this patch enqueues a new
request to offload this flow to hardware.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:52 +01:00
Pablo Neira Ayuso
a5449cdcaa netfilter: flowtable: add nf_flowtable_hw_offload() helper function
This function checks for the NF_FLOWTABLE_HW_OFFLOAD flag, meaning that
the flowtable hardware offload is enabled.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:51 +01:00
Pablo Neira Ayuso
355a8b13f8 netfilter: flowtable: use atomic bitwise operations for flow flags
Originally, all flow flag bits were set on only from the workqueue. With
the introduction of the flow teardown state and hardware offload this is
no longer true. Let's be safe and use atomic bitwise operation to
operation with flow flags.

Fixes: 59c466dd68 ("netfilter: nf_flow_table: add a new flow state for tearing down offloading")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:50 +01:00
Pablo Neira Ayuso
445db8d096 netfilter: flowtable: remove dying bit, use teardown bit instead
The dying bit removes the conntrack entry if the netdev that owns this
flow is going down. Instead, use the teardown mechanism to push back the
flow to conntrack to let the classic software path decide what to do
with it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:49 +01:00
Pablo Neira Ayuso
87265d842c netfilter: flowtable: add nf_flow_offload_work_alloc()
Add helper function to allocate and initialize flow offload work and use
it to consolidate existing code.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:48 +01:00
Pablo Neira Ayuso
a7521a60a5 netfilter: flowtable: restrict flow dissector match on meta ingress device
Set on FLOW_DISSECTOR_KEY_META meta key using flow tuple ingress interface.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:48 +01:00
Pablo Neira Ayuso
79b9b685dd netfilter: flowtable: fetch stats only if flow is still alive
Do not fetch statistics if flow has expired since it might not in
hardware anymore. After this update, remove the FLOW_OFFLOAD_HW_DYING
check from nf_flow_offload_stats() since this flag is never set on.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: wenxu <wenxu@ucloud.cn>
2020-01-16 15:51:47 +01:00
Hans Westgaard Ry
c4c86abb3f net/rds: Detect need of On-Demand-Paging memory registration
Add code to check if memory intended for RDMA is FS-DAX-memory. RDS
will fail with error code EOPNOTSUPP if FS-DAX-memory is detected.

Signed-off-by: Hans Westgaard Ry <hans.westgaard.ry@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
2020-01-16 16:15:25 +02:00
Eyal Birger
61177e911d netfilter: nat: fix ICMP header corruption on ICMP errors
Commit 8303b7e8f0 ("netfilter: nat: fix spurious connection timeouts")
made nf_nat_icmp_reply_translation() use icmp_manip_pkt() as the l4
manipulation function for the outer packet on ICMP errors.

However, icmp_manip_pkt() assumes the packet has an 'id' field which
is not correct for all types of ICMP messages.

This is not correct for ICMP error packets, and leads to bogus bytes
being written the ICMP header, which can be wrongfully regarded as
'length' bytes by RFC 4884 compliant receivers.

Fix by assigning the 'id' field only for ICMP messages that have this
semantic.

Reported-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Fixes: 8303b7e8f0 ("netfilter: nat: fix spurious connection timeouts")
Signed-off-by: Eyal Birger <eyal.birger@gmail.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:08:25 +01:00
Florian Westphal
335178d542 netfilter: nf_tables: fix flowtable list del corruption
syzbot reported following crash:

  list_del corruption, ffff88808c9bb000->prev is LIST_POISON2 (dead000000000122)
  [..]
  Call Trace:
   __list_del_entry include/linux/list.h:131 [inline]
   list_del_rcu include/linux/rculist.h:148 [inline]
   nf_tables_commit+0x1068/0x3b30 net/netfilter/nf_tables_api.c:7183
   [..]

The commit transaction list has:

NFT_MSG_NEWTABLE
NFT_MSG_NEWFLOWTABLE
NFT_MSG_DELFLOWTABLE
NFT_MSG_DELTABLE

A missing generation check during DELTABLE processing causes it to queue
the DELFLOWTABLE operation a second time, so we corrupt the list here:

  case NFT_MSG_DELFLOWTABLE:
     list_del_rcu(&nft_trans_flowtable(trans)->list);
     nf_tables_flowtable_notify(&trans->ctx,

because we have two different DELFLOWTABLE transactions for the same
flowtable.  We then call list_del_rcu() twice for the same flowtable->list.

The object handling seems to suffer from the same bug so add a generation
check too and only queue delete transactions for flowtables/objects that
are still active in the next generation.

Reported-by: syzbot+37a6804945a3a13b1572@syzkaller.appspotmail.com
Fixes: 3b49e2e94e ("netfilter: nf_tables: add flow table netlink frontend")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 14:22:33 +01:00
Dan Carpenter
cd77e75b5e netfilter: nf_tables: fix memory leak in nf_tables_parse_netdev_hooks()
Syzbot detected a leak in nf_tables_parse_netdev_hooks().  If the hook
already exists, then the error handling doesn't free the newest "hook".

Reported-by: syzbot+f9d4095107fc8749c69c@syzkaller.appspotmail.com
Fixes: b75a3e8371 ("netfilter: nf_tables: allow netdevice to be used only once per flowtable")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 14:22:32 +01:00
Florian Westphal
9332d27d79 netfilter: nf_tables: remove WARN and add NLA_STRING upper limits
This WARN can trigger because some of the names fed to the module
autoload function can be of arbitrary length.

Remove the WARN and add limits for all NLA_STRING attributes.

Reported-by: syzbot+0e63ae76d117ae1c3a01@syzkaller.appspotmail.com
Fixes: 452238e8d5 ("netfilter: nf_tables: add and use helper for module autoload")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 14:22:32 +01:00
Florian Westphal
9ec22d7c6c netfilter: nft_tunnel: ERSPAN_VERSION must not be null
Fixes: af308b94a2 ("netfilter: nf_tables: add tunnel support")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 14:22:32 +01:00
Florian Westphal
1c702bf902 netfilter: nft_tunnel: fix null-attribute check
else we get null deref when one of the attributes is missing, both
must be non-null.

Reported-by: syzbot+76d0b80493ac881ff77b@syzkaller.appspotmail.com
Fixes: aaecfdb5c5 ("netfilter: nf_tables: match on tunnel metadata")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 14:22:32 +01:00
Pablo Neira Ayuso
ec7470b834 netfilter: nf_tables: store transaction list locally while requesting module
This patch fixes a WARN_ON in nft_set_destroy() due to missing
set reference count drop from the preparation phase. This is triggered
by the module autoload path. Do not exercise the abort path from
nft_request_module() while preparation phase cleaning up is still
pending.

 WARNING: CPU: 3 PID: 3456 at net/netfilter/nf_tables_api.c:3740 nft_set_destroy+0x45/0x50 [nf_tables]
 [...]
 CPU: 3 PID: 3456 Comm: nft Not tainted 5.4.6-arch3-1 #1
 RIP: 0010:nft_set_destroy+0x45/0x50 [nf_tables]
 Code: e8 30 eb 83 c6 48 8b 85 80 00 00 00 48 8b b8 90 00 00 00 e8 dd 6b d7 c5 48 8b 7d 30 e8 24 dd eb c5 48 89 ef 5d e9 6b c6 e5 c5 <0f> 0b c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 8b 7f 10 e9 52
 RSP: 0018:ffffac4f43e53700 EFLAGS: 00010202
 RAX: 0000000000000001 RBX: ffff99d63a154d80 RCX: 0000000001f88e03
 RDX: 0000000001f88c03 RSI: ffff99d6560ef0c0 RDI: ffff99d63a101200
 RBP: ffff99d617721de0 R08: 0000000000000000 R09: 0000000000000318
 R10: 00000000f0000000 R11: 0000000000000001 R12: ffffffff880fabf0
 R13: dead000000000122 R14: dead000000000100 R15: ffff99d63a154d80
 FS:  00007ff3dbd5b740(0000) GS:ffff99d6560c0000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 00001cb5de6a9000 CR3: 000000016eb6a004 CR4: 00000000001606e0
 Call Trace:
  __nf_tables_abort+0x3e3/0x6d0 [nf_tables]
  nft_request_module+0x6f/0x110 [nf_tables]
  nft_expr_type_request_module+0x28/0x50 [nf_tables]
  nf_tables_expr_parse+0x198/0x1f0 [nf_tables]
  nft_expr_init+0x3b/0xf0 [nf_tables]
  nft_dynset_init+0x1e2/0x410 [nf_tables]
  nf_tables_newrule+0x30a/0x930 [nf_tables]
  nfnetlink_rcv_batch+0x2a0/0x640 [nfnetlink]
  nfnetlink_rcv+0x125/0x171 [nfnetlink]
  netlink_unicast+0x179/0x210
  netlink_sendmsg+0x208/0x3d0
  sock_sendmsg+0x5e/0x60
  ____sys_sendmsg+0x21b/0x290

Update comment on the code to describe the new behaviour.

Reported-by: Marco Oliverio <marco.oliverio@tanaza.com>
Fixes: 452238e8d5 ("netfilter: nf_tables: add and use helper for module autoload")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 14:21:51 +01:00
Alexander Lobakin
bd5874da57 net: dsa: tag_qca: fix doubled Tx statistics
DSA subsystem takes care of netdev statistics since commit 4ed70ce9f0
("net: dsa: Refactor transmit path to eliminate duplication"), so
any accounting inside tagger callbacks is redundant and can lead to
messing up the stats.
This bug is present in Qualcomm tagger since day 0.

Fixes: cafdc45c94 ("net-next: dsa: add Qualcomm tag RX/TX handler")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Alexander Lobakin <alobakin@dlink.ru>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-16 13:59:40 +01:00
Alexander Lobakin
ad32205470 net: dsa: tag_gswip: fix typo in tagger name
The correct name is GSWIP (Gigabit Switch IP). Typo was introduced in
875138f81d ("dsa: Move tagger name into its ops structure") while
moving tagger names to their structures.

Fixes: 875138f81d ("dsa: Move tagger name into its ops structure")
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Alexander Lobakin <alobakin@dlink.ru>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-16 13:58:26 +01:00
David S. Miller
3981f955eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-01-15

The following pull-request contains BPF updates for your *net* tree.

We've added 12 non-merge commits during the last 9 day(s) which contain
a total of 13 files changed, 95 insertions(+), 43 deletions(-).

The main changes are:

1) Fix refcount leak for TCP time wait and request sockets for socket lookup
   related BPF helpers, from Lorenz Bauer.

2) Fix wrong verification of ARSH instruction under ALU32, from Daniel Borkmann.

3) Batch of several sockmap and related TLS fixes found while operating
   more complex BPF programs with Cilium and OpenSSL, from John Fastabend.

4) Fix sockmap to read psock's ingress_msg queue before regular sk_receive_queue()
   to avoid purging data upon teardown, from Lingpeng Chen.

5) Fix printing incorrect pointer in bpftool's btf_dump_ptr() in order to properly
   dump a BPF map's value with BTF, from Martin KaFai Lau.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-16 10:04:40 +01:00
Marcel Holtmann
117717e574 Bluetooth: Increment management interface revision
Increment the mgmt revision due to the recently added commands.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-01-16 09:40:47 +02:00
John Fastabend
7361d44896 bpf: Sockmap/tls, fix pop data with SK_DROP return code
When user returns SK_DROP we need to reset the number of copied bytes
to indicate to the user the bytes were dropped and not sent. If we
don't reset the copied arg sendmsg will return as if those bytes were
copied giving the user a positive return value.

This works as expected today except in the case where the user also
pops bytes. In the pop case the sg.size is reduced but we don't correctly
account for this when copied bytes is reset. The popped bytes are not
accounted for and we return a small positive value potentially confusing
the user.

The reason this happens is due to a typo where we do the wrong comparison
when accounting for pop bytes. In this fix notice the if/else is not
needed and that we have a similar problem if we push data except its not
visible to the user because if delta is larger the sg.size we return a
negative value so it appears as an error regardless.

Fixes: 7246d8ed4d ("bpf: helper to pop data from messages")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-9-john.fastabend@gmail.com
2020-01-15 23:26:13 +01:00
John Fastabend
9aaaa56845 bpf: Sockmap/tls, skmsg can have wrapped skmsg that needs extra chaining
Its possible through a set of push, pop, apply helper calls to construct
a skmsg, which is just a ring of scatterlist elements, with the start
value larger than the end value. For example,

      end       start
  |_0_|_1_| ... |_n_|_n+1_|

Where end points at 1 and start points and n so that valid elements is
the set {n, n+1, 0, 1}.

Currently, because we don't build the correct chain only {n, n+1} will
be sent. This adds a check and sg_chain call to correctly submit the
above to the crypto and tls send path.

Fixes: d3b18ad31f ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-8-john.fastabend@gmail.com
2020-01-15 23:26:13 +01:00
John Fastabend
d468e4775c bpf: Sockmap/tls, tls_sw can create a plaintext buf > encrypt buf
It is possible to build a plaintext buffer using push helper that is larger
than the allocated encrypt buffer. When this record is pushed to crypto
layers this can result in a NULL pointer dereference because the crypto
API expects the encrypt buffer is large enough to fit the plaintext
buffer. Kernel splat below.

To resolve catch the cases this can happen and split the buffer into two
records to send individually. Unfortunately, there is still one case to
handle where the split creates a zero sized buffer. In this case we merge
the buffers and unmark the split. This happens when apply is zero and user
pushed data beyond encrypt buffer. This fixes the original case as well
because the split allocated an encrypt buffer larger than the plaintext
buffer and the merge simply moves the pointers around so we now have
a reference to the new (larger) encrypt buffer.

Perhaps its not ideal but it seems the best solution for a fixes branch
and avoids handling these two cases, (a) apply that needs split and (b)
non apply case. The are edge cases anyways so optimizing them seems not
necessary unless someone wants later in next branches.

[  306.719107] BUG: kernel NULL pointer dereference, address: 0000000000000008
[...]
[  306.747260] RIP: 0010:scatterwalk_copychunks+0x12f/0x1b0
[...]
[  306.770350] Call Trace:
[  306.770956]  scatterwalk_map_and_copy+0x6c/0x80
[  306.772026]  gcm_enc_copy_hash+0x4b/0x50
[  306.772925]  gcm_hash_crypt_remain_continue+0xef/0x110
[  306.774138]  gcm_hash_crypt_continue+0xa1/0xb0
[  306.775103]  ? gcm_hash_crypt_continue+0xa1/0xb0
[  306.776103]  gcm_hash_assoc_remain_continue+0x94/0xa0
[  306.777170]  gcm_hash_assoc_continue+0x9d/0xb0
[  306.778239]  gcm_hash_init_continue+0x8f/0xa0
[  306.779121]  gcm_hash+0x73/0x80
[  306.779762]  gcm_encrypt_continue+0x6d/0x80
[  306.780582]  crypto_gcm_encrypt+0xcb/0xe0
[  306.781474]  crypto_aead_encrypt+0x1f/0x30
[  306.782353]  tls_push_record+0x3b9/0xb20 [tls]
[  306.783314]  ? sk_psock_msg_verdict+0x199/0x300
[  306.784287]  bpf_exec_tx_verdict+0x3f2/0x680 [tls]
[  306.785357]  tls_sw_sendmsg+0x4a3/0x6a0 [tls]

test_sockmap test signature to trigger bug,

[TEST]: (1, 1, 1, sendmsg, pass,redir,start 1,end 2,pop (1,2),ktls,):

Fixes: d3b18ad31f ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-7-john.fastabend@gmail.com
2020-01-15 23:26:13 +01:00
John Fastabend
cf21e9ba1e bpf: Sockmap/tls, msg_push_data may leave end mark in place
Leaving an incorrect end mark in place when passing to crypto
layer will cause crypto layer to stop processing data before
all data is encrypted. To fix clear the end mark on push
data instead of expecting users of the helper to clear the
mark value after the fact.

This happens when we push data into the middle of a skmsg and
have room for it so we don't do a set of copies that already
clear the end flag.

Fixes: 6fff607e2f ("bpf: sk_msg program helper bpf_msg_push_data")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-6-john.fastabend@gmail.com
2020-01-15 23:26:13 +01:00
John Fastabend
6562e29cf6 bpf: Sockmap, skmsg helper overestimates push, pull, and pop bounds
In the push, pull, and pop helpers operating on skmsg objects to make
data writable or insert/remove data we use this bounds check to ensure
specified data is valid,

 /* Bounds checks: start and pop must be inside message */
 if (start >= offset + l || last >= msg->sg.size)
     return -EINVAL;

The problem here is offset has already included the length of the
current element the 'l' above. So start could be past the end of
the scatterlist element in the case where start also points into an
offset on the last skmsg element.

To fix do the accounting slightly different by adding the length of
the previous entry to offset at the start of the iteration. And
ensure its initialized to zero so that the first iteration does
nothing.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Fixes: 6fff607e2f ("bpf: sk_msg program helper bpf_msg_push_data")
Fixes: 7246d8ed4d ("bpf: helper to pop data from messages")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-5-john.fastabend@gmail.com
2020-01-15 23:26:13 +01:00
John Fastabend
33bfe20dd7 bpf: Sockmap/tls, push write_space updates through ulp updates
When sockmap sock with TLS enabled is removed we cleanup bpf/psock state
and call tcp_update_ulp() to push updates to TLS ULP on top. However, we
don't push the write_space callback up and instead simply overwrite the
op with the psock stored previous op. This may or may not be correct so
to ensure we don't overwrite the TLS write space hook pass this field to
the ULP and have it fixup the ctx.

This completes a previous fix that pushed the ops through to the ULP
but at the time missed doing this for write_space, presumably because
write_space TLS hook was added around the same time.

Fixes: 95fa145479 ("bpf: sockmap/tls, close can race with map free")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-4-john.fastabend@gmail.com
2020-01-15 23:26:13 +01:00
John Fastabend
7e81a35302 bpf: Sockmap, ensure sock lock held during tear down
The sock_map_free() and sock_hash_free() paths used to delete sockmap
and sockhash maps walk the maps and destroy psock and bpf state associated
with the socks in the map. When done the socks no longer have BPF programs
attached and will function normally. This can happen while the socks in
the map are still "live" meaning data may be sent/received during the walk.

Currently, though we don't take the sock_lock when the psock and bpf state
is removed through this path. Specifically, this means we can be writing
into the ops structure pointers such as sendmsg, sendpage, recvmsg, etc.
while they are also being called from the networking side. This is not
safe, we never used proper READ_ONCE/WRITE_ONCE semantics here if we
believed it was safe. Further its not clear to me its even a good idea
to try and do this on "live" sockets while networking side might also
be using the socket. Instead of trying to reason about using the socks
from both sides lets realize that every use case I'm aware of rarely
deletes maps, in fact kubernetes/Cilium case builds map at init and
never tears it down except on errors. So lets do the simple fix and
grab sock lock.

This patch wraps sock deletes from maps in sock lock and adds some
annotations so we catch any other cases easier.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-3-john.fastabend@gmail.com
2020-01-15 23:26:13 +01:00
David S. Miller
8fec380ac0 This feature/cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
 
  - fix typo and kerneldocs, by Sven Eckelmann
 
  - use WiFi txbitrate for B.A.T.M.A.N. V as fallback, by René Treffer
 
  - silence some endian sparse warnings by adding annotations,
    by Sven Eckelmann
 
  - Update copyright years to 2020, by Sven Eckelmann
 
  - Disable deprecated sysfs configuration by default, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl4dzoQWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeodEbD/0fM2O/hx/CsBoWDt9rU+tA/Nh8
 CWYUn9fjghG3RpSCzIjqpcwSe4T09KHw7BvPBiWpVNmUYeNz7aLmujbaj1DmAY8l
 l50VHtJYM1aCIC8O3UcqznrUtoyEdAi78gpxUXLVNABJ9t9gkpXaS3bM06gs5QW4
 7Y5KSpN9O7a7QrLByb8KZUuZrHCSUzrVJHkc3eZ4FqAk3jRw0LjaMIkoQaKImc9D
 wOUTU0rUvuTNBZ9jteChWkfpHwa1QrRgoxYiw5d5u7FT7yOZE1NoQUbm0ZzhdMrO
 QLkcdN8n6gI8waDP47cxwNmSRpRS8S7PIyG2MSi1LNoENUcdHuPuPFAnBbXWyUqc
 QAXSe+JS2MKAT13Tt8+y34WWCtT94K7Hko1SbT4WpawdF1G+DzT2BNI9p+kOJLRV
 Ap7OHTwrZHqkTyQLbMEYdfIPFwMilYEc+O24kjOw6OCSt45F1l9bad4IwvE/iASb
 HNzLO2Jlj5gUMXShbcIzC3PGYSMWpZKoX3075SqOXjcOKfF/jUYfaUt8WSUvi1bA
 CcEMD7FMrVa0mBkhG2WUO9f/ARlfTckhg/ijD4NSDLy7CsjhWXZL7lC9N7R43ktM
 LThTJ+ey1P+K6q1OH4ytvlApSwHC1OdaA6zUhoht9XQDHrnLf6gDpTDd4xn8zKPZ
 BXbGv9bCi3/df6ocCw==
 =xeMk
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20200114' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature/cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - fix typo and kerneldocs, by Sven Eckelmann

 - use WiFi txbitrate for B.A.T.M.A.N. V as fallback, by René Treffer

 - silence some endian sparse warnings by adding annotations,
   by Sven Eckelmann

 - Update copyright years to 2020, by Sven Eckelmann

 - Disable deprecated sysfs configuration by default, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 23:04:04 +01:00
David S. Miller
5a40420e04 Here is a batman-adv bugfix:
- Fix DAT candidate selection on little endian systems,
    by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl4dzQUWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoQFuD/44PUv3jHW9S8aIu5FobWHlswOg
 e1p9g98QdO7dezkkQa64tuFMR6mgVefOeK2+LHCZQ3Svd6xeGaxEGNo6vdzrkyrj
 iRl/5nY8hRZ3jIdD7kOG11yX/ra+3BSWMWS4yu0OAawYFMYxakW6qUR7OUaqL9xv
 Qr/XHu7S/tpy4MGUDabu11KlJGyZBfyXQxEKmSjjfWVITqVKNeoZONFqm2lmLOHs
 RppvTjzdTf3HUPp6C9rC1b6phN7G8XLRXqXrgY4LzTDmwHbOmPvJQ1t+kh9dfZuX
 DYsv89dl4iVIcSUY9fyUA9YxSHRyqiXRDDG0DPIy0AqRrHy0HgdGTlwCkS1idl1H
 okiuO5rUzpDFGG5F8BRKY2t29JN8BO+/YVZpa5bl9ZgQQpnuBu+kkbceYCHY0lat
 h2y3AMsJUcWEbFsa/ulEmHw8aj38rE4lgd7lhLir04BFVZHVB+faH78MQ7xB7+Xl
 lPl4ptf3l3mwWW/tXbuWa6bXjCg5UBAOr4fEyhrg8fM0rst8LG6trWZWlWgZykLe
 3HSnaR00tH8DiMPf/Zl7tAX/dKCmzCo6h7mXUAfvw1PktOvWC1JhtjchJASOqnoP
 peD45G5W3DGgxtmfcl6J33BmMllbxeT1N6sQRApliYvcrrMZ9gVzr2sHFxHftFXl
 kyCpbAlfAZ6ceGYCuQ==
 =6zKP
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20200114' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here is a batman-adv bugfix:

 - Fix DAT candidate selection on little endian systems,
   by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 22:44:23 +01:00
Pengcheng Yang
e176b1ba47 tcp: fix marked lost packets not being retransmitted
When the packet pointed to by retransmit_skb_hint is unlinked by ACK,
retransmit_skb_hint will be set to NULL in tcp_clean_rtx_queue().
If packet loss is detected at this time, retransmit_skb_hint will be set
to point to the current packet loss in tcp_verify_retransmit_hint(),
then the packets that were previously marked lost but not retransmitted
due to the restriction of cwnd will be skipped and cannot be
retransmitted.

To fix this, when retransmit_skb_hint is NULL, retransmit_skb_hint can
be reset only after all marked lost packets are retransmitted
(retrans_out >= lost_out), otherwise we need to traverse from
tcp_rtx_queue_head in tcp_xmit_retransmit_queue().

Packetdrill to demonstrate:

// Disable RACK and set max_reordering to keep things simple
    0 `sysctl -q net.ipv4.tcp_recovery=0`
   +0 `sysctl -q net.ipv4.tcp_max_reordering=3`

// Establish a connection
   +0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

  +.1 < S 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
   +0 > S. 0:0(0) ack 1 <...>
 +.01 < . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

// Send 8 data segments
   +0 write(4, ..., 8000) = 8000
   +0 > P. 1:8001(8000) ack 1

// Enter recovery and 1:3001 is marked lost
 +.01 < . 1:1(0) ack 1 win 257 <sack 3001:4001,nop,nop>
   +0 < . 1:1(0) ack 1 win 257 <sack 5001:6001 3001:4001,nop,nop>
   +0 < . 1:1(0) ack 1 win 257 <sack 5001:7001 3001:4001,nop,nop>

// Retransmit 1:1001, now retransmit_skb_hint points to 1001:2001
   +0 > . 1:1001(1000) ack 1

// 1001:2001 was ACKed causing retransmit_skb_hint to be set to NULL
 +.01 < . 1:1(0) ack 2001 win 257 <sack 5001:8001 3001:4001,nop,nop>
// Now retransmit_skb_hint points to 4001:5001 which is now marked lost

// BUG: 2001:3001 was not retransmitted
   +0 > . 2001:3001(1000) ack 1

Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 22:34:31 +01:00
Luiz Augusto von Dentz
1b1d29e514 Bluetooth: Make use of __check_timeout on hci_sched_le
This reuse __check_timeout on hci_sched_le following the same logic
used hci_sched_acl.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-15 22:28:51 +01:00
Luiz Augusto von Dentz
f9a619db7c Bluetooth: monitor: Add support for ISO packets
This enables passing ISO packets to the monitor socket.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-15 22:28:51 +01:00
Alain Michaud
600a87490f Bluetooth: Implementation of MGMT_OP_SET_BLOCKED_KEYS.
MGMT command is added to receive the list of blocked keys from
user-space.

The list is used to:
1) Block keys from being distributed by the device during
   the ke distribution phase of SMP.
2) Filter out any keys that were previously saved so
   they are no longer used.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-15 22:28:41 +01:00
Magnus Karlsson
d3a56931f9 xsk: Support allocations of large umems
When registering a umem area that is sufficiently large (>1G on an
x86), kmalloc cannot be used to allocate one of the internal data
structures, as the size requested gets too large. Use kvmalloc instead
that falls back on vmalloc if the allocation is too large for kmalloc.

Also add accounting for this structure as it is triggered by a user
space action (the XDP_UMEM_REG setsockopt) and it is by far the
largest structure of kernel allocated memory in xsk.

Reported-by: Ryan Goodfellow <rgoodfel@isi.edu>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/bpf/1578995365-7050-1-git-send-email-magnus.karlsson@intel.com
2020-01-15 11:41:52 -08:00
Trond Myklebust
b32d285539 SUNRPC: Remove broken gss_mech_list_pseudoflavors()
Remove gss_mech_list_pseudoflavors() and its callers. This is part of
an unused API, and could leak an RCU reference if it were ever called.

Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15 10:54:32 -05:00
Chuck Lever
e515dd9d76 xprtrdma: DMA map rr_rdma_buf as each rpcrdma_rep is created
Clean up: This simplifies the logic in rpcrdma_post_recvs.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15 10:54:32 -05:00
Chuck Lever
b7ff0185e9 xprtrdma: Destroy reps from previous connection instance
To safely get rid of all rpcrdma_reps from a particular connection
instance, xprtrdma has to wait until each of those reps is finished
being used. A rep may be backing the rq_rcv_buf of an RPC that has
just completed, for example.

Since it is safe to invoke rpcrdma_rep_destroy() only in the Receive
completion handler, simply mark reps remaining in the rb_all_reps
list after the transport is drained. These will then be deleted as
rpcrdma_post_recvs pulls them off the rep free list.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15 10:54:32 -05:00
Chuck Lever
85810388a9 xprtrdma: Destroy rpcrdma_rep when Receive is flushed
This reduces the hardware and memory footprint of an unconnected
transport.

At some point in the future, transport reconnect will allow
resolving the destination IP address through a different device. The
current change enables reps for the new connection to be allocated
on whichever NUMA node the new device affines to after a reconnect.

Note that this does not destroy _all_ the transport's reps... there
will be a few that are still part of a running RPC completion.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15 10:54:32 -05:00
Chuck Lever
b78de1dca0 xprtrdma: Allocate and map transport header buffers at connect time
Currently the underlying RDMA device is chosen at transport set-up
time. But it will soon be at connect time instead.

The maximum size of a transport header is based on device
capabilities. Thus transport header buffers have to be allocated
_after_ the underlying device has been chosen (via address and route
resolution); ie, in the connect worker.

Thus, move the allocation of transport header buffers to the connect
worker, after the point at which the underlying RDMA device has been
chosen.

This also means the RDMA device is available to do a DMA mapping of
these buffers at connect time, instead of in the hot I/O path. Make
that optimization as well.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15 10:54:32 -05:00
Chuck Lever
25868e610a xprtrdma: Refactor frwr_is_supported
Refactor: Perform the "is supported" check in rpcrdma_ep_create()
instead of in rpcrdma_ia_open(). frwr_open() is where most of the
logic to query device attributes is already located.

The current code displays a redundant error message when the device
does not support FRWR. As an additional clean-up, this patch removes
the extra message.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15 10:54:32 -05:00
Chuck Lever
18d065a5d4 xprtrdma: Eliminate per-transport "max pages"
To support device hotplug and migrating a connection between devices
of different capabilities, we have to guarantee that all in-kernel
devices can support the same max NFS payload size (1 megabyte).

This means that possibly one or two in-tree devices are no longer
supported for NFS/RDMA because they cannot support 1MB rsize/wsize.
The only one I confirmed was cxgb3, but it has already been removed
from the kernel.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15 10:54:32 -05:00
Chuck Lever
7581d90109 xprtrdma: Refactor initialization of ep->rep_max_requests
Clean up: there is no need to keep two copies of the same value.
Also, in subsequent patches, rpcrdma_ep_create() will be called in
the connect worker rather than at set-up time.

Minor fix: Initialize the transport's sendctx to the value based on
the capabilities of the underlying device, not the maximum setting.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15 10:54:32 -05:00
Chuck Lever
cb586decbb xprtrdma: Make sendctx queue lifetime the same as connection lifetime
The size of the sendctx queue depends on the value stored in
ia->ri_max_send_sges. This value is determined by querying the
underlying device.

Eventually, rpcrdma_ia_open() and rpcrdma_ep_create() will be called
in the connect worker rather than at transport set-up time. The
underlying device will not have been chosen device set-up time.

The sendctx queue will thus have to be created after the underlying
device has been chosen via address and route resolution; in other
words, in the connect worker.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15 10:54:32 -05:00
Chuck Lever
2e87036814 xprtrdma: Eliminate ri_max_send_sges
Clean-up. The max_send_sge value also happens to be stored in
ep->rep_attr. Let's keep just a single copy.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15 10:54:32 -05:00
Julia Lawall
c2bd2c0a55 SUNRPC: constify copied structure
The empty_iov structure is only copied into another structure,
so make it const.

The opportunity for this change was found using Coccinelle.

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15 10:54:31 -05:00
Chuck Lever
b8457606d9 SUNRPC: call_connect_status should handle -EPROTO
The xprtrdma connect logic can return -EPROTO if the underlying
device or network path does not support RDMA. This can happen
after a device removal/insertion.

- When SOFTCONN is set, EPROTO is a permanent error.

- When SOFTCONN is not set, EPROTO is treated as a temporary error.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15 10:54:31 -05:00
Chuck Lever
abf8af78a6 SUNRPC: Capture signalled RPC tasks
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15 10:54:31 -05:00
Arnd Bergmann
52879b464a sunrpc: convert to time64_t for expiry
Using signed 32-bit types for UTC time leads to the y2038 overflow,
which is what happens in the sunrpc code at the moment.

This changes the sunrpc code over to use time64_t where possible.
The one exception is the gss_import_v{1,2}_context() function for
kerberos5, which uses 32-bit timestamps in the protocol. Here,
we can at least treat the numbers as 'unsigned', which extends the
range from 2038 to 2106.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-15 10:54:30 -05:00
Nikolay Aleksandrov
f545923b4a net: bridge: vlan: notify on vlan add/delete/change flags
Now that we can notify, send a notification on add/del or change of flags.
Notifications are also compressed when possible to reduce their number
and relieve user-space of extra processing, due to that we have to
manually notify after each add/del in order to avoid double
notifications. We try hard to notify only about the vlans which actually
changed, thus a single command can result in multiple notifications
about disjoint ranges if there were vlans which didn't change inside.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 13:48:18 +01:00
Nikolay Aleksandrov
cf5bddb95c net: bridge: vlan: add rtnetlink group and notify support
Add a new rtnetlink group for bridge vlan notifications - RTNLGRP_BRVLAN
and add support for sending vlan notifications (both single and ranges).
No functional changes intended, the notification support will be used by
later patches.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 13:48:18 +01:00
Nikolay Aleksandrov
0ab5587951 net: bridge: vlan: add rtm range support
Add a new vlandb nl attribute - BRIDGE_VLANDB_ENTRY_RANGE which causes
RTM_NEWVLAN/DELVAN to act on a range. Dumps now automatically compress
similar vlans into ranges. This will be also used when per-vlan options
are introduced and vlans' options match, they will be put into a single
range which is encapsulated in one netlink attribute. We need to run
similar checks as br_process_vlan_info() does because these ranges will
be used for options setting and they'll be able to skip
br_process_vlan_info().

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 13:48:18 +01:00
Nikolay Aleksandrov
adb3ce9bcb net: bridge: vlan: add del rtm message support
Adding RTM_DELVLAN support similar to RTM_NEWVLAN is simple, just need to
map DELVLAN to DELLINK and register the handler.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 13:48:17 +01:00
Nikolay Aleksandrov
f26b296585 net: bridge: vlan: add new rtm message support
Add initial RTM_NEWVLAN support which can only create vlans, operating
similar to the current br_afspec(). We will use it later to also change
per-vlan options. Old-style (flag-based) vlan ranges are not allowed
when using RTM messages, we will introduce vlan ranges later via a new
nested attribute which would allow us to have all the information about a
range encapsulated into a single nl attribute.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 13:48:17 +01:00
Nikolay Aleksandrov
8dcea18708 net: bridge: vlan: add rtm definitions and dump support
This patch adds vlan rtm definitions:
 - NEWVLAN: to be used for creating vlans, setting options and
   notifications
 - DELVLAN: to be used for deleting vlans
 - GETVLAN: used for dumping vlan information

Dumping vlans which can span multiple messages is added now with basic
information (vid and flags). We use nlmsg_parse() to validate the header
length in order to be able to extend the message with filtering
attributes later.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 13:48:17 +01:00
Nikolay Aleksandrov
8f4cc940a1 net: bridge: netlink: add extack error messages when processing vlans
Add extack messages on vlan processing errors. We need to move the flags
missing check after the "last" check since we may have "last" set but
lack a range end flag in the next entry.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 13:48:17 +01:00
Nikolay Aleksandrov
5a46facbbc net: bridge: vlan: add helpers to check for vlan id/range validity
Add helpers to check if a vlan id or range are valid. The range helper
must be called when range start or end are detected.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 13:48:17 +01:00
David S. Miller
eb507906fe A few fixes:
* -O3 enablement fallout, thanks to Arnd who ran this
  * fixes for a few leaks, thanks to Felix
  * channel 12 regulatory fix for custom regdomains
  * check for a crash reported by syzbot
    (NULL function is called on drivers that don't have it)
  * fix TKIP replay protection after setup with some APs
    (from Jouni)
  * restrict obtaining some mesh data to avoid WARN_ONs
  * fix deadlocks with auto-disconnect (socket owner)
  * fix radar detection events with multiple devices
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl4e4voACgkQB8qZga/f
 l8RVFg/+Ngjzje8xU6Ymh7VSwvBBz6CmSVwH/nRSsIAszDrQLoDCo7+eWeUHnAKG
 2ViFQ4Lf5CP2J9kSgi7pbD0yGQRDGxH1CVSxMB64uZZRErlOCZYLRTIeBn7HGb8w
 JhqAVXBjskzP9iB5jUZIkOgJ6L2ESM2KYKQVGtgn8p9fFdHKitec8UEKiPU3tcX1
 zx94EbKldynuvDNb7VkTm4MdoY8LK5KI8ovVdVKwJ05qCtLvc6xvEBYhm33Fl6Pe
 66cWklCrU093jQXUDTk0yBkwlcXGxJc143y1nXuurI8ZLH6q2wdDUAMHUEKdL6mz
 ktHxABK4JA4Z/ylphR2Tp1TNh8xnz/ZZ5IdAe4P9K0/PerrjH8Yk7zf/48lrHTSv
 LKb86Oaoq4MY4k5UNyjDUdgPB/q8CgxN2xDQaeQX69u740wrlxYtrwgHmqHrKI69
 ppCzL7a3wmJzAz/97TMFhSseCH1WEQfWY+tlzktrvRofSgOGsIUtcPP10NmvGZHj
 zdF79E9TiUPX4KVWssRb9A2MjBbvKA31hCQbH/OVKSWt/fItoBUbMQKtPxfnHcam
 /W/gCWrOpw5Qqhc/BdFbYYsgY9J06KzifsV5e2Z6he+6Ex+rtdZNoLXc6/4QH3+9
 HzsVYBP+ZUaqrMJi64KDN59C3ni4jDVs9zfYnVkC9LhzHASI5EE=
 =i20C
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2020-01-15' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
A few fixes:
 * -O3 enablement fallout, thanks to Arnd who ran this
 * fixes for a few leaks, thanks to Felix
 * channel 12 regulatory fix for custom regdomains
 * check for a crash reported by syzbot
   (NULL function is called on drivers that don't have it)
 * fix TKIP replay protection after setup with some APs
   (from Jouni)
 * restrict obtaining some mesh data to avoid WARN_ONs
 * fix deadlocks with auto-disconnect (socket owner)
 * fix radar detection events with multiple devices
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 04:12:00 -08:00
Ulrich Weber
4e4362d2bf xfrm: support output_mark for offload ESP packets
Commit 9b42c1f179 ("xfrm: Extend the output_mark") added output_mark
support but missed ESP offload support.

xfrm_smark_get() is not called within xfrm_input() for packets coming
from esp4_gro_receive() or esp6_gro_receive(). Therefore call
xfrm_smark_get() directly within these functions.

Fixes: 9b42c1f179 ("xfrm: Extend the output_mark to support input direction and masking.")
Signed-off-by: Ulrich Weber <ulrich.weber@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-01-15 12:18:35 +01:00
Thomas Pedersen
30b2f0be23 mac80211: add ieee80211_is_any_nullfunc()
commit 08a5bdde38 ("mac80211: consider QoS Null frames for STA_NULLFUNC_ACKED")
Fixed a bug where we failed to take into account a
nullfunc frame can be either non-QoS or QoS. It turns out
there is at least one more bug in
ieee80211_sta_tx_notify(), introduced in
commit 7b6ddeaf27 ("mac80211: use QoS NDP for AP probing"),
where we forgot to check for the QoS variant and so
assumed the QoS nullfunc frame never went out

Fix this by adding a helper ieee80211_is_any_nullfunc()
which consolidates the check for non-QoS and QoS nullfunc
frames. Replace existing compound conditionals and add a
couple more missing checks for QoS variant.

Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200114055940.18502-3-thomas@adapt-ip.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 11:20:13 +01:00
Markus Theil
5972fa15b9 mac80211: fix tx status for no ack cases
Before this patch, frames which where successfully transmitted without
requiring acks where accounted as lost frames.

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20191218142736.15843-1-markus.theil@tu-ilmenau.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 11:19:20 +01:00
John Crispin
dd56e90230 mac80211: add handling for BSS color
It is now possible to propagate BSS color settings into the subsystem. Lets
make mac80211 also handle them so that we can send them further down the
stack into the drivers. We drop the old bss_color field and change iwlwifi
to use the new he_bss_color struct.

Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20191217141921.8114-2-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 11:18:50 +01:00
John Crispin
5c5e52d1bb nl80211: add handling for BSS color
This patch adds the attributes, policy and parsing code to allow userland
to send the info about the BSS coloring settings to the kernel.

Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20191217141921.8114-1-john@phrozen.org
[johannes: remove the strict policy parsing, that was a misunderstanding]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 11:14:24 +01:00
Felix Fietkau
81c044fc3b cfg80211: fix page refcount issue in A-MSDU decap
The fragments attached to a skb can be part of a compound page. In that case,
page_ref_inc will increment the refcount for the wrong page. Fix this by
using get_page instead, which calls page_ref_inc on the compound head and
also checks for overflow.

Fixes: 2b67f944f8 ("cfg80211: reuse existing page fragments in A-MSDU rx")
Cc: stable@vger.kernel.org
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200113182107.20461-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:53:35 +01:00
Johannes Berg
24953de0a5 cfg80211: check for set_wiphy_params
Check if set_wiphy_params is assigned and return an error if not,
some drivers (e.g. virt_wifi where syzbot reported it) don't have
it.

Reported-by: syzbot+e8a797964a4180eb57d5@syzkaller.appspotmail.com
Reported-by: syzbot+34b582cf32c1db008f8e@syzkaller.appspotmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200113125358.ac07f276efff.Ibd85ee1b12e47b9efb00a2adc5cd3fac50da791a@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:53:24 +01:00
Felix Fietkau
df16737d43 cfg80211: fix memory leak in cfg80211_cqm_rssi_update
The per-tid statistics need to be released after the call to rdev_get_station

Cc: stable@vger.kernel.org
Fixes: 8689c051a2 ("cfg80211: dynamically allocate per-tid stats for station info")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200108170630.33680-2-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:53:12 +01:00
Felix Fietkau
2a279b3416 cfg80211: fix memory leak in nl80211_probe_mesh_link
The per-tid statistics need to be released after the call to rdev_get_station

Cc: stable@vger.kernel.org
Fixes: 5ab92e7fe4 ("cfg80211: add support to probe unexercised mesh link")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200108170630.33680-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:53:02 +01:00
Markus Theil
5a128a088a cfg80211: fix deadlocks in autodisconnect work
Use methods which do not try to acquire the wdev lock themselves.

Cc: stable@vger.kernel.org
Fixes: 37b1c00468 ("cfg80211: Support all iftypes in autodisconnect_wk")
Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20200108115536.2262-1-markus.theil@tu-ilmenau.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:52:49 +01:00
Arnd Bergmann
e16119655c wireless: wext: avoid gcc -O3 warning
After the introduction of CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE_O3,
the wext code produces a bogus warning:

In function 'iw_handler_get_iwstats',
    inlined from 'ioctl_standard_call' at net/wireless/wext-core.c:1015:9,
    inlined from 'wireless_process_ioctl' at net/wireless/wext-core.c:935:10,
    inlined from 'wext_ioctl_dispatch.part.8' at net/wireless/wext-core.c:986:8,
    inlined from 'wext_handle_ioctl':
net/wireless/wext-core.c:671:3: error: argument 1 null where non-null expected [-Werror=nonnull]
   memcpy(extra, stats, sizeof(struct iw_statistics));
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from arch/x86/include/asm/string.h:5,
net/wireless/wext-core.c: In function 'wext_handle_ioctl':
arch/x86/include/asm/string_64.h:14:14: note: in a call to function 'memcpy' declared here

The problem is that ioctl_standard_call() sometimes calls the handler
with a NULL argument that would cause a problem for iw_handler_get_iwstats.
However, iw_handler_get_iwstats never actually gets called that way.

Marking that function as noinline avoids the warning and leads
to slightly smaller object code as well.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lore.kernel.org/r/20200107200741.3588770-1-arnd@arndb.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:52:20 +01:00
Jouni Malinen
6f60126521 mac80211: Fix TKIP replay protection immediately after key setup
TKIP replay protection was skipped for the very first frame received
after a new key is configured. While this is potentially needed to avoid
dropping a frame in some cases, this does leave a window for replay
attacks with group-addressed frames at the station side. Any earlier
frame sent by the AP using the same key would be accepted as a valid
frame and the internal RSC would then be updated to the TSC from that
frame. This would allow multiple previously transmitted group-addressed
frames to be replayed until the next valid new group-addressed frame
from the AP is received by the station.

Fix this by limiting the no-replay-protection exception to apply only
for the case where TSC=0, i.e., when this is for the very first frame
protected using the new key, and the local RSC had not been set to a
higher value when configuring the key (which may happen with GTK).

Signed-off-by: Jouni Malinen <j@w1.fi>
Link: https://lore.kernel.org/r/20200107153545.10934-1-j@w1.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:52:12 +01:00
Orr Mazor
26ec17a1dc cfg80211: Fix radar event during another phy CAC
In case a radar event of CAC_FINISHED or RADAR_DETECTED
happens during another phy is during CAC we might need
to cancel that CAC.

If we got a radar in a channel that another phy is now
doing CAC on then the CAC should be canceled there.

If, for example, 2 phys doing CAC on the same channels,
or on comptable channels, once on of them will finish his
CAC the other might need to cancel his CAC, since it is no
longer relevant.

To fix that the commit adds an callback and implement it in
mac80211 to end CAC.
This commit also adds a call to said callback if after a radar
event we see the CAC is no longer relevant

Signed-off-by: Orr Mazor <Orr.Mazor@tandemg.com>
Reviewed-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20191222145449.15792-1-Orr.Mazor@tandemg.com
[slightly reformat/reword commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:50:48 +01:00
Ganapathi Bhat
c4b9d655e4 wireless: fix enabling channel 12 for custom regulatory domain
Commit e33e2241e2 ("Revert "cfg80211: Use 5MHz bandwidth by
default when checking usable channels"") fixed a broken
regulatory (leaving channel 12 open for AP where not permitted).
Apply a similar fix to custom regulatory domain processing.

Signed-off-by: Cathy Luo <xiaohua.luo@nxp.com>
Signed-off-by: Ganapathi Bhat <ganapathi.bhat@nxp.com>
Link: https://lore.kernel.org/r/1576836859-8945-1-git-send-email-ganapathi.bhat@nxp.com
[reword commit message, fix coding style, add a comment]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:47:52 +01:00
Ido Schimmel
bb3c4ab93e ipv6: Add "offload" and "trap" indications to routes
In a similar fashion to previous patch, add "offload" and "trap"
indication to IPv6 routes.

This is done by using two unused bits in 'struct fib6_info' to hold
these indications. Capable drivers are expected to set these when
processing the various in-kernel route notifications.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 18:53:35 -08:00
Ido Schimmel
90b93f1b31 ipv4: Add "offload" and "trap" indications to routes
When performing L3 offload, routes and nexthops are usually programmed
into two different tables in the underlying device. Therefore, the fact
that a nexthop resides in hardware does not necessarily mean that all
the associated routes also reside in hardware and vice-versa.

While the kernel can signal to user space the presence of a nexthop in
hardware (via 'RTNH_F_OFFLOAD'), it does not have a corresponding flag
for routes. In addition, the fact that a route resides in hardware does
not necessarily mean that the traffic is offloaded. For example,
unreachable routes (i.e., 'RTN_UNREACHABLE') are programmed to trap
packets to the CPU so that the kernel will be able to generate the
appropriate ICMP error packet.

This patch adds an "offload" and "trap" indications to IPv4 routes, so
that users will have better visibility into the offload process.

'struct fib_alias' is extended with two new fields that indicate if the
route resides in hardware or not and if it is offloading traffic from
the kernel or trapping packets to it. Note that the new fields are added
in the 6 bytes hole and therefore the struct still fits in a single
cache line [1].

Capable drivers are expected to invoke fib_alias_hw_flags_set() with the
route's key in order to set the flags.

The indications are dumped to user space via a new flags (i.e.,
'RTM_F_OFFLOAD' and 'RTM_F_TRAP') in the 'rtm_flags' field in the
ancillary header.

v2:
* Make use of 'struct fib_rt_info' in fib_alias_hw_flags_set()

[1]
struct fib_alias {
        struct hlist_node  fa_list;                      /*     0    16 */
        struct fib_info *          fa_info;              /*    16     8 */
        u8                         fa_tos;               /*    24     1 */
        u8                         fa_type;              /*    25     1 */
        u8                         fa_state;             /*    26     1 */
        u8                         fa_slen;              /*    27     1 */
        u32                        tb_id;                /*    28     4 */
        s16                        fa_default;           /*    32     2 */
        u8                         offload:1;            /*    34: 0  1 */
        u8                         trap:1;               /*    34: 1  1 */
        u8                         unused:6;             /*    34: 2  1 */

        /* XXX 5 bytes hole, try to pack */

        struct callback_head rcu __attribute__((__aligned__(8))); /*    40    16 */

        /* size: 56, cachelines: 1, members: 12 */
        /* sum members: 50, holes: 1, sum holes: 5 */
        /* sum bitfield members: 8 bits (1 bytes) */
        /* forced alignments: 1, forced holes: 1, sum forced holes: 5 */
        /* last cacheline: 56 bytes */
} __attribute__((__aligned__(8)));

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 18:53:35 -08:00
Ido Schimmel
1e301fd04e ipv4: Encapsulate function arguments in a struct
fib_dump_info() is used to prepare RTM_{NEW,DEL}ROUTE netlink messages
using the passed arguments. Currently, the function takes 11 arguments,
6 of which are attributes of the route being dumped (e.g., prefix, TOS).

The next patch will need the function to also dump to user space an
indication if the route is present in hardware or not. Instead of
passing yet another argument, change the function to take a struct
containing the different route attributes.

v2:
* Name last argument of fib_dump_info()
* Move 'struct fib_rt_info' to include/net/ip_fib.h so that it could
  later be passed to fib_alias_hw_flags_set()

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 18:53:35 -08:00
Ido Schimmel
6324d0fa03 ipv4: Replace route in list before notifying
Subsequent patches will add an offload / trap indication to routes which
will signal if the route is present in hardware or not.

After programming the route to the hardware, drivers will have to ask
the IPv4 code to set the flags by passing the route's key.

In the case of route replace, the new route is notified before it is
actually inserted into the FIB alias list. This can prevent simple
drivers (e.g., netdevsim) that program the route to the hardware in the
same context it is notified in from being able to set the flag.

Solve this by first inserting the new route to the list and rollback the
operation in case the route was vetoed.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 18:53:35 -08:00
Bjorn Andersson
e04df98adf net: qrtr: Remove receive worker
Rather than enqueuing messages and scheduling a worker to deliver them
to the individual sockets we can now, thanks to the previous work, move
this directly into the endpoint callback.

This saves us a context switch per incoming message and removes the
possibility of an opportunistic suspend to happen between the message is
coming from the endpoint until it ends up in the socket's receive
buffer.

Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 18:36:42 -08:00
Bjorn Andersson
f16a4b26f3 net: qrtr: Make qrtr_port_lookup() use RCU
The important part of qrtr_port_lookup() wrt synchronization is that the
function returns a reference counted struct qrtr_sock, or fail.

As such we need only to ensure that an decrement of the object's
refcount happens inbetween the finding of the object in the idr and
qrtr_port_lookup()'s own increment of the object.

By using RCU and putting a synchronization point after we remove the
mapping from the idr, but before it can be released we achieve this -
with the benefit of not having to hold the mutex in qrtr_port_lookup().

Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 18:36:42 -08:00
Bjorn Andersson
0a7e0d0ef0 net: qrtr: Migrate node lookup tree to spinlock
Move operations on the qrtr_nodes radix tree under a separate spinlock
and make the qrtr_nodes tree GFP_ATOMIC, to allow operation from atomic
context in a subsequent patch.

Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 18:36:42 -08:00
Bjorn Andersson
5fdeb0d372 net: qrtr: Implement outgoing flow control
In order to prevent overconsumption of resources on the remote side QRTR
implements a flow control mechanism.

The mechanism works by the sender keeping track of the number of
outstanding unconfirmed messages that has been transmitted to a
particular node/port pair.

Upon count reaching a low watermark (L) the confirm_rx bit is set in the
outgoing message and when the count reaching a high watermark (H)
transmission will be blocked upon the reception of a resume_tx message
from the remote, that resets the counter to 0.

This guarantees that there will be at most 2H - L messages in flight.
Values chosen for L and H are 5 and 10 respectively.

Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 18:36:41 -08:00
Bjorn Andersson
cb6530b99f net: qrtr: Move resume-tx transmission to recvmsg
The confirm-rx bit is used to implement a per port flow control, in
order to make sure that no messages are dropped due to resource
exhaustion. Move the resume-tx transmission to recvmsg to only confirm
messages as they are consumed by the application.

Signed-off-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 18:36:26 -08:00
Niu Xilei
7786a1af2a pktgen: Allow configuration of IPv6 source address range
Pktgen can use only one IPv6 source address from output device or src6
command setting. In pressure test we need create lots of sessions more
than 65535. So add src6_min and src6_max command to set the range.

Signed-off-by: Niu Xilei <niu_xilei@163.com>

Changes since v3:
 - function set_src_in6_addr use static instead of static inline
 - precompute min_in6_l,min_in6_h,max_in6_h,max_in6_l in setup time
Changes since v2:
 - reword subject line
Changes since v1:
 - only create IPv6 source address over least significant 64 bit range

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 18:30:56 -08:00
Sunil Muthuswamy
c742c59e1f hv_sock: Remove the accept port restriction
Currently, hv_sock restricts the port the guest socket can accept
connections on. hv_sock divides the socket port namespace into two parts
for server side (listening socket), 0-0x7FFFFFFF & 0x80000000-0xFFFFFFFF
(there are no restrictions on client port namespace). The first part
(0-0x7FFFFFFF) is reserved for sockets where connections can be accepted.
The second part (0x80000000-0xFFFFFFFF) is reserved for allocating ports
for the peer (host) socket, once a connection is accepted.
This reservation of the port namespace is specific to hv_sock and not
known by the generic vsock library (ex: af_vsock). This is problematic
because auto-binds/ephemeral ports are handled by the generic vsock
library and it has no knowledge of this port reservation and could
allocate a port that is not compatible with hv_sock (and legitimately so).
The issue hasn't surfaced so far because the auto-bind code of vsock
(__vsock_bind_stream) prior to the change 'VSOCK: bind to random port for
VMADDR_PORT_ANY' would start walking up from LAST_RESERVED_PORT (1023) and
start assigning ports. That will take a large number of iterations to hit
0x7FFFFFFF. But, after the above change to randomize port selection, the
issue has started coming up more frequently.
There has really been no good reason to have this port reservation logic
in hv_sock from the get go. Reserving a local port for peer ports is not
how things are handled generally. Peer ports should reflect the peer port.
This fixes the issue by lifting the port reservation, and also returns the
right peer port. Since the code converts the GUID to the peer port (by
using the first 4 bytes), there is a possibility of conflicts, but that
seems like a reasonable risk to take, given this is limited to vsock and
that only applies to all local sockets.

Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 11:50:53 -08:00
Jason A. Donenfeld
9f3ef3d702 net: mac80211: use skb_list_walk_safe helper for gso segments
This is a conversion case for the new function, keeping the flow of the
existing code as intact as possible. We also switch over to using
skb_mark_not_on_list instead of a null write to skb->next.

Finally, this code appeared to have a memory leak in the case where
header building fails before the last gso segment. In that case, the
remaining segments are not freed. So this commit also adds the proper
kfree_skb_list call for the remainder of the skbs.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 11:48:41 -08:00
Jason A. Donenfeld
2670ee77c9 net: netfilter: use skb_list_walk_safe helper for gso segments
This is a straight-forward conversion case for the new function, keeping
the flow of the existing code as intact as possible.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 11:48:41 -08:00
Jason A. Donenfeld
88bebdf5b2 net: ipv4: use skb_list_walk_safe helper for gso segments
This is a straight-forward conversion case for the new function, keeping
the flow of the existing code as intact as possible.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 11:48:41 -08:00
Jason A. Donenfeld
b950d8a5b3 net: sched: use skb_list_walk_safe helper for gso segments
This is a straight-forward conversion case for the new function, keeping
the flow of the existing code as intact as possible.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 11:48:41 -08:00
Jason A. Donenfeld
2cec4448db net: openvswitch: use skb_list_walk_safe helper for gso segments
This is a straight-forward conversion case for the new function, keeping
the flow of the existing code as intact as possible.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 11:48:41 -08:00
Jason A. Donenfeld
c3b18e0d92 net: xfrm: use skb_list_walk_safe helper for gso segments
This is converts xfrm segment iteration to use the new function, keeping
the flow of the existing code as intact as possible. One case is very
straight-forward, whereas the other case has some more subtle code that
likes to peak at ->next and relink skbs. By keeping the variables the
same as before, we can upgrade this code with minimal surgery required.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 11:48:41 -08:00
Jason A. Donenfeld
1a186c14ce net: udp: use skb_list_walk_safe helper for gso segments
This is a straight-forward conversion case for the new function,
iterating over the return value from udp_rcv_segment, which actually is
a wrapper around skb_gso_segment.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 11:48:41 -08:00
Li RongQing
0a29275b63 bpf: Return -EBADRQC for invalid map type in __bpf_tx_xdp_map
A negative value should be returned if map->map_type is invalid
although that is impossible now, but if we run into such situation
in future, then xdpbuff could be leaked.

Daniel Borkmann suggested:

-EBADRQC should be returned to stay consistent with generic XDP
for the tracepoint output and not to be confused with -EOPNOTSUPP
from other locations like dev_map_enqueue() when ndo_xdp_xmit is
missing and such.

Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/1578618277-18085-1-git-send-email-lirongqing@baidu.com
2020-01-14 20:32:24 +01:00
Guillaume Nault
8d7e5dee97 netns: don't disable BHs when locking "nsid_lock"
When peernet2id() had to lock "nsid_lock" before iterating through the
nsid table, we had to disable BHs, because VXLAN can call peernet2id()
from the xmit path:
  vxlan_xmit() -> vxlan_fdb_miss() -> vxlan_fdb_notify()
    -> __vxlan_fdb_notify() -> vxlan_fdb_info() -> peernet2id().

Now that peernet2id() uses RCU protection, "nsid_lock" isn't used in BH
context anymore. Therefore, we can safely use plain
spin_lock()/spin_unlock() and let BHs run when holding "nsid_lock".

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 11:28:40 -08:00
Guillaume Nault
2dce224f46 netns: protect netns ID lookups with RCU
__peernet2id() can be protected by RCU as it only calls idr_for_each(),
which is RCU-safe, and never modifies the nsid table.

rtnl_net_dumpid() can also do lockless lookups. It does two nested
idr_for_each() calls on nsid tables (one direct call and one indirect
call because of rtnl_net_dumpid_one() calling __peernet2id()). The
netnsid tables are never updated. Therefore it is safe to not take the
nsid_lock and run within an RCU-critical section instead.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 11:28:40 -08:00
Guillaume Nault
4905294162 netns: Remove __peernet2id_alloc()
__peernet2id_alloc() was used for both plain lookups and for netns ID
allocations (depending the value of '*alloc'). Let's separate lookups
from allocations instead. That is, integrate the lookup code into
__peernet2id() and make peernet2id_alloc() responsible for allocating
new netns IDs when necessary.

This makes it clear that __peernet2id() doesn't modify the idr and
prepares the code for lockless lookups.

Also, mark the 'net' argument of __peernet2id() as 'const', since we're
modifying this line.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 11:28:40 -08:00
Chuck Lever
671c450b6f xprtrdma: Fix oops in Receive handler after device removal
Since v5.4, a device removal occasionally triggered this oops:

Dec  2 17:13:53 manet kernel: BUG: unable to handle page fault for address: 0000000c00000219
Dec  2 17:13:53 manet kernel: #PF: supervisor read access in kernel mode
Dec  2 17:13:53 manet kernel: #PF: error_code(0x0000) - not-present page
Dec  2 17:13:53 manet kernel: PGD 0 P4D 0
Dec  2 17:13:53 manet kernel: Oops: 0000 [#1] SMP
Dec  2 17:13:53 manet kernel: CPU: 2 PID: 468 Comm: kworker/2:1H Tainted: G        W         5.4.0-00050-g53717e43af61 #883
Dec  2 17:13:53 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
Dec  2 17:13:53 manet kernel: Workqueue: ib-comp-wq ib_cq_poll_work [ib_core]
Dec  2 17:13:53 manet kernel: RIP: 0010:rpcrdma_wc_receive+0x7c/0xf6 [rpcrdma]
Dec  2 17:13:53 manet kernel: Code: 6d 8b 43 14 89 c1 89 45 78 48 89 4d 40 8b 43 2c 89 45 14 8b 43 20 89 45 18 48 8b 45 20 8b 53 14 48 8b 30 48 8b 40 10 48 8b 38 <48> 8b 87 18 02 00 00 48 85 c0 75 18 48 8b 05 1e 24 c4 e1 48 85 c0
Dec  2 17:13:53 manet kernel: RSP: 0018:ffffc900035dfe00 EFLAGS: 00010246
Dec  2 17:13:53 manet kernel: RAX: ffff888467290000 RBX: ffff88846c638400 RCX: 0000000000000048
Dec  2 17:13:53 manet kernel: RDX: 0000000000000048 RSI: 00000000f942e000 RDI: 0000000c00000001
Dec  2 17:13:53 manet kernel: RBP: ffff888467611b00 R08: ffff888464e4a3c4 R09: 0000000000000000
Dec  2 17:13:53 manet kernel: R10: ffffc900035dfc88 R11: fefefefefefefeff R12: ffff888865af4428
Dec  2 17:13:53 manet kernel: R13: ffff888466023000 R14: ffff88846c63f000 R15: 0000000000000010
Dec  2 17:13:53 manet kernel: FS:  0000000000000000(0000) GS:ffff88846fa80000(0000) knlGS:0000000000000000
Dec  2 17:13:53 manet kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec  2 17:13:53 manet kernel: CR2: 0000000c00000219 CR3: 0000000002009002 CR4: 00000000001606e0
Dec  2 17:13:53 manet kernel: Call Trace:
Dec  2 17:13:53 manet kernel: __ib_process_cq+0x5c/0x14e [ib_core]
Dec  2 17:13:53 manet kernel: ib_cq_poll_work+0x26/0x70 [ib_core]
Dec  2 17:13:53 manet kernel: process_one_work+0x19d/0x2cd
Dec  2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf
Dec  2 17:13:53 manet kernel: worker_thread+0x1a6/0x25a
Dec  2 17:13:53 manet kernel: ? cancel_delayed_work_sync+0xf/0xf
Dec  2 17:13:53 manet kernel: kthread+0xf4/0xf9
Dec  2 17:13:53 manet kernel: ? kthread_queue_delayed_work+0x74/0x74
Dec  2 17:13:53 manet kernel: ret_from_fork+0x24/0x30

The proximal cause is that this rpcrdma_rep has a rr_rdmabuf that
is still pointing to the old ib_device, which has been freed. The
only way that is possible is if this rpcrdma_rep was not destroyed
by rpcrdma_ia_remove.

Debugging showed that was indeed the case: this rpcrdma_rep was
still in use by a completing RPC at the time of the device removal,
and thus wasn't on the rep free list. So, it was not found by
rpcrdma_reps_destroy().

The fix is to introduce a list of all rpcrdma_reps so that they all
can be found when a device is removed. That list is used to perform
only regbuf DMA unmapping, replacing that call to
rpcrdma_reps_destroy().

Meanwhile, to prevent corruption of this list, I've moved the
destruction of temp rpcrdma_rep objects to rpcrdma_post_recvs().
rpcrdma_xprt_drain() ensures that post_recvs (and thus rep_destroy) is
not invoked while rpcrdma_reps_unmap is walking rb_all_reps, thus
protecting the rb_all_reps list.

Fixes: b0b227f071 ("xprtrdma: Use an llist to manage free rpcrdma_reps")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-14 13:30:24 -05:00
Chuck Lever
13cb886c59 xprtrdma: Fix completion wait during device removal
I've found that on occasion, "rmmod <dev>" will hang while if an NFS
is under load.

Ensure that ri_remove_done is initialized only just before the
transport is woken up to force a close. This avoids the completion
possibly getting initialized again while the CM event handler is
waiting for a wake-up.

Fixes: bebd031866 ("xprtrdma: Support unplugging an HCA from under an NFS mount")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-14 13:30:24 -05:00
Chuck Lever
b32b9ed493 xprtrdma: Fix create_qp crash on device unload
On device re-insertion, the RDMA device driver crashes trying to set
up a new QP:

Nov 27 16:32:06 manet kernel: BUG: kernel NULL pointer dereference, address: 00000000000001c0
Nov 27 16:32:06 manet kernel: #PF: supervisor write access in kernel mode
Nov 27 16:32:06 manet kernel: #PF: error_code(0x0002) - not-present page
Nov 27 16:32:06 manet kernel: PGD 0 P4D 0
Nov 27 16:32:06 manet kernel: Oops: 0002 [#1] SMP
Nov 27 16:32:06 manet kernel: CPU: 1 PID: 345 Comm: kworker/u28:0 Tainted: G        W         5.4.0 #852
Nov 27 16:32:06 manet kernel: Hardware name: Supermicro SYS-6028R-T/X10DRi, BIOS 1.1a 10/16/2015
Nov 27 16:32:06 manet kernel: Workqueue: xprtiod xprt_rdma_connect_worker [rpcrdma]
Nov 27 16:32:06 manet kernel: RIP: 0010:atomic_try_cmpxchg+0x2/0x12
Nov 27 16:32:06 manet kernel: Code: ff ff 48 8b 04 24 5a c3 c6 07 00 0f 1f 40 00 c3 31 c0 48 81 ff 08 09 68 81 72 0c 31 c0 48 81 ff 83 0c 68 81 0f 92 c0 c3 8b 06 <f0> 0f b1 17 0f 94 c2 84 d2 75 02 89 06 88 d0 c3 53 ba 01 00 00 00
Nov 27 16:32:06 manet kernel: RSP: 0018:ffffc900035abbf0 EFLAGS: 00010046
Nov 27 16:32:06 manet kernel: RAX: 0000000000000000 RBX: 00000000000001c0 RCX: 0000000000000000
Nov 27 16:32:06 manet kernel: RDX: 0000000000000001 RSI: ffffc900035abbfc RDI: 00000000000001c0
Nov 27 16:32:06 manet kernel: RBP: ffffc900035abde0 R08: 000000000000000e R09: ffffffffffffc000
Nov 27 16:32:06 manet kernel: R10: 0000000000000000 R11: 000000000002e800 R12: ffff88886169d9f8
Nov 27 16:32:06 manet kernel: R13: ffff88886169d9f4 R14: 0000000000000246 R15: 0000000000000000
Nov 27 16:32:06 manet kernel: FS:  0000000000000000(0000) GS:ffff88846fa40000(0000) knlGS:0000000000000000
Nov 27 16:32:06 manet kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 27 16:32:06 manet kernel: CR2: 00000000000001c0 CR3: 0000000002009006 CR4: 00000000001606e0
Nov 27 16:32:06 manet kernel: Call Trace:
Nov 27 16:32:06 manet kernel: do_raw_spin_lock+0x2f/0x5a
Nov 27 16:32:06 manet kernel: create_qp_common.isra.47+0x856/0xadf [mlx4_ib]
Nov 27 16:32:06 manet kernel: ? slab_post_alloc_hook.isra.60+0xa/0x1a
Nov 27 16:32:06 manet kernel: ? __kmalloc+0x125/0x139
Nov 27 16:32:06 manet kernel: mlx4_ib_create_qp+0x57f/0x972 [mlx4_ib]

The fix is to copy the qp_init_attr struct that was just created by
rpcrdma_ep_create() instead of using the one from the previous
connection instance.

Fixes: 98ef77d1aa ("xprtrdma: Send Queue size grows after a reconnect")
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-01-14 13:30:24 -05:00
Xu Wang
8aaea2b042 xfrm: interface: do not confirm neighbor when do pmtu update
When do IPv6 tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end,
we should not call dst_confirm_neigh() as there is no two-way communication.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-01-14 08:55:38 +01:00
Nicolas Dichtel
f042365dbf xfrm interface: fix packet tx through bpf_redirect()
With an ebpf program that redirects packets through a xfrm interface,
packets are dropped because no dst is attached to skb.

This could also be reproduced with an AF_PACKET socket, with the following
python script (xfrm1 is a xfrm interface):

 import socket
 send_s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, 0)
 # scapy
 # p = IP(src='10.100.0.2', dst='10.200.0.1')/ICMP(type='echo-request')
 # raw(p)
 req = b'E\x00\x00\x1c\x00\x01\x00\x00@\x01e\xb2\nd\x00\x02\n\xc8\x00\x01\x08\x00\xf7\xff\x00\x00\x00\x00'
 send_s.sendto(req, ('xfrm1', 0x800, 0, 0))

It was also not possible to send an ip packet through an AF_PACKET socket
because a LL header was expected. Let's remove those LL header constraints.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-01-14 08:55:38 +01:00
Nicolas Dichtel
95224166a9 vti[6]: fix packet tx through bpf_redirect()
With an ebpf program that redirects packets through a vti[6] interface,
the packets are dropped because no dst is attached.

This could also be reproduced with an AF_PACKET socket, with the following
python script (vti1 is an ip_vti interface):

 import socket
 send_s = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, 0)
 # scapy
 # p = IP(src='10.100.0.2', dst='10.200.0.1')/ICMP(type='echo-request')
 # raw(p)
 req = b'E\x00\x00\x1c\x00\x01\x00\x00@\x01e\xb2\nd\x00\x02\n\xc8\x00\x01\x08\x00\xf7\xff\x00\x00\x00\x00'
 send_s.sendto(req, ('vti1', 0x800, 0, 0))

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-01-14 08:55:38 +01:00
Florian Westphal
212e7f5660 netfilter: arp_tables: init netns pointer in xt_tgdtor_param struct
An earlier commit (1b789577f6,
"netfilter: arp_tables: init netns pointer in xt_tgchk_param struct")
fixed missing net initialization for arptables, but turns out it was
incomplete.  We can get a very similar struct net NULL deref during
error unwinding:

general protection fault: 0000 [#1] PREEMPT SMP KASAN
RIP: 0010:xt_rateest_put+0xa1/0x440 net/netfilter/xt_RATEEST.c:77
 xt_rateest_tg_destroy+0x72/0xa0 net/netfilter/xt_RATEEST.c:175
 cleanup_entry net/ipv4/netfilter/arp_tables.c:509 [inline]
 translate_table+0x11f4/0x1d80 net/ipv4/netfilter/arp_tables.c:587
 do_replace net/ipv4/netfilter/arp_tables.c:981 [inline]
 do_arpt_set_ctl+0x317/0x650 net/ipv4/netfilter/arp_tables.c:1461

Also init the netns pointer in xt_tgdtor_param struct.

Fixes: add6746124 ("netfilter: add struct net * to target parameters")
Reported-by: syzbot+91bdd8eece0f6629ec8b@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-13 19:22:10 +01:00
Cong Wang
c120959387 netfilter: fix a use-after-free in mtype_destroy()
map->members is freed by ip_set_free() right before using it in
mtype_ext_cleanup() again. So we just have to move it down.

Reported-by: syzbot+4c3cc6dbe7259dbf9054@syzkaller.appspotmail.com
Fixes: 40cd63bf33 ("netfilter: ipset: Support extensions which need a per data destroy function")
Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-13 18:53:59 +01:00
Jacob Keller
b0efcae5e1 devlink: correct misspelling of snapshot
The function to obtain a unique snapshot id was mistakenly typo'd as
devlink_region_shapshot_id_get. Fix this typo by renaming the function
and all of its users.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-11 14:30:24 -08:00
Ido Schimmel
4c582234ab devlink: Wait longer before warning about unset port type
The commit cited below causes devlink to emit a warning if a type was
not set on a devlink port for longer than 30 seconds to "prevent
misbehavior of drivers". This proved to be problematic when
unregistering the backing netdev. The flow is always:

devlink_port_type_clear()	// schedules the warning
unregister_netdev()		// blocking
devlink_port_unregister()	// cancels the warning

The call to unregister_netdev() can block for long periods of time for
various reasons: RTNL lock is contended, large amounts of configuration
to unroll following dismantle of the netdev, etc. This results in
devlink emitting a warning despite the driver behaving correctly.

In emulated environments (of future hardware) which are usually very
slow, the warning can also be emitted during port creation as more than
30 seconds can pass between the time the devlink port is registered and
when its type is set.

In addition, syzbot has hit this warning [1] 1974 times since 07/11/19
without being able to produce a reproducer. Probably because
reproduction depends on the load or other bugs (e.g., RTNL not being
released).

To prevent bogus warnings, increase the timeout to 1 hour.

[1] https://syzkaller.appspot.com/bug?id=e99b59e9c024a666c9f7450dc162a4b74d09d9cb

Fixes: 136bf27fc0 ("devlink: add warning in case driver does not set port type")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reported-by: syzbot+b0a18ed7b08b735d2f41@syzkaller.appspotmail.com
Reported-by: Alex Veber <alexve@mellanox.com>
Tested-by: Alex Veber <alexve@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-10 16:45:40 -08:00
David Ahern
9827c0634e ipv4: Detect rollover in specific fib table dump
Sven-Haegar reported looping on fib dumps when 255.255.255.255 route has
been added to a table. The looping is caused by the key rolling over from
FFFFFFFF to 0. When dumping a specific table only, we need a means to detect
when the table dump is done. The key and count saved to cb args are both 0
only at the start of the table dump. If key is 0 and count > 0, then we are
in the rollover case. Detect and return to avoid looping.

This only affects dumps of a specific table; for dumps of all tables
(the case prior to the change in the Fixes tag) inet_dump_fib moved
the entry counter to the next table and reset the cb args used by
fib_table_dump and fn_trie_dump_leaf, so the rollover ffffffff back
to 0 did not cause looping with the dumps.

Fixes: effe679266 ("net: Enable kernel side filtering of route dumps")
Reported-by: Sven-Haegar Koch <haegar@sdinet.de>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-10 11:36:36 -08:00
Jakub Kicinski
db885e66d2 net/tls: fix async operation
Mallesham reports the TLS with async accelerator was broken by
commit d10523d0b3 ("net/tls: free the record on encryption error")
because encryption can return -EINPROGRESS in such setups, which
should not be treated as an error.

The error is also present in the BPF path (likely copied from there).

Reported-by: Mallesham Jatharakonda <mallesham.jatharakonda@oneconvergence.com>
Fixes: d3b18ad31f ("tls: add bpf support to sk_msg handling")
Fixes: d10523d0b3 ("net/tls: free the record on encryption error")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-10 11:20:46 -08:00
Jakub Kicinski
5c5d22a750 net/tls: avoid spurious decryption error with HW resync
When device loses sync mid way through a record - kernel
has to re-encrypt the part of the record which the device
already decrypted to be able to decrypt and authenticate
the record in its entirety.

The re-encryption piggy backs on the decryption routine,
but obviously because the partially decrypted record can't
be authenticated crypto API returns an error which is then
ignored by tls_device_reencrypt().

Commit 5c5ec66858 ("net/tls: add TlsDecryptError stat")
added a statistic to count decryption errors, this statistic
can't be incremented when we see the expected re-encryption
error. Move the inc to the caller.

Reported-and-tested-by: David Beckett <david.beckett@netronome.com>
Fixes: 5c5ec66858 ("net/tls: add TlsDecryptError stat")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-10 11:18:15 -08:00
Lorenz Bauer
2e012c7482 net: bpf: Don't leak time wait and request sockets
It's possible to leak time wait and request sockets via the following
BPF pseudo code:
 
  sk = bpf_skc_lookup_tcp(...)
  if (sk)
    bpf_sk_release(sk)

If sk->sk_state is TCP_NEW_SYN_RECV or TCP_TIME_WAIT the refcount taken
by bpf_skc_lookup_tcp is not undone by bpf_sk_release. This is because
sk_flags is re-used for other data in both kinds of sockets. The check

  !sock_flag(sk, SOCK_RCU_FREE)

therefore returns a bogus result. Check that sk_flags is valid by calling
sk_fullsock. Skip checking SOCK_RCU_FREE if we already know that sk is
not a full socket.

Fixes: edbf8c01de ("bpf: add skc_lookup_tcp helper")
Fixes: f7355a6c04 ("bpf: Check sk_fullsock() before returning from bpf_sk_lookup()")
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200110132336.26099-1-lmb@cloudflare.com
2020-01-10 10:42:23 -08:00
Paolo Abeni
8b69a80381 skb: add helpers to allocate ext independently from sk_buff
Currently we can allocate the extension only after the skb,
this change allows the user to do the opposite, will simplify
allocation failure handling from MPTCP.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:42 -08:00
Mat Martineau
9cfcca2389 tcp: Check for filled TCP option space before SACK
Update the SACK check to work with zero option space available, a case
that's possible with MPTCP but not MD5+TS. Maintained only one
conditional branch for insufficient SACK space.

v1 -> v2:
- Moves the check inside the SACK branch by taking recent SACK fix:

    9424e2e7ad (tcp: md5: fix potential overestimation of TCP option space)

  in to account, but modifies it to work in MPTCP scenarios beyond the
  MD5+TS corner case.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:41 -08:00
Mat Martineau
35b2c32116 tcp: Export TCP functions and ops struct
MPTCP will make use of tcp_send_mss() and tcp_push() when sending
data to specific TCP subflows.

tcp_request_sock_ipvX_ops and ipvX_specific will be referenced
during TCP subflow creation.

Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:41 -08:00
Mat Martineau
8571248411 tcp: coalesce/collapse must respect MPTCP extensions
Coalesce and collapse of packets carrying MPTCP extensions is allowed
when the newer packet has no extension or the extensions carried by both
packets are equal.

This allows merging of TSO packet trains and even cross-TSO packets, and
does not require any additional action when moving data into existing
SKBs.

v3 -> v4:
 - allow collapsing, under mptcp_skb_can_collapse() constraint

v5 -> v6:
 - clarify MPTCP skb extensions must always be cleared at allocation
   time

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:41 -08:00
Mat Martineau
3ee17bc78e mptcp: Add MPTCP to skb extensions
Add enum value for MPTCP and update config dependencies

v5 -> v6:
 - fixed '__unused' field size

Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:41 -08:00
Mat Martineau
1323059301 tcp, ulp: Add clone operation to tcp_ulp_ops
If ULP is used on a listening socket, icsk_ulp_ops and icsk_ulp_data are
copied when the listener is cloned. Sometimes the clone is immediately
deleted, which will invoke the release op on the clone and likely
corrupt the listening socket's icsk_ulp_data.

The clone operation is invoked immediately after the clone is copied and
gives the ULP type an opportunity to set up the clone socket and its
icsk_ulp_data.

The MPTCP ULP clone will silently fallback to plain TCP on allocation
failure, so 'clone()' does not need to return an error code.

v6 -> v7:
 - move and rename ulp clone helper to make it inline-friendly
v5 -> v6:
 - clarified MPTCP clone usage in commit message

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:41 -08:00
Mat Martineau
bf9765145b sock: Make sk_protocol a 16-bit value
Match the 16-bit width of skbuff->protocol. Fills an 8-bit hole so
sizeof(struct sock) does not change.

Also take care of BPF field access for sk_type/sk_protocol. Both of them
are now outside the bitfield, so we can use load instructions without
further shifting/masking.

v5 -> v6:
 - update eBPF accessors, too (Intel's kbuild test robot)
v2 -> v3:
 - keep 'sk_type' 2 bytes aligned (Eric)
v1 -> v2:
 - preserve sk_pacing_shift as bit field (Eric)

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: bpf@vger.kernel.org
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:41 -08:00
Mat Martineau
e9cdced78d net: Make sock protocol value checks more specific
SK_PROTOCOL_MAX is only used in two places, for DECNet and AX.25. The
limits have more to do with the those protocol definitions than they do
with the data type of sk_protocol, so remove SK_PROTOCOL_MAX and use
U8_MAX directly.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:40 -08:00
Martin Schiller
e21dba7a4d net/x25: fix nonblocking connect
This patch fixes 2 issues in x25_connect():

1. It makes absolutely no sense to reset the neighbour and the
connection state after a (successful) nonblocking call of x25_connect.
This prevents any connection from being established, since the response
(call accept) cannot be processed.

2. Any further calls to x25_connect() while a call is pending should
simply return, instead of creating new Call Request (on different
logical channels).

This patch should also fix the "KASAN: null-ptr-deref Write in
x25_connect" and "BUG: unable to handle kernel NULL pointer dereference
in x25_connect" bugs reported by syzbot.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Reported-by: syzbot+429c200ffc8772bfe070@syzkaller.appspotmail.com
Reported-by: syzbot+eec0c87f31a7c3b66f7b@syzkaller.appspotmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:39:33 -08:00
Li RongQing
6b3acfc3cc flow_dissector: fix document for skb_flow_get_icmp_tci
using correct input parameter name to fix the below warning:

net/core/flow_dissector.c:242: warning: Function parameter or member 'thoff' not described in 'skb_flow_get_icmp_tci'
net/core/flow_dissector.c:242: warning: Excess function parameter 'toff' description in 'skb_flow_get_icmp_tci'

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:37:00 -08:00
Vijay Khemka
5e0fcc16e5 net/ncsi: Support for multi host mellanox card
Multi host Mellanox cards require MAC affinity to be set
before receiving any config commands. All config commands
should also have unicast address for source address in
command header.

Adding GMA and SMAF(Set Mac Affinity) for Mellanox card
and call these in channel probe state machine if it is
defined in device tree.

Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:36:22 -08:00
Lingpeng Chen
e7a5f1f1cd bpf/sockmap: Read psock ingress_msg before sk_receive_queue
Right now in tcp_bpf_recvmsg, sock read data first from sk_receive_queue
if not empty than psock->ingress_msg otherwise. If a FIN packet arrives
and there's also some data in psock->ingress_msg, the data in
psock->ingress_msg will be purged. It is always happen when request to a
HTTP1.0 server like python SimpleHTTPServer since the server send FIN
packet after data is sent out.

Fixes: 604326b41a ("bpf, sockmap: convert to generic sk_msg interface")
Reported-by: Arika Chen <eaglesora@gmail.com>
Suggested-by: Arika Chen <eaglesora@gmail.com>
Signed-off-by: Lingpeng Chen <forrest0579@gmail.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200109014833.18951-1-forrest0579@gmail.com
2020-01-09 23:13:48 +01:00
David S. Miller
a2d6d7ae59 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The ungrafting from PRIO bug fixes in net, when merged into net-next,
merge cleanly but create a build failure.  The resolution used here is
from Petr Machata.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 12:13:43 -08:00
Martin KaFai Lau
206057fe02 bpf: Add BPF_FUNC_tcp_send_ack helper
Add a helper to send out a tcp-ack.  It will be used in the later
bpf_dctcp implementation that requires to send out an ack
when the CE state changed.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109004551.3900448-1-kafai@fb.com
2020-01-09 08:46:18 -08:00
Martin KaFai Lau
0baf26b0fc bpf: tcp: Support tcp_congestion_ops in bpf
This patch makes "struct tcp_congestion_ops" to be the first user
of BPF STRUCT_OPS.  It allows implementing a tcp_congestion_ops
in bpf.

The BPF implemented tcp_congestion_ops can be used like
regular kernel tcp-cc through sysctl and setsockopt.  e.g.
[root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
net.ipv4.tcp_congestion_control = bpf_cubic

There has been attempt to move the TCP CC to the user space
(e.g. CCP in TCP).   The common arguments are faster turn around,
get away from long-tail kernel versions in production...etc,
which are legit points.

BPF has been the continuous effort to join both kernel and
userspace upsides together (e.g. XDP to gain the performance
advantage without bypassing the kernel).  The recent BPF
advancements (in particular BTF-aware verifier, BPF trampoline,
BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc)
possible in BPF.  It allows a faster turnaround for testing algorithm
in the production while leveraging the existing (and continue growing)
BPF feature/framework instead of building one specifically for
userspace TCP CC.

This patch allows write access to a few fields in tcp-sock
(in bpf_tcp_ca_btf_struct_access()).

The optional "get_info" is unsupported now.  It can be added
later.  One possible way is to output the info with a btf-id
to describe the content.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003508.3856115-1-kafai@fb.com
2020-01-09 08:46:18 -08:00
Dan Carpenter
ac9c41d5a0 ethtool: potential NULL dereference in strset_prepare_data()
Smatch complains that the NULL checking isn't done consistently:

    net/ethtool/strset.c:253 strset_prepare_data()
    error: we previously assumed 'dev' could be null (see line 233)

It looks like there is a missing return on this path.

Fixes: 71921690f9 ("ethtool: provide string sets with STRSET_GET request")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 16:03:35 -08:00
Dan Carpenter
d97772dbd7 ethtool: fix ->reply_size() error handling
The "ret < 0" comparison is never true because "ret" is still zero.

Fixes: 728480f124 ("ethtool: default handlers for GET requests")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 16:03:35 -08:00
Dan Carpenter
a6dd04807c ethtool: fix a memory leak in ethnl_default_start()
If ethnl_default_parse() fails then we need to free a couple
memory allocations before returning.

Fixes: 728480f124 ("ethtool: default handlers for GET requests")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 16:03:35 -08:00
Florian Fainelli
4d776482ec net: dsa: Get information about stacked DSA protocol
It is possible to stack multiple DSA switches in a way that they are not
part of the tree (disjoint) but the DSA master of a switch is a DSA
slave of another. When that happens switch drivers may have to know this
is the case so as to determine whether their tagging protocol has a
remove chance of working.

This is useful for specific switch drivers such as b53 where devices
have been known to be stacked in the wild without the Broadcom tag
protocol supporting that feature. This allows b53 to continue supporting
those devices by forcing the disabling of Broadcom tags on the outermost
switches if necessary.

The get_tag_protocol() function is therefore updated to gain an
additional enum dsa_tag_protocol argument which denotes the current
tagging protocol used by the DSA master we are attached to, else
DSA_TAG_PROTO_NONE for the top of the dsa_switch_tree.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 16:01:13 -08:00
Tuong Lien
9546a0b7ce tipc: fix wrong connect() return code
The current 'tipc_wait_for_connect()' function does a wait-loop for the
condition 'sk->sk_state != TIPC_CONNECTING' to conclude if the socket
connecting has done. However, when the condition is met, it returns '0'
even in the case the connecting is actually failed, the socket state is
set to 'TIPC_DISCONNECTING' (e.g. when the server socket has closed..).
This results in a wrong return code for the 'connect()' call from user,
making it believe that the connection is established and go ahead with
building, sending a message, etc. but finally failed e.g. '-EPIPE'.

This commit fixes the issue by changing the wait condition to the
'tipc_sk_connected(sk)', so the function will return '0' only when the
connection is really established. Otherwise, either the socket 'sk_err'
if any or '-ETIMEDOUT'/'-EINTR' will be returned correspondingly.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 15:57:35 -08:00
Tuong Lien
49afb806cb tipc: fix link overflow issue at socket shutdown
When a socket is suddenly shutdown or released, it will reject all the
unreceived messages in its receive queue. This applies to a connected
socket too, whereas there is only one 'FIN' message required to be sent
back to its peer in this case.

In case there are many messages in the queue and/or some connections
with such messages are shutdown at the same time, the link layer will
easily get overflowed at the 'TIPC_SYSTEM_IMPORTANCE' backlog level
because of the message rejections. As a result, the link will be taken
down. Moreover, immediately when the link is re-established, the socket
layer can continue to reject the messages and the same issue happens...

The commit refactors the '__tipc_shutdown()' function to only send one
'FIN' in the situation mentioned above. For the connectionless case, it
is unavoidable but usually there is no rejections for such socket
messages because they are 'dest-droppable' by default.

In addition, the new code makes the other socket states clear
(e.g.'TIPC_LISTEN') and treats as a separate case to avoid misbehaving.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 15:57:07 -08:00
Vikas Gupta
97ff3bd37f devlink: add devink notification when reporter update health state
add a devlink notification when reporter update the health
state.

Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 15:47:50 -08:00
Vikas Gupta
6181e5cb75 devlink: add support for reporter recovery completion
It is possible that a reporter recovery completion do not finish
successfully when recovery is triggered via
devlink_health_reporter_recover as recovery could be processed in
different context. In such scenario an error is returned by driver when
recover hook is invoked and successful recovery completion is
intimated later.
Expose devlink recover done API to update recovery stats.

Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 15:47:50 -08:00
David S. Miller
b73a65610b Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Missing netns context in arp_tables, from Florian Westphal.

2) Underflow in flowtable reference counter, from wenxu.

3) Fix incorrect ethernet destination address in flowtable offload,
   from wenxu.

4) Check for status of neighbour entry, from wenxu.

5) Fix NAT port mangling, from wenxu.

6) Unbind callbacks from destroy path to cleanup hardware properly
   on flowtable removal.

7) Fix missing casting statistics timestamp, add nf_flowtable_time_stamp
   and use it.

8) NULL pointer exception when timeout argument is null in conntrack
   dccp and sctp protocol helpers, from Florian Westphal.

9) Possible nul-dereference in ipset with IPSET_ATTR_LINENO, also from
   Florian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 15:22:41 -08:00
Arnd Bergmann
542d3065f2 socket: fix unused-function warning
When procfs is disabled, the fdinfo code causes a harmless
warning:

net/socket.c:1000:13: error: 'sock_show_fdinfo' defined but not used [-Werror=unused-function]
 static void sock_show_fdinfo(struct seq_file *m, struct file *f)

Move the function definition up so we can use a single #ifdef
around it.

Fixes: b4653342b1 ("net: Allow to show socket-specific information in /proc/[pid]/fdinfo/[fd]")
Suggested-by: Al Viro <viro@zeniv.linux.org.uk>
Acked-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 15:02:21 -08:00
Florian Westphal
22dad713b8 netfilter: ipset: avoid null deref when IPSET_ATTR_LINENO is present
The set uadt functions assume lineno is never NULL, but it is in
case of ip_set_utest().

syzkaller managed to generate a netlink message that calls this with
LINENO attr present:

general protection fault: 0000 [#1] PREEMPT SMP KASAN
RIP: 0010:hash_mac4_uadt+0x1bc/0x470 net/netfilter/ipset/ip_set_hash_mac.c:104
Call Trace:
 ip_set_utest+0x55b/0x890 net/netfilter/ipset/ip_set_core.c:1867
 nfnetlink_rcv_msg+0xcf2/0xfb0 net/netfilter/nfnetlink.c:229
 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477
 nfnetlink_rcv+0x1ba/0x460 net/netfilter/nfnetlink.c:563

pass a dummy lineno storage, its easier than patching all set
implementations.

This seems to be a day-0 bug.

Cc: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Reported-by: syzbot+34bd2369d38707f3f4a7@syzkaller.appspotmail.com
Fixes: a7b4f989a6 ("netfilter: ipset: IP set core support")
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Jozsef Kadlecsik <kadlec@blackhole.kfki.hu>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-08 23:31:46 +01:00
Florian Westphal
1d9a7acd3d netfilter: conntrack: dccp, sctp: handle null timeout argument
The timeout pointer can be NULL which means we should modify the
per-nets timeout instead.

All do this, except sctp and dccp which instead give:

general protection fault: 0000 [#1] PREEMPT SMP KASAN
net/netfilter/nf_conntrack_proto_dccp.c:682
 ctnl_timeout_parse_policy+0x150/0x1d0 net/netfilter/nfnetlink_cttimeout.c:67
 cttimeout_default_set+0x150/0x1c0 net/netfilter/nfnetlink_cttimeout.c:368
 nfnetlink_rcv_msg+0xcf2/0xfb0 net/netfilter/nfnetlink.c:229
 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477

Reported-by: syzbot+46a4ad33f345d1dd346e@syzkaller.appspotmail.com
Fixes: c779e84960 ("netfilter: conntrack: remove get_timeout() indirection")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-08 23:31:22 +01:00
Arnd Bergmann
b9ae512736 hsr: fix dummy hsr_debugfs_rename() declaration
The hsr_debugfs_rename prototype got an extra 'void' that needs to
be removed again:

In file included from /git/arm-soc/net/hsr/hsr_main.c:12:
net/hsr/hsr_main.h:194:20: error: two or more data types in declaration specifiers
 static inline void void hsr_debugfs_rename(struct net_device *dev)

Fixes: 4c2d5e33dc ("hsr: rename debugfs file when interface name is changed")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 13:09:54 -08:00
Vijay Khemka
7c7b58d46b net/ncsi: Send device address as source address
After receiving device mac address from device, send this as
a source address for further commands instead of broadcast
address.

This will help in multi host NIC cards.

Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 13:09:18 -08:00
Petr Machata
240ce7f642 net: sch_prio: When ungrafting, replace with FIFO
When a child Qdisc is removed from one of the PRIO Qdisc's bands, it is
replaced unconditionally by a NOOP qdisc. As a result, any traffic hitting
that band gets dropped. That is incorrect--no Qdisc was explicitly added
when PRIO was created, and after removal, none should have to be added
either.

Fix PRIO by first attempting to create a default Qdisc and only falling
back to noop when that fails. This pattern of attempting to create an
invisible FIFO, using NOOP only as a fallback, is also seen in other
Qdiscs.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 12:45:53 -08:00
Colin Ian King
788d10c02f Bluetooth: remove redundant assignment to variable icid
Variable icid is being rc is assigned with a value that is never
read. The assignment is redundant and can be removed.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-08 21:44:22 +01:00
Eric Dumazet
d9e15a2733 pkt_sched: fq: do not accept silly TCA_FQ_QUANTUM
As diagnosed by Florian :

If TCA_FQ_QUANTUM is set to 0x80000000, fq_deueue()
can loop forever in :

if (f->credit <= 0) {
  f->credit += q->quantum;
  goto begin;
}

... because f->credit is either 0 or -2147483648.

Let's limit TCA_FQ_QUANTUM to no more than 1 << 20 :
This max value should limit risks of breaking user setups
while fixing this bug.

Fixes: afe4fd0624 ("pkt_sched: fq: Fair Queue packet scheduler")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Diagnosed-by: Florian Westphal <fw@strlen.de>
Reported-by: syzbot+dc9071cc5a85950bdfce@syzkaller.appspotmail.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 12:40:47 -08:00
Masahiro Yamada
b969fee12b tipc: remove meaningless assignment in Makefile
There is no module named tipc_diag.

The assignment to tipc_diag-y has no effect.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 12:38:54 -08:00
Masahiro Yamada
ea04b445a2 tipc: do not add socket.o to tipc-y twice
net/tipc/Makefile adds socket.o twice.

tipc-y	+= addr.o bcast.o bearer.o \
           core.o link.o discover.o msg.o  \
           name_distr.o  subscr.o monitor.o name_table.o net.o  \
           netlink.o netlink_compat.o node.o socket.o eth_media.o \
                                             ^^^^^^^^
           topsrv.o socket.o group.o trace.o
                    ^^^^^^^^

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 12:38:54 -08:00
Colin Ian King
53ebeca24a net/rose: remove redundant assignment to variable failed
The variable failed is being assigned a value that is never read, the
following goto statement jumps to the end of the function and variable
failed is not referenced at all.  Remove the redundant assignment.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-07 13:52:28 -08:00
Eric Dumazet
eb8ef2a3c5 vlan: vlan_changelink() should propagate errors
Both vlan_dev_change_flags() and vlan_dev_set_egress_priority()
can return an error. vlan_changelink() should not ignore them.

Fixes: 07b5b17e15 ("[VLAN]: Use rtnl_link API")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-07 13:35:14 -08:00
Eric Dumazet
9bbd917e0b vlan: fix memory leak in vlan_dev_set_egress_priority
There are few cases where the ndo_uninit() handler might be not
called if an error happens while device is initialized.

Since vlan_newlink() calls vlan_changelink() before
trying to register the netdevice, we need to make sure
vlan_dev_uninit() has been called at least once,
or we might leak allocated memory.

BUG: memory leak
unreferenced object 0xffff888122a206c0 (size 32):
  comm "syz-executor511", pid 7124, jiffies 4294950399 (age 32.240s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 61 73 00 00 00 00 00 00 00 00  ......as........
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<000000000eb3bb85>] kmemleak_alloc_recursive include/linux/kmemleak.h:43 [inline]
    [<000000000eb3bb85>] slab_post_alloc_hook mm/slab.h:586 [inline]
    [<000000000eb3bb85>] slab_alloc mm/slab.c:3320 [inline]
    [<000000000eb3bb85>] kmem_cache_alloc_trace+0x145/0x2c0 mm/slab.c:3549
    [<000000007b99f620>] kmalloc include/linux/slab.h:556 [inline]
    [<000000007b99f620>] vlan_dev_set_egress_priority+0xcc/0x150 net/8021q/vlan_dev.c:194
    [<000000007b0cb745>] vlan_changelink+0xd6/0x140 net/8021q/vlan_netlink.c:126
    [<0000000065aba83a>] vlan_newlink+0x135/0x200 net/8021q/vlan_netlink.c:181
    [<00000000fb5dd7a2>] __rtnl_newlink+0x89a/0xb80 net/core/rtnetlink.c:3305
    [<00000000ae4273a1>] rtnl_newlink+0x4e/0x80 net/core/rtnetlink.c:3363
    [<00000000decab39f>] rtnetlink_rcv_msg+0x178/0x4b0 net/core/rtnetlink.c:5424
    [<00000000accba4ee>] netlink_rcv_skb+0x61/0x170 net/netlink/af_netlink.c:2477
    [<00000000319fe20f>] rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5442
    [<00000000d51938dc>] netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
    [<00000000d51938dc>] netlink_unicast+0x223/0x310 net/netlink/af_netlink.c:1328
    [<00000000e539ac79>] netlink_sendmsg+0x2c0/0x570 net/netlink/af_netlink.c:1917
    [<000000006250c27e>] sock_sendmsg_nosec net/socket.c:639 [inline]
    [<000000006250c27e>] sock_sendmsg+0x54/0x70 net/socket.c:659
    [<00000000e2a156d1>] ____sys_sendmsg+0x2d0/0x300 net/socket.c:2330
    [<000000008c87466e>] ___sys_sendmsg+0x8a/0xd0 net/socket.c:2384
    [<00000000110e3054>] __sys_sendmsg+0x80/0xf0 net/socket.c:2417
    [<00000000d71077c8>] __do_sys_sendmsg net/socket.c:2426 [inline]
    [<00000000d71077c8>] __se_sys_sendmsg net/socket.c:2424 [inline]
    [<00000000d71077c8>] __x64_sys_sendmsg+0x23/0x30 net/socket.c:2424

Fixe: 07b5b17e15 ("[VLAN]: Use rtnl_link API")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-07 13:35:14 -08:00
Xin Long
be7a772920 sctp: free cmd->obj.chunk for the unprocessed SCTP_CMD_REPLY
This patch is to fix a memleak caused by no place to free cmd->obj.chunk
for the unprocessed SCTP_CMD_REPLY. This issue occurs when failing to
process a cmd while there're still SCTP_CMD_REPLY cmds on the cmd seq
with an allocated chunk in cmd->obj.chunk.

So fix it by freeing cmd->obj.chunk for each SCTP_CMD_REPLY cmd left on
the cmd seq when any cmd returns error. While at it, also remove 'nomem'
label.

Reported-by: syzbot+107c4aff5f392bf1517f@syzkaller.appspotmail.com
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06 13:28:37 -08:00
Ying Xue
a7869e5f91 tipc: eliminate KMSAN: uninit-value in __tipc_nl_compat_dumpit error
syzbot found the following crash on:
=====================================================
BUG: KMSAN: uninit-value in __nlmsg_parse include/net/netlink.h:661 [inline]
BUG: KMSAN: uninit-value in nlmsg_parse_deprecated
include/net/netlink.h:706 [inline]
BUG: KMSAN: uninit-value in __tipc_nl_compat_dumpit+0x553/0x11e0
net/tipc/netlink_compat.c:215
CPU: 0 PID: 12425 Comm: syz-executor062 Not tainted 5.5.0-rc1-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:77 [inline]
  dump_stack+0x1c9/0x220 lib/dump_stack.c:118
  kmsan_report+0x128/0x220 mm/kmsan/kmsan_report.c:108
  __msan_warning+0x57/0xa0 mm/kmsan/kmsan_instr.c:245
  __nlmsg_parse include/net/netlink.h:661 [inline]
  nlmsg_parse_deprecated include/net/netlink.h:706 [inline]
  __tipc_nl_compat_dumpit+0x553/0x11e0 net/tipc/netlink_compat.c:215
  tipc_nl_compat_dumpit+0x761/0x910 net/tipc/netlink_compat.c:308
  tipc_nl_compat_handle net/tipc/netlink_compat.c:1252 [inline]
  tipc_nl_compat_recv+0x12e9/0x2870 net/tipc/netlink_compat.c:1311
  genl_family_rcv_msg_doit net/netlink/genetlink.c:672 [inline]
  genl_family_rcv_msg net/netlink/genetlink.c:717 [inline]
  genl_rcv_msg+0x1dd0/0x23a0 net/netlink/genetlink.c:734
  netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2477
  genl_rcv+0x63/0x80 net/netlink/genetlink.c:745
  netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
  netlink_unicast+0xfa0/0x1100 net/netlink/af_netlink.c:1328
  netlink_sendmsg+0x11f0/0x1480 net/netlink/af_netlink.c:1917
  sock_sendmsg_nosec net/socket.c:639 [inline]
  sock_sendmsg net/socket.c:659 [inline]
  ____sys_sendmsg+0x1362/0x13f0 net/socket.c:2330
  ___sys_sendmsg net/socket.c:2384 [inline]
  __sys_sendmsg+0x4f0/0x5e0 net/socket.c:2417
  __do_sys_sendmsg net/socket.c:2426 [inline]
  __se_sys_sendmsg+0x97/0xb0 net/socket.c:2424
  __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2424
  do_syscall_64+0xb6/0x160 arch/x86/entry/common.c:295
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x444179
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 0f 83 1b d8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffd2d6409c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000004002e0 RCX: 0000000000444179
RDX: 0000000000000000 RSI: 0000000020000140 RDI: 0000000000000003
RBP: 00000000006ce018 R08: 0000000000000000 R09: 00000000004002e0
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401e20
R13: 0000000000401eb0 R14: 0000000000000000 R15: 0000000000000000

Uninit was created at:
  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:149 [inline]
  kmsan_internal_poison_shadow+0x5c/0x110 mm/kmsan/kmsan.c:132
  kmsan_slab_alloc+0x8a/0xe0 mm/kmsan/kmsan_hooks.c:86
  slab_alloc_node mm/slub.c:2774 [inline]
  __kmalloc_node_track_caller+0xe47/0x11f0 mm/slub.c:4382
  __kmalloc_reserve net/core/skbuff.c:141 [inline]
  __alloc_skb+0x309/0xa50 net/core/skbuff.c:209
  alloc_skb include/linux/skbuff.h:1049 [inline]
  nlmsg_new include/net/netlink.h:888 [inline]
  tipc_nl_compat_dumpit+0x6e4/0x910 net/tipc/netlink_compat.c:301
  tipc_nl_compat_handle net/tipc/netlink_compat.c:1252 [inline]
  tipc_nl_compat_recv+0x12e9/0x2870 net/tipc/netlink_compat.c:1311
  genl_family_rcv_msg_doit net/netlink/genetlink.c:672 [inline]
  genl_family_rcv_msg net/netlink/genetlink.c:717 [inline]
  genl_rcv_msg+0x1dd0/0x23a0 net/netlink/genetlink.c:734
  netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2477
  genl_rcv+0x63/0x80 net/netlink/genetlink.c:745
  netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
  netlink_unicast+0xfa0/0x1100 net/netlink/af_netlink.c:1328
  netlink_sendmsg+0x11f0/0x1480 net/netlink/af_netlink.c:1917
  sock_sendmsg_nosec net/socket.c:639 [inline]
  sock_sendmsg net/socket.c:659 [inline]
  ____sys_sendmsg+0x1362/0x13f0 net/socket.c:2330
  ___sys_sendmsg net/socket.c:2384 [inline]
  __sys_sendmsg+0x4f0/0x5e0 net/socket.c:2417
  __do_sys_sendmsg net/socket.c:2426 [inline]
  __se_sys_sendmsg+0x97/0xb0 net/socket.c:2424
  __x64_sys_sendmsg+0x4a/0x70 net/socket.c:2424
  do_syscall_64+0xb6/0x160 arch/x86/entry/common.c:295
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
=====================================================

The complaint above occurred because the memory region pointed by attrbuf
variable was not initialized. To eliminate this warning, we use kcalloc()
rather than kmalloc_array() to allocate memory for attrbuf.

Reported-by: syzbot+b1fd2bf2c89d8407e15f@syzkaller.appspotmail.com
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-06 13:24:31 -08:00
Pablo Neira Ayuso
fb46f1b780 netfilter: flowtable: add nf_flowtable_time_stamp
This patch adds nf_flowtable_time_stamp and updates the existing code to
use it.

This patch is also implicitly fixing up hardware statistic fetching via
nf_flow_offload_stats() where casting to u32 is missing. Use
nf_flow_timeout_delta() to fix this.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: wenxu <wenxu@ucloud.cn>
2020-01-06 10:30:46 +01:00
Vladimir Oltean
787cac3f5a net: dsa: Pass pcs_poll flag from driver to PHYLINK
The DSA drivers that implement .phylink_mac_link_state should normally
register an interrupt for the PCS, from which they should call
phylink_mac_change(). However not all switches implement this, and those
who don't should set this flag in dsa_switch in the .setup callback, so
that PHYLINK will poll for a few ms until the in-band AN link timer
expires and the PCS state settles.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05 23:22:32 -08:00
Vladimir Oltean
2821d50fc0 net: dsa: tag_sja1105: Slightly improve the Xmas tree in sja1105_xmit
This is a cosmetic patch that makes the dp, tx_vid, queue_mapping and
pcp local variable definitions a bit closer in length, so they don't
look like an eyesore as much.

The 'ds' variable is not used otherwise, except for ds->dp.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05 15:13:13 -08:00
Vladimir Oltean
a68578c20a net: dsa: Make deferred_xmit private to sja1105
There are 3 things that are wrong with the DSA deferred xmit mechanism:

1. Its introduction has made the DSA hotpath ever so slightly more
   inefficient for everybody, since DSA_SKB_CB(skb)->deferred_xmit needs
   to be initialized to false for every transmitted frame, in order to
   figure out whether the driver requested deferral or not (a very rare
   occasion, rare even for the only driver that does use this mechanism:
   sja1105). That was necessary to avoid kfree_skb from freeing the skb.

2. Because L2 PTP is a link-local protocol like STP, it requires
   management routes and deferred xmit with this switch. But as opposed
   to STP, the deferred work mechanism needs to schedule the packet
   rather quickly for the TX timstamp to be collected in time and sent
   to user space. But there is no provision for controlling the
   scheduling priority of this deferred xmit workqueue. Too bad this is
   a rather specific requirement for a feature that nobody else uses
   (more below).

3. Perhaps most importantly, it makes the DSA core adhere a bit too
   much to the NXP company-wide policy "Innovate Where It Doesn't
   Matter". The sja1105 is probably the only DSA switch that requires
   some frames sent from the CPU to be routed to the slave port via an
   out-of-band configuration (register write) rather than in-band (DSA
   tag). And there are indeed very good reasons to not want to do that:
   if that out-of-band register is at the other end of a slow bus such
   as SPI, then you limit that Ethernet flow's throughput to effectively
   the throughput of the SPI bus. So hardware vendors should definitely
   not be encouraged to design this way. We do _not_ want more
   widespread use of this mechanism.

Luckily we have a solution for each of the 3 issues:

For 1, we can just remove that variable in the skb->cb and counteract
the effect of kfree_skb with skb_get, much to the same effect. The
advantage, of course, being that anybody who doesn't use deferred xmit
doesn't need to do any extra operation in the hotpath.

For 2, we can create a kernel thread for each port's deferred xmit work.
If the user switch ports are named swp0, swp1, swp2, the kernel threads
will be named swp0_xmit, swp1_xmit, swp2_xmit (there appears to be a 15
character length limit on kernel thread names). With this, the user can
change the scheduling priority with chrt $(pidof swp2_xmit).

For 3, we can actually move the entire implementation to the sja1105
driver.

So this patch deletes the generic implementation from the DSA core and
adds a new one, more adequate to the requirements of PTP TX
timestamping, in sja1105_main.c.

Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05 15:13:13 -08:00
Carl Huang
ce57785bf9 net: qrtr: fix len of skb_put_padto in qrtr_node_enqueue
The len used for skb_put_padto is wrong, it need to add len of hdr.

In qrtr_node_enqueue, local variable size_t len is assign with
skb->len, then skb_push(skb, sizeof(*hdr)) will add skb->len with
sizeof(*hdr), so local variable size_t len is not same with skb->len
after skb_push(skb, sizeof(*hdr)).

Then the purpose of skb_put_padto(skb, ALIGN(len, 4)) is to add add
pad to the end of the skb's data if skb->len is not aligned to 4, but
unfortunately it use len instead of skb->len, at this line, skb->len
is 32 bytes(sizeof(*hdr)) more than len, for example, len is 3 bytes,
then skb->len is 35 bytes(3 + 32), and ALIGN(len, 4) is 4 bytes, so
__skb_put_padto will do nothing after check size(35) < len(4), the
correct value should be 36(sizeof(*hdr) + ALIGN(len, 4) = 32 + 4),
then __skb_put_padto will pass check size(35) < len(36) and add 1 byte
to the end of skb's data, then logic is correct.

function of skb_push:
void *skb_push(struct sk_buff *skb, unsigned int len)
{
	skb->data -= len;
	skb->len  += len;
	if (unlikely(skb->data < skb->head))
		skb_under_panic(skb, len, __builtin_return_address(0));
	return skb->data;
}

function of skb_put_padto
static inline int skb_put_padto(struct sk_buff *skb, unsigned int len)
{
	return __skb_put_padto(skb, len, true);
}

function of __skb_put_padto
static inline int __skb_put_padto(struct sk_buff *skb, unsigned int len,
				  bool free_on_error)
{
	unsigned int size = skb->len;

	if (unlikely(size < len)) {
		len -= size;
		if (__skb_pad(skb, len, free_on_error))
			return -ENOMEM;
		__skb_put(skb, len);
	}
	return 0;
}

Signed-off-by: Carl Huang <cjhuang@codeaurora.org>
Signed-off-by: Wen Gong <wgong@codeaurora.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05 14:46:05 -08:00
Pablo Neira Ayuso
5acab91458 netfilter: nf_tables: unbind callbacks from flowtable destroy path
Callback unbinding needs to be done after nf_flow_table_free(),
otherwise entries are not removed from the hardware.

Update nft_unregister_flowtable_net_hooks() to call
nf_unregister_net_hook() instead since the commit/abort paths do not
deal with the callback unbinding anymore.

Add a comment to nft_flowtable_event() to clarify that
flow_offload_netdev_event() already removes the entries before the
callback unbinding.

Fixes: 8bb69f3b29 ("netfilter: nf_tables: add flowtable offload control plane")
Fixes ff4bf2f42a ("netfilter: nf_tables: add nft_unregister_flowtable_hook()")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: wenxu <wenxu@ucloud.cn>
2020-01-05 10:06:49 +01:00
wenxu
73327d47d2 netfilter: nf_flow_table_offload: fix the nat port mangle.
Shift on 32-bit word to define the port number depends on the flow
direction.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Fixes: 7acd9378dc ("netfilter: nf_flow_table_offload: Correct memcpy size for flow_overload_mangle()")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-05 10:06:34 +01:00
wenxu
f31ad71c44 netfilter: nf_flow_table_offload: check the status of dst_neigh
It is better to get the dst_neigh with neigh->lock and check the
nud_state is VALID. If there is not neigh previous, the lookup will
Create a non NUD_VALID with 00:00:00:00:00:00 mac.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-05 10:06:31 +01:00
wenxu
1b67e50601 netfilter: nf_flow_table_offload: fix incorrect ethernet dst address
Ethernet destination for original traffic takes the source ethernet address
in the reply direction. For reply traffic, this takes the source
ethernet address of the original direction.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-05 10:06:27 +01:00
wenxu
8ca79606cd netfilter: nft_flow_offload: fix underflow in flowtable reference counter
The .deactivate and .activate interfaces already deal with the reference
counter. Otherwise, this results in spurious "Device is busy" errors.

Fixes: a3c90f7a23 ("netfilter: nf_tables: flow offload expression")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-05 10:06:22 +01:00
Luiz Augusto von Dentz
4b6e228e29 Bluetooth: Auto tune if input MTU is set to 0
This enables the code to set the input MTU using the underline link
packet types when set to 0, previously this would likely be rejected by
the remote peer since it would be bellow the minimal of 48 for BR/EDR
or 23 for LE, that way it shall be safe to use 0 without causing any
side effects.

This is convenient for the likes of A2DP transport, see:

https://habr.com/en/post/456182/

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-04 10:51:46 +01:00
Luiz Augusto von Dentz
1efd927d66 Bluetooth: Add support for LE PHY Update Complete event
This handles LE PHY Update Complete event and store both tx_phy and
rx_phy into hci_conn.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-04 10:49:23 +01:00
Marcel Holtmann
657cc64647 Bluetooth: Remove usage of BT_ERR_RATELIMITED macro
The macro is really not needed and can be replaced with either usage of
bt_err_ratelimited or bt_dev_err_ratelimited.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-01-04 10:41:03 +01:00
Alain Michaud
36278a5d4d Bluetooth: Adding a bt_dev_warn_ratelimited macro.
The macro will be used to display rate limited warning messages in the
log.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-04 10:41:03 +01:00
Xu Wang
d2e9d229cf l2tp: Remove redundant BUG_ON() check in l2tp_pernet
Passing NULL to l2tp_pernet causes a crash via BUG_ON.
Dereferencing net in net_generic() also has the same effect.
This patch removes the redundant BUG_ON check on the same parameter.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-03 12:26:07 -08:00
Xu Wang
1f62343110 net: Remove redundant BUG_ON() check in phonet_pernet
Passing NULL to phonet_pernet causes a crash via BUG_ON.
Dereferencing net in net_generic() also has the same effect.
This patch removes the redundant BUG_ON check on the same parameter.

Signed-off-by: Xu Wang <vulab@iscas.ac.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-03 12:25:50 -08:00
Li RongQing
b39c78b2aa net: remove the check argument from __skb_gro_checksum_convert
The argument is always ignored, so remove it.

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-03 12:24:34 -08:00
YueHaibing
4a883ccfba ethtool: remove set but not used variable 'lsettings'
Fixes gcc '-Wunused-but-set-variable' warning:

net/ethtool/linkmodes.c: In function 'ethnl_set_linkmodes':
net/ethtool/linkmodes.c:326:32: warning:
 variable 'lsettings' set but not used [-Wunused-but-set-variable]
  struct ethtool_link_settings *lsettings;
                                ^
It is never used, so remove it.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-03 12:22:12 -08:00
Mao Wenan
d0e8bcafc8 tcp: use REXMIT_NEW instead of magic number
REXMIT_NEW is a macro for "FRTO-style
transmit of unsent/new packets", this patch
makes it more readable.

Signed-off-by: Mao Wenan <maowenan@huawei.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 16:37:23 -08:00
Wen Yang
68aab823c2 sch_cake: avoid possible divide by zero in cake_enqueue()
The variables 'window_interval' is u64 and do_div()
truncates it to 32 bits, which means it can test
non-zero and be truncated to zero for division.
The unit of window_interval is nanoseconds,
so its lower 32-bit is relatively easy to exceed.
Fix this issue by using div64_u64() instead.

Fixes: 7298de9cd7 ("sch_cake: Add ingress mode")
Signed-off-by: Wen Yang <wenyang@linux.alibaba.com>
Cc: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: cake@lists.bufferbloat.net
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 16:34:28 -08:00
David Ahern
6b102db50c net: Add device index to tcp_md5sig
Add support for userspace to specify a device index to limit the scope
of an entry via the TCP_MD5SIG_EXT setsockopt. The existing __tcpm_pad
is renamed to tcpm_ifindex and the new field is only checked if the new
TCP_MD5SIG_FLAG_IFINDEX is set in tcpm_flags. For now, the device index
must point to an L3 master device (e.g., VRF). The API and error
handling are setup to allow the constraint to be relaxed in the future
to any device index.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
David Ahern
dea53bb80e tcp: Add l3index to tcp_md5sig_key and md5 functions
Add l3index to tcp_md5sig_key to represent the L3 domain of a key, and
add l3index to tcp_md5_do_add and tcp_md5_do_del to fill in the key.

With the key now based on an l3index, add the new parameter to the
lookup functions and consider the l3index when looking for a match.

The l3index comes from the skb when processing ingress packets leveraging
the helpers created for socket lookups, tcp_v4_sdif and inet_iif (and the
v6 variants). When the sdif index is set it means the packet ingressed a
device that is part of an L3 domain and inet_iif points to the VRF device.
For egress, the L3 domain is determined from the socket binding and
sk_bound_dev_if.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
David Ahern
534322ca3d ipv4/tcp: Pass dif and sdif to tcp_v4_inbound_md5_hash
The original ingress device index is saved to the cb space of the skb
and the cb is moved during tcp processing. Since tcp_v4_inbound_md5_hash
can be called before and after the cb move, pass dif and sdif to it so
the caller can save both prior to the cb move. Both are used by a later
patch.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
David Ahern
d14c77e0b2 ipv6/tcp: Pass dif and sdif to tcp_v6_inbound_md5_hash
The original ingress device index is saved to the cb space of the skb
and the cb is moved during tcp processing. Since tcp_v6_inbound_md5_hash
can be called before and after the cb move, pass dif and sdif to it so
the caller can save both prior to the cb move. Both are used by a later
patch.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
David Ahern
cea9760950 ipv4/tcp: Use local variable for tcp_md5_addr
Extract the typecast to (union tcp_md5_addr *) to a local variable
rather than the current long, inline declaration with function calls.

No functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
Pengcheng Yang
c9655008e7 tcp: fix "old stuff" D-SACK causing SACK to be treated as D-SACK
When we receive a D-SACK, where the sequence number satisfies:
	undo_marker <= start_seq < end_seq <= prior_snd_una
we consider this is a valid D-SACK and tcp_is_sackblock_valid()
returns true, then this D-SACK is discarded as "old stuff",
but the variable first_sack_index is not marked as negative
in tcp_sacktag_write_queue().

If this D-SACK also carries a SACK that needs to be processed
(for example, the previous SACK segment was lost), this SACK
will be treated as a D-SACK in the following processing of
tcp_sacktag_write_queue(), which will eventually lead to
incorrect updates of undo_retrans and reordering.

Fixes: fd6dad616d ("[TCP]: Earlier SACK block verification & simplify access to them")
Signed-off-by: Pengcheng Yang <yangpc@wangsu.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:42:28 -08:00
Jesper Dangaard Brouer
f13fc10785 page_pool: help compiler remove code in case CONFIG_NUMA=n
When kernel is compiled without NUMA support, then page_pool NUMA
config setting (pool->p.nid) doesn't make any practical sense. The
compiler cannot see that it can remove the code paths.

This patch avoids reading pool->p.nid setting in case of !CONFIG_NUMA,
in allocation and numa check code, which helps compiler to see the
optimisation potential. It leaves update code intact to keep API the
same.

 $ ./scripts/bloat-o-meter net/core/page_pool.o-numa-enabled \
                           net/core/page_pool.o-numa-disabled
 add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-113 (-113)
 Function                                     old     new   delta
 page_pool_create                             401     398      -3
 __page_pool_alloc_pages_slow                 439     426     -13
 page_pool_refill_alloc_cache                 425     328     -97
 Total: Before=3611, After=3498, chg -3.13%

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:37:52 -08:00
Jesper Dangaard Brouer
44768decb7 page_pool: handle page recycle for NUMA_NO_NODE condition
The check in pool_page_reusable (page_to_nid(page) == pool->p.nid) is
not valid if page_pool was configured with pool->p.nid = NUMA_NO_NODE.

The goal of the NUMA changes in commit d5394610b1 ("page_pool: Don't
recycle non-reusable pages"), were to have RX-pages that belongs to the
same NUMA node as the CPU processing RX-packet during softirq/NAPI. As
illustrated by the performance measurements.

This patch moves the NAPI checks out of fast-path, and at the same time
solves the NUMA_NO_NODE issue.

First realize that alloc_pages_node() with pool->p.nid = NUMA_NO_NODE
will lookup current CPU nid (Numa ID) via numa_mem_id(), which is used
as the the preferred nid.  It is only in rare situations, where
e.g. NUMA zone runs dry, that page gets doesn't get allocated from
preferred nid.  The page_pool API allows drivers to control the nid
themselves via controlling pool->p.nid.

This patch moves the NAPI check to when alloc cache is refilled, via
dequeuing/consuming pages from the ptr_ring. Thus, we can allow placing
pages from remote NUMA into the ptr_ring, as the dequeue/consume step
will check the NUMA node. All current drivers using page_pool will
alloc/refill RX-ring from same CPU running softirq/NAPI process.

Drivers that control the nid explicitly, also use page_pool_update_nid
when changing nid runtime.  To speed up transision to new nid the alloc
cache is now flushed on nid changes.  This force pages to come from
ptr_ring, which does the appropate nid check.

For the NUMA_NO_NODE case, when a NIC IRQ is moved to another NUMA
node, we accept that transitioning the alloc cache doesn't happen
immediately. The preferred nid change runtime via consulting
numa_mem_id() based on the CPU processing RX-packets.

Notice, to avoid stressing the page buddy allocator and avoid doing too
much work under softirq with preempt disabled, the NUMA check at
ptr_ring dequeue will break the refill cycle, when detecting a NUMA
mismatch. This will cause a slower transition, but its done on purpose.

Fixes: d5394610b1 ("page_pool: Don't recycle non-reusable pages")
Reported-by: Li RongQing <lirongqing@baidu.com>
Reported-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:37:52 -08:00
Lorenzo Bianconi
e322c07f83 mac80211: debugfs: improve airtime_flags handler readability
Improve airtime_flags debugfs handler readability reporting configured
airtime flags in both numeric and human readable manner

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/9df7e40b45e95bb0b320317831455beaed1ee3ee.1574872357.git.lorenzo@kernel.org
[remove AQL since it's no longer there]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-02 14:52:26 +01:00
Johannes Berg
7d6aa9ba4f Merge remote-tracking branch 'net-next/master' into mac80211-next
Merging to get the mac80211 updates that have since propagated
into net-next.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-02 14:50:58 +01:00
Markus Theil
02a6144996 mac80211: mesh: restrict airtime metric to peered established plinks
The following warning is triggered every time an unestablished mesh peer
gets dumped. Checks if a peer link is established before retrieving the
airtime link metric.

[ 9563.022567] WARNING: CPU: 0 PID: 6287 at net/mac80211/mesh_hwmp.c:345
               airtime_link_metric_get+0xa2/0xb0 [mac80211]
[ 9563.022697] Hardware name: PC Engines apu2/apu2, BIOS v4.10.0.3
[ 9563.022756] RIP: 0010:airtime_link_metric_get+0xa2/0xb0 [mac80211]
[ 9563.022838] Call Trace:
[ 9563.022897]  sta_set_sinfo+0x936/0xa10 [mac80211]
[ 9563.022964]  ieee80211_dump_station+0x6d/0x90 [mac80211]
[ 9563.023062]  nl80211_dump_station+0x154/0x2a0 [cfg80211]
[ 9563.023120]  netlink_dump+0x17b/0x370
[ 9563.023130]  netlink_recvmsg+0x2a4/0x480
[ 9563.023140]  ____sys_recvmsg+0xa6/0x160
[ 9563.023154]  ___sys_recvmsg+0x93/0xe0
[ 9563.023169]  __sys_recvmsg+0x7e/0xd0
[ 9563.023210]  do_syscall_64+0x4e/0x140
[ 9563.023217]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20191203180644.70653-1-markus.theil@tu-ilmenau.de
[rewrite commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-02 14:48:32 +01:00
Sven Eckelmann
b2e55ca892 batman-adv: Disable CONFIG_BATMAN_ADV_SYSFS by default
The sysfs support in batman-adv is deprecated since a while and will be
removed completely next year.

All tools which were known to the batman-adv development team are
supporting the batman-adv netlink interface since a while. Thus
disabling CONFIG_BATMAN_ADV_SYSFS by default should not cause problems on
most systems. It is still possible to enable it in case it is still
required in a specific setup.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-01-01 00:57:07 +01:00
Sven Eckelmann
68e039f966 batman-adv: Update copyright years for 2020
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-01-01 00:00:33 +01:00
David S. Miller
31d518f35e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Simple overlapping changes in bpf land wrt. bpf_helper_defs.h
handling.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-31 13:37:13 -08:00
Taehee Yoo
04b69426d8 hsr: fix slab-out-of-bounds Read in hsr_debugfs_rename()
hsr slave interfaces don't have debugfs directory.
So, hsr_debugfs_rename() shouldn't be called when hsr slave interface name
is changed.

Test commands:
    ip link add dummy0 type dummy
    ip link add dummy1 type dummy
    ip link add hsr0 type hsr slave1 dummy0 slave2 dummy1
    ip link set dummy0 name ap

Splat looks like:
[21071.899367][T22666] ap: renamed from dummy0
[21071.914005][T22666] ==================================================================
[21071.919008][T22666] BUG: KASAN: slab-out-of-bounds in hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.923640][T22666] Read of size 8 at addr ffff88805febcd98 by task ip/22666
[21071.926941][T22666]
[21071.927750][T22666] CPU: 0 PID: 22666 Comm: ip Not tainted 5.5.0-rc2+ #240
[21071.929919][T22666] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[21071.935094][T22666] Call Trace:
[21071.935867][T22666]  dump_stack+0x96/0xdb
[21071.936687][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.937774][T22666]  print_address_description.constprop.5+0x1be/0x360
[21071.939019][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.940081][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.940949][T22666]  __kasan_report+0x12a/0x16f
[21071.941758][T22666]  ? hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.942674][T22666]  kasan_report+0xe/0x20
[21071.943325][T22666]  hsr_debugfs_rename+0xaa/0xb0 [hsr]
[21071.944187][T22666]  hsr_netdev_notify+0x1fe/0x9b0 [hsr]
[21071.945052][T22666]  ? __module_text_address+0x13/0x140
[21071.945897][T22666]  notifier_call_chain+0x90/0x160
[21071.946743][T22666]  dev_change_name+0x419/0x840
[21071.947496][T22666]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
[21071.948600][T22666]  ? netdev_adjacent_rename_links+0x280/0x280
[21071.949577][T22666]  ? __read_once_size_nocheck.constprop.6+0x10/0x10
[21071.950672][T22666]  ? lock_downgrade+0x6e0/0x6e0
[21071.951345][T22666]  ? do_setlink+0x811/0x2ef0
[21071.951991][T22666]  do_setlink+0x811/0x2ef0
[21071.952613][T22666]  ? is_bpf_text_address+0x81/0xe0
[ ... ]

Reported-by: syzbot+9328206518f08318a5fd@syzkaller.appspotmail.com
Fixes: 4c2d5e33dc ("hsr: rename debugfs file when interface name is changed")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:36:27 -08:00
Davide Caratti
a5b72a083d net/sched: add delete_empty() to filters and use it in cls_flower
Revert "net/sched: cls_u32: fix refcount leak in the error path of
u32_change()", and fix the u32 refcount leak in a more generic way that
preserves the semantic of rule dumping.
On tc filters that don't support lockless insertion/removal, there is no
need to guard against concurrent insertion when a removal is in progress.
Therefore, for most of them we can avoid a full walk() when deleting, and
just decrease the refcount, like it was done on older Linux kernels.
This fixes situations where walk() was wrongly detecting a non-empty
filter, like it happened with cls_u32 in the error path of change(), thus
leading to failures in the following tdc selftests:

 6aa7: (filter, u32) Add/Replace u32 with source match and invalid indev
 6658: (filter, u32) Add/Replace u32 with custom hash table and invalid handle
 74c2: (filter, u32) Add/Replace u32 filter with invalid hash table id

On cls_flower, and on (future) lockless filters, this check is necessary:
move all the check_empty() logic in a callback so that each filter
can have its own implementation. For cls_flower, it's sufficient to check
if no IDRs have been allocated.

This reverts commit 275c44aa19.

Changes since v1:
 - document the need for delete_empty() when TCF_PROTO_OPS_DOIT_UNLOCKED
   is used, thanks to Vlad Buslov
 - implement delete_empty() without doing fl_walk(), thanks to Vlad Buslov
 - squash revert and new fix in a single patch, to be nice with bisect
   tests that run tdc on u32 filter, thanks to Dave Miller

Fixes: 275c44aa19 ("net/sched: cls_u32: fix refcount leak in the error path of u32_change()")
Fixes: 6676d5e416 ("net: sched: set dedicated tcf_walker flag when tp is empty")
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Suggested-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:35:19 -08:00
Vijay Khemka
9e860947d8 net/ncsi: Fix gma flag setting after response
gma_flag was set at the time of GMA command request but it should
only be set after getting successful response. Movinng this flag
setting in GMA response handler.

This flag is used mainly for not repeating GMA command once
received MAC address.

Signed-off-by: Vijay Khemka <vijaykhemka@fb.com>
Reviewed-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:34:17 -08:00
Kevin Kou
f398efc14a sctp: add enabled check for path tracepoint loop.
sctp_outq_sack is the main function handles SACK, it is called very
frequently. As the commit "move trace_sctp_probe_path into sctp_outq_sack"
added below code to this function, sctp tracepoint is disabled most of time,
but the loop of transport list will be always called even though the
tracepoint is disabled, this is unnecessary.

+	/* SCTP path tracepoint for congestion control debugging. */
+	list_for_each_entry(transport, transport_list, transports) {
+		trace_sctp_probe_path(transport, asoc);
+	}

This patch is to add tracepoint enabled check at outside of the loop of
transport list, and avoid traversing the loop when trace is disabled,
it is a small optimization.

Signed-off-by: Kevin Kou <qdkevin.kou@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:33:05 -08:00
Cambda Zhu
853697504d tcp: Fix highest_sack and highest_sack_seq
>From commit 50895b9de1 ("tcp: highest_sack fix"), the logic about
setting tp->highest_sack to the head of the send queue was removed.
Of course the logic is error prone, but it is logical. Before we
remove the pointer to the highest sack skb and use the seq instead,
we need to set tp->highest_sack to NULL when there is no skb after
the last sack, and then replace NULL with the real skb when new skb
inserted into the rtx queue, because the NULL means the highest sack
seq is tp->snd_nxt. If tp->highest_sack is NULL and new data sent,
the next ACK with sack option will increase tp->reordering unexpectedly.

This patch sets tp->highest_sack to the tail of the rtx queue if
it's NULL and new data is sent. The patch keeps the rule that the
highest_sack can only be maintained by sack processing, except for
this only case.

Fixes: 50895b9de1 ("tcp: highest_sack fix")
Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:28:39 -08:00
Eric Dumazet
f278b99ca6 tcp_cubic: refactor code to perform a divide only when needed
Neal Cardwell suggested to not change ca->delay_min
and apply the ack delay cushion only when Hystart ACK train
is still under consideration. This should avoid a 64bit
divide unless needed.

Tested:

40Gbit(mlx4) testbed (with sch_fq as packet scheduler)

$ echo -n 'file tcp_cubic.c +p'  >/sys/kernel/debug/dynamic_debug/control
$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpaa24 -l -4000000; done;nstat|egrep "Hystart"
  14815
  16280
  15293
  15563
  11574
  15145
  14789
  18548
  16972
  12520
TcpExtTCPHystartTrainDetect     10                 0.0
TcpExtTCPHystartTrainCwnd       1396               0.0
$ dmesg | tail -10
[ 4873.951350] hystart_ack_train (116 > 93) delay_min 24 (+ ack_delay 69) cwnd 80
[ 4875.155379] hystart_ack_train (55 > 50) delay_min 21 (+ ack_delay 29) cwnd 160
[ 4876.333921] hystart_ack_train (69 > 62) delay_min 23 (+ ack_delay 39) cwnd 130
[ 4877.519037] hystart_ack_train (69 > 60) delay_min 22 (+ ack_delay 38) cwnd 130
[ 4878.701559] hystart_ack_train (87 > 63) delay_min 24 (+ ack_delay 39) cwnd 160
[ 4879.844597] hystart_ack_train (93 > 50) delay_min 21 (+ ack_delay 29) cwnd 216
[ 4880.956650] hystart_ack_train (74 > 67) delay_min 20 (+ ack_delay 47) cwnd 108
[ 4882.098500] hystart_ack_train (61 > 57) delay_min 23 (+ ack_delay 34) cwnd 130
[ 4883.262056] hystart_ack_train (72 > 67) delay_min 21 (+ ack_delay 46) cwnd 130
[ 4884.418760] hystart_ack_train (74 > 67) delay_min 29 (+ ack_delay 38) cwnd 152

10Gbit(bnx2x) testbed (with sch_fq as packet scheduler)

$ echo -n 'file tcp_cubic.c +p'  >/sys/kernel/debug/dynamic_debug/control
$ nstat -n;for f in {1..10}; do ./super_netperf 1 -H lpk52 -l -4000000; done;nstat|egrep "Hystart"
   7050
   7065
   7100
   6900
   7202
   7263
   7189
   6869
   7463
   7034
TcpExtTCPHystartTrainDetect     10                 0.0
TcpExtTCPHystartTrainCwnd       3199               0.0
$ dmesg | tail -10
[  176.920012] hystart_ack_train (161 > 141) delay_min 83 (+ ack_delay 58) cwnd 264
[  179.144645] hystart_ack_train (164 > 159) delay_min 120 (+ ack_delay 39) cwnd 444
[  181.354527] hystart_ack_train (214 > 168) delay_min 125 (+ ack_delay 43) cwnd 436
[  183.539565] hystart_ack_train (170 > 147) delay_min 96 (+ ack_delay 51) cwnd 326
[  185.727309] hystart_ack_train (177 > 160) delay_min 61 (+ ack_delay 99) cwnd 128
[  187.947142] hystart_ack_train (184 > 167) delay_min 123 (+ ack_delay 44) cwnd 367
[  190.166680] hystart_ack_train (230 > 153) delay_min 116 (+ ack_delay 37) cwnd 444
[  192.327285] hystart_ack_train (210 > 206) delay_min 86 (+ ack_delay 120) cwnd 152
[  194.511392] hystart_ack_train (173 > 151) delay_min 94 (+ ack_delay 57) cwnd 239
[  196.736023] hystart_ack_train (149 > 146) delay_min 105 (+ ack_delay 41) cwnd 399

Fixes: 42f3a8aaae ("tcp_cubic: tweak Hystart detection for short RTT flows")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Link: https://www.spinics.net/lists/netdev/msg621886.html
Link: https://www.spinics.net/lists/netdev/msg621797.html
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 14:44:27 -08:00
David S. Miller
ba4028105e Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Remove #ifdef pollution around nf_ingress(), from Lukas Wunner.

2) Document ingress hook in netdevice, also from Lukas.

3) Remove htons() in tunnel metadata port netlink attributes,
   from Xin Long.

4) Missing erspan netlink attribute validation also from Xin Long.

5) Missing erspan version in tunnel, from Xin Long.

6) Missing attribute nest in NFTA_TUNNEL_KEY_OPTS_{VXLAN,ERSPAN}
   Patch from Xin Long.

7) Missing nla_nest_cancel() in tunnel netlink dump path,
   from Xin Long.

8) Remove two exported conntrack symbols with no clients,
   from Florian Westphal.

9) Add nft_meta_get_eval_time() helper to nft_meta, from Florian.

10) Add nft_meta_pkttype helper for loopback, also from Florian.

11) Add nft_meta_socket uid helper, from Florian Westphal.

12) Add nft_meta_cgroup helper, from Florian.

13) Add nft_meta_ifkind helper, from Florian.

14) Group all interface related meta selector, from Florian.

15) Add nft_prandom_u32() helper, from Florian.

16) Add nft_meta_rtclassid helper, from Florian.

17) Add support for matching on the slave device index,
    from Florian.

This batch, among other things, contains updates for the netfilter
tunnel netlink interface: This extension is still incomplete and lacking
proper userspace support which is actually my fault, I did not find the
time to go back and finish this. This update is breaking tunnel UAPI in
some aspects to fix it but do it better sooner than never.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 14:22:11 -08:00
Florian Westphal
1b789577f6 netfilter: arp_tables: init netns pointer in xt_tgchk_param struct
We get crash when the targets checkentry function tries to make
use of the network namespace pointer for arptables.

When the net pointer got added back in 2010, only ip/ip6/ebtables were
changed to initialize it, so arptables has this set to NULL.

This isn't a problem for normal arptables because no existing
arptables target has a checkentry function that makes use of par->net.

However, direct users of the setsockopt interface can provide any
target they want as long as its registered for ARP or UNPSEC protocols.

syzkaller managed to send a semi-valid arptables rule for RATEEST target
which is enough to trigger NULL deref:

kasan: GPF could be caused by NULL-ptr deref or user memory access
general protection fault: 0000 [#1] PREEMPT SMP KASAN
RIP: xt_rateest_tg_checkentry+0x11d/0xb40 net/netfilter/xt_RATEEST.c:109
[..]
 xt_check_target+0x283/0x690 net/netfilter/x_tables.c:1019
 check_target net/ipv4/netfilter/arp_tables.c:399 [inline]
 find_check_entry net/ipv4/netfilter/arp_tables.c:422 [inline]
 translate_table+0x1005/0x1d70 net/ipv4/netfilter/arp_tables.c:572
 do_replace net/ipv4/netfilter/arp_tables.c:977 [inline]
 do_arpt_set_ctl+0x310/0x640 net/ipv4/netfilter/arp_tables.c:1456

Fixes: add6746124 ("netfilter: add struct net * to target parameters")
Reported-by: syzbot+d7358a458d8a81aee898@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-12-30 13:09:04 +01:00
Vladimir Oltean
f685e609a3 net: dsa: Deny PTP on master if switch supports it
It is possible to kill PTP on a DSA switch completely and absolutely,
until a reboot, with a simple command:

tcpdump -i eth2 -j adapter_unsynced

where eth2 is the switch's DSA master.

Why? Well, in short, the PTP API in place today is a bit rudimentary and
relies on applications to retrieve the TX timestamps by polling the
error queue and looking at the cmsg structure. But there is no timestamp
identification of any sorts (except whether it's HW or SW), you don't
know how many more timestamps are there to come, which one is this one,
from whom it is, etc. In other words, the SO_TIMESTAMPING API is
fundamentally limited in that you can get a single HW timestamp from the
stack.

And the "-j adapter_unsynced" flag of tcpdump enables hardware
timestamping.

So let's imagine what happens when the DSA master decides it wants to
deliver TX timestamps to the skb's socket too:
- The timestamp that the user space sees is taken by the DSA master.
  Whereas the RX timestamp will eventually be overwritten by the DSA
  switch. So the RX and TX timestamps will be in different time bases
  (aka garbage).
- The user space applications have no way to deal with the second (real)
  TX timestamp finally delivered by the DSA switch, or even to know to
  wait for it.

Take ptp4l from the linuxptp project, for example. This is its behavior
after running tcpdump, before the patch:

ptp4l[172]: [6469.594] Unexpected data on socket err queue:
ptp4l[172]: [6469.693] rms    8 max   16 freq -21257 +/-  11 delay   748 +/-   0
ptp4l[172]: [6469.711] Unexpected data on socket err queue:
ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 03 aa 05 00 fd
ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: [6469.721] Unexpected data on socket err queue:
ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02
ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 01 c6 b1 00 fd
ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: [6469.838] Unexpected data on socket err queue:
ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02
ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 03 aa 06 00 fd
ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: [6469.848] Unexpected data on socket err queue:
ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 13 02
ptp4l[172]: 0010 00 36 00 00 02 00 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 04 1a 45 05 7f
ptp4l[172]: 0030 00 00 5e 05 41 32 27 c2 1a 68 00 04 9f ff fe 05
ptp4l[172]: 0040 de 06 00 01
ptp4l[172]: [6469.855] Unexpected data on socket err queue:
ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02
ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 01 c6 b2 00 fd
ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: [6469.974] Unexpected data on socket err queue:
ptp4l[172]: 0000 01 80 c2 00 00 0e 00 1f 7b 63 02 48 88 f7 10 02
ptp4l[172]: 0010 00 2c 00 00 02 00 00 00 00 00 00 00 00 00 00 00
ptp4l[172]: 0020 00 00 00 1f 7b ff fe 63 02 48 00 03 aa 07 00 fd
ptp4l[172]: 0030 00 00 00 00 00 00 00 00 00 00

The ptp4l program itself is heavily patched to show this (more details
here [0]). Otherwise, by default it just hangs.

On the other hand, with the DSA patch to disallow HW timestamping
applied:

tcpdump -i eth2 -j adapter_unsynced
tcpdump: SIOCSHWTSTAMP failed: Device or resource busy

So it is a fact of life that PTP timestamping on the DSA master is
incompatible with timestamping on the switch MAC, at least with the
current API. And if the switch supports PTP, taking the timestamps from
the switch MAC is highly preferable anyway, due to the fact that those
don't contain the queuing latencies of the switch. So just disallow PTP
on the DSA master if there is any PTP-capable switch attached.

[0]: https://sourceforge.net/p/linuxptp/mailman/message/36880648/

Fixes: 0336369d3a ("net: dsa: forward hardware timestamping ioctls to switch driver")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Richard Cochran <richardcochran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-28 11:43:41 -08:00
Michal Kubecek
3d2b847fb9 ethtool: provide link state with LINKSTATE_GET request
Implement LINKSTATE_GET netlink request to get link state information.

At the moment, only link up flag as provided by ETHTOOL_GLINK ioctl command
is returned.

LINKSTATE_GET request can be used with NLM_F_DUMP (without device
identification) to request the information for all devices in current
network namespace providing the data.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27 16:40:02 -08:00
Michal Kubecek
1b1b1847c8 ethtool: add LINKMODES_NTF notification
Send ETHTOOL_MSG_LINKMODES_NTF notification message whenever device link
settings or advertised modes are modified using ETHTOOL_MSG_LINKMODES_SET
netlink message or ETHTOOL_SLINKSETTINGS or ETHTOOL_SSET ioctl commands.

The notification message has the same format as reply to LINKMODES_GET
request. ETHTOOL_MSG_LINKMODES_SET netlink request only triggers the
notification if there is a change but the ioctl command handlers do not
check if there is an actual change and trigger the notification whenever
the commands are executed.

As all work is done by ethnl_default_notify() handler and callback
functions introduced to handle LINKMODES_GET requests, all that remains is
adding entries for ETHTOOL_MSG_LINKMODES_NTF into ethnl_notify_handlers and
ethnl_default_notify_ops lookup tables and calls to ethtool_notify() where
needed.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27 16:40:02 -08:00
Michal Kubecek
bfbcfe2032 ethtool: set link modes related data with LINKMODES_SET request
Implement LINKMODES_SET netlink request to set advertised linkmodes and
related attributes as ETHTOOL_SLINKSETTINGS and ETHTOOL_SSET commands do.

The request allows setting autonegotiation flag, speed, duplex and
advertised link modes.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27 16:40:02 -08:00
Michal Kubecek
f625aa9be8 ethtool: provide link mode information with LINKMODES_GET request
Implement LINKMODES_GET netlink request to get link modes related
information provided by ETHTOOL_GLINKSETTINGS and ETHTOOL_GSET ioctl
commands.

This request provides supported, advertised and peer advertised link modes,
autonegotiation flag, speed and duplex.

LINKMODES_GET request can be used with NLM_F_DUMP (without device
identification) to request the information for all devices in current
network namespace providing the data.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27 16:40:02 -08:00
Michal Kubecek
73286734c1 ethtool: add LINKINFO_NTF notification
Send ETHTOOL_MSG_LINKINFO_NTF notification message whenever device link
settings are modified using ETHTOOL_MSG_LINKINFO_SET netlink message or
ETHTOOL_SLINKSETTINGS or ETHTOOL_SSET ioctl commands.

The notification message has the same format as reply to LINKINFO_GET
request. ETHTOOL_MSG_LINKINFO_SET netlink request only triggers the
notification if there is a change but the ioctl command handlers do not
check if there is an actual change and trigger the notification whenever
the commands are executed.

As all work is done by ethnl_default_notify() handler and callback
functions introduced to handle LINKINFO_GET requests, all that remains is
adding entries for ETHTOOL_MSG_LINKINFO_NTF into ethnl_notify_handlers and
ethnl_default_notify_ops lookup tables and calls to ethtool_notify() where
needed.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27 16:40:02 -08:00
Michal Kubecek
5cf2a548bc ethtool: add default notification handler
The ethtool netlink notifications have the same format as related GET
replies so that if generic GET handling framework is used to process GET
requests, its callbacks and instance of struct get_request_ops can be
also used to compose corresponding notification message.

Provide function ethnl_std_notify() to be used as notification handler in
ethnl_notify_handlers table.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27 16:40:02 -08:00
Michal Kubecek
a53f3d41e4 ethtool: set link settings with LINKINFO_SET request
Implement LINKINFO_SET netlink request to set link settings queried by
LINKINFO_GET message.

Only physical port, phy MDIO address and MDI(-X) control can be set,
attempt to modify MDI(-X) status and transceiver is rejected.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27 16:40:02 -08:00
Michal Kubecek
459e0b81b3 ethtool: provide link settings with LINKINFO_GET request
Implement LINKINFO_GET netlink request to get basic link settings provided
by ETHTOOL_GLINKSETTINGS and ETHTOOL_GSET ioctl commands.

This request provides settings not directly related to autonegotiation and
link mode selection: physical port, phy MDIO address, MDI(-X) status,
MDI(-X) control and transceiver.

LINKINFO_GET request can be used with NLM_F_DUMP (without device
identification) to request the information for all devices in current
network namespace providing the data.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27 16:40:02 -08:00