Commit Graph

62755 Commits

Author SHA1 Message Date
Julia Lawall
3cec036990 RDS: drop double zeroing
sg_init_table zeroes its first argument, so the allocation of that argument
doesn't have to.

the semantic patch that makes this change is as follows:
(http://coccinelle.lip6.fr/)

// <smpl>
@@
expression x,n,flags;
@@

x =
- kcalloc
+ kmalloc_array
  (n,sizeof(*x),flags)
...
sg_init_table(x,n)
// </smpl>

Signed-off-by: Julia Lawall <Julia.Lawall@inria.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-20 19:09:11 -07:00
Vladimir Oltean
88525fc01c net: dsa: tag_sja1105: add compatibility with hwaccel VLAN tags
Check whether there is any hwaccel VLAN tag on RX, and if there is,
treat it as the tag_8021q header.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-20 19:01:34 -07:00
Vladimir Oltean
bbed0bbddd net: dsa: tag_8021q: add VLANs to the master interface too
The whole purpose of tag_8021q is to send VLAN-tagged traffic to the
CPU, from which the driver can decode the source port and switch id.

Currently this only works if the VLAN filtering on the master is
disabled. Change that by explicitly adding code to tag_8021q.c to add
the VLANs corresponding to the tags to the filter of the master
interface.

Because we now need to call vlan_vid_add, then we also need to hold the
RTNL mutex. Propagate that requirement to the callers of dsa_8021q_setup
and modify the existing call sites as appropriate. Note that one call
path, sja1105_best_effort_vlan_filtering_set -> sja1105_vlan_filtering
-> sja1105_setup_8021q_tagging, was already holding this lock.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-20 19:01:34 -07:00
Vladimir Oltean
2209158c90 net: dsa: install VLANs into the master's RX filter too
Most DSA switch tags shift the EtherType to the right, causing the
master to not parse the VLAN as VLAN.
However, not all switches do that (example: tail tags, tag_8021q etc),
and if the DSA master has "rx-vlan-filter: on" in ethtool -k, then we
have a problem.

Therefore, we could populate the VLAN table of the master, just in case
(for some switches it will not make a difference), so that network I/O
can work even with a VLAN filtering master.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-20 19:01:34 -07:00
Vladimir Oltean
adb256eb17 net: dsa: allow 8021q uppers while the bridge has vlan_filtering=0
When the bridge has VLAN awareness disabled there isn't any duplication
of functionality, since the bridge does not process VLAN. Don't deny
adding 8021q uppers to DSA switch ports in that case. The switch is
supposed to simply pass traffic leaving the VLAN tag as-is, and the
stack will happily strip the VLAN tag for all 8021q uppers that exist.

We need to ensure that there are no 8021q uppers when the user attempts
to enable bridge vlan_filtering.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-20 19:01:34 -07:00
Vladimir Oltean
707ec383b3 net: dsa: refuse configuration in prepare phase of dsa_port_vlan_filtering()
The current logic beats me a little bit. The comment that "bridge skips
-EOPNOTSUPP, so skip the prepare phase" was introduced in commit
fb2dabad69 ("net: dsa: support VLAN filtering switchdev attr").

I'm not sure:
(a) ok, the bridge skips -EOPNOTSUPP, but, so what, where are we
    returning -EOPNOTSUPP?
(b) even if we are, and I'm just not seeing it, what is the causality
    relationship between the bridge skipping -EOPNOTSUPP and DSA
    skipping the prepare phase, and just returning zero?

One thing is certain beyond doubt though, and that is that DSA currently
refuses VLAN filtering from the "commit" phase instead of "prepare", and
that this is not a good thing:

ip link add br0 type bridge
ip link add br1 type bridge vlan_filtering 1
ip link set swp2 master br0
ip link set swp3 master br1
[ 3790.379389] 001: sja1105 spi0.1: VLAN filtering is a global setting
[ 3790.379399] 001: ------------[ cut here ]------------
[ 3790.379403] 001: WARNING: CPU: 1 PID: 515 at net/switchdev/switchdev.c:157 switchdev_port_attr_set_now+0x9c/0xa4
[ 3790.379420] 001: swp3: Commit of attribute (id=6) failed.
[ 3790.379533] 001: [<c11ff588>] (switchdev_port_attr_set_now) from [<c11b62e4>] (nbp_vlan_init+0x84/0x148)
[ 3790.379544] 001: [<c11b62e4>] (nbp_vlan_init) from [<c11a2ff0>] (br_add_if+0x514/0x670)
[ 3790.379554] 001: [<c11a2ff0>] (br_add_if) from [<c1031b5c>] (do_setlink+0x38c/0xab0)
[ 3790.379565] 001: [<c1031b5c>] (do_setlink) from [<c1036fe8>] (__rtnl_newlink+0x44c/0x748)
[ 3790.379573] 001: [<c1036fe8>] (__rtnl_newlink) from [<c1037328>] (rtnl_newlink+0x44/0x60)
[ 3790.379580] 001: [<c1037328>] (rtnl_newlink) from [<c10315fc>] (rtnetlink_rcv_msg+0x124/0x2f8)
[ 3790.379590] 001: [<c10315fc>] (rtnetlink_rcv_msg) from [<c10926b8>] (netlink_rcv_skb+0xb8/0x110)
[ 3790.379806] 001: ---[ end trace 0000000000000002 ]---
[ 3790.379819] 001: sja1105 spi0.1 swp3: failed to initialize vlan filtering on this port

So move the current logic that may fail (except ds->ops->port_vlan_filtering,
that is way harder) into the prepare stage of the switchdev transaction.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-20 19:01:34 -07:00
Vladimir Oltean
1ce39f0ee8 net: dsa: convert denying bridge VLAN with existing 8021q upper to PRECHANGEUPPER
This is checking for the following order of operations, and makes sure
to deny that configuration:

ip link add link swp2 name swp2.100 type vlan id 100
ip link add br0 type bridge vlan_filtering 1
ip link set swp2 master br0
bridge vlan add dev swp2 vid 100

Instead of using vlan_for_each(), which looks at the VLAN filters
installed with vlan_vid_add(), just track the 8021q uppers. This has the
advantage of freeing up the vlan_vid_add() call for actual VLAN
filtering.

There is another change in this patch. The check is moved in slave.c,
from switch.c. I don't think it makes sense to have this 8021q upper
check for each switch port that gets notified of that VLAN addition
(these include DSA links and CPU ports, we know those can't have 8021q
uppers because they don't have a net_device registered for them), so
just do it in slave.c, for that one slave interface.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-20 19:01:33 -07:00
Vladimir Oltean
2b13840672 net: dsa: convert check for 802.1Q upper when bridged into PRECHANGEUPPER
DSA tries to prevent having a VLAN added by a bridge and by an 802.1Q
upper at the same time. It does that by checking the VID in
.ndo_vlan_rx_add_vid(), since that's something that the 8021q module
calls, via vlan_vid_add(). When a VLAN matches in both subsystems, this
check returns -EBUSY.

However the vlan_vid_add() function isn't specific to the 8021q module
in any way at all. It is simply the kernel's way to tell an interface to
add a VLAN to its RX filter and not drop that VLAN. So there's no reason
to return -EBUSY when somebody tries to call vlan_vid_add() for a VLAN
that was installed by the bridge. The proper behavior is to accept that
configuration.

So what's wrong is how DSA checks that it has an 8021q upper. It should
look at the actual uppers for that, not just assume that the 8021q
module was somewhere in the call stack of .ndo_vlan_rx_add_vid().

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-20 19:01:33 -07:00
Vladimir Oltean
eb46e8da1d net: dsa: rename dsa_slave_upper_vlan_check to something more suggestive
We'll be adding a new check in the PRECHANGEUPPER notifier, where we'll
need to check some VLAN uppers. It is hard to do that when there is
already a function named dsa_slave_upper_vlan_check. So rename this one.

Not to mention that this function probably shouldn't have started with
"dsa_slave_" in the first place, since the struct net_device argument
isn't a DSA slave, but an 8021q upper of one.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-20 19:01:33 -07:00
Vladimir Oltean
8350129930 net: dsa: deny enslaving 802.1Q upper to VLAN-aware bridge from PRECHANGEUPPER
There doesn't seem to be any strong technical reason for doing it this
way, but we'll be adding more checks for invalid upper device
configurations, and it will be easier to have them all grouped under
PRECHANGEUPPER.

Tested that it still works:
ip link set br0 type bridge vlan_filtering 1
ip link add link swp2 name swp2.100 type vlan id 100
ip link set swp2.100 master br0
[   20.321312] br0: port 5(swp2.100) entered blocking state
[   20.326711] br0: port 5(swp2.100) entered disabled state
Error: dsa_core: Cannot enslave VLAN device into VLAN aware bridge.
[   20.346549] br0: port 5(swp2.100) entered blocking state
[   20.351957] br0: port 5(swp2.100) entered disabled state

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-20 19:01:33 -07:00
Yunsheng Lin
1f14bd99e6 net: remove unnecessary NULL checking in napi_consume_skb()
When budget is non-zero, skb_unref() has already handled the
NULL checking.

When budget is zero, the dev_consume_skb_any() has handled NULL
checking in __dev_kfree_skb_irq(), or dev_kfree_skb() which also
ultimately call skb_unref().

So remove the unnecessary checking in napi_consume_skb().

Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-20 14:17:20 -07:00
Henry Ptasinski
fe81d9f618 net: sctp: Fix IPv6 ancestor_size calc in sctp_copy_descendant
When calculating ancestor_size with IPv6 enabled, simply using
sizeof(struct ipv6_pinfo) doesn't account for extra bytes needed for
alignment in the struct sctp6_sock. On x86, there aren't any extra
bytes, but on ARM the ipv6_pinfo structure is aligned on an 8-byte
boundary so there were 4 pad bytes that were omitted from the
ancestor_size calculation.  This would lead to corruption of the
pd_lobby pointers, causing an oops when trying to free the sctp
structure on socket close.

Fixes: 636d25d557 ("sctp: not copy sctp_sock pd_lobby in sctp_copy_descendant")
Signed-off-by: Henry Ptasinski <hptasinski@google.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-20 14:15:12 -07:00
Sonny Sasaka
a46b7ed4d5 Bluetooth: Fix auto-creation of hci_conn at Conn Complete event
Currently the code auto-creates hci_conn only if the remote address has
been discovered before. This may not be the case. For example, the
remote device may trigger connection after reboot at already-paired
state so there is no inquiry result found, but it is still correct to
create the hci_conn when Connection Complete event is received.

A better guard is to check against bredr allowlist. Devices in the
allowlist have been given permission to auto-connect.

Fixes: 4f40afc6c7 ("Bluetooth: Handle BR/EDR devices during suspend")
Signed-off-by: Sonny Sasaka <sonnysasaka@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-09-20 08:21:14 +02:00
Howard Chung
291f0c550f Bluetooth: Set scan parameters for ADV Monitor
Set scan parameters when there is at least one Advertisement monitor.

Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Reviewed-by: Manish Mandlik <mmandlik@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-09-20 08:12:00 +02:00
Miao-chen Chou
c88e397968 Bluetooth: Update Adv monitor count upon removal
This fixes the count of Adv monitor upon monitor removal.

The following test was performed.
- Start two btmgmt consoles, issue a btmgmt advmon-remove command on one
console and observe a MGMT_EV_ADV_MONITOR_REMOVED event on the other.

Signed-off-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Howard Chung <howardchung@google.com>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-09-20 08:11:59 +02:00
Dan Murphy
55f1331178 ethtool: Add 100base-FX link mode entries
Add entries for the 100base-FX full and half duplex supported modes.

$ ethtool eth0
        Supported ports: [ FIBRE ]
        Supported link modes:  100baseFX/Half 100baseFX/Full
        Supported pause frame use: Symmetric Receive-only
        Supports auto-negotiation: No
        Supported FEC modes: Not reported
        Advertised link modes: 100baseFX/Half 100baseFX/Full
        Advertised pause frame use: No
        Advertised auto-negotiation: No
        Advertised FEC modes: Not reported
        Speed: 100Mb/s
        Duplex: Full
        Auto-negotiation: off
        Port: MII
        PHYAD: 1
        Transceiver: external
        Supports Wake-on: gs
        Wake-on: d
        SecureOn password: 00:00:00:00:00:00
        Current message level: 0x00000000 (0)

        Link detected: yes

Signed-off-by: Dan Murphy <dmurphy@ti.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-19 16:54:35 -07:00
Xie He
09599729ee net/packet: Fix a comment about network_header
skb->nh.raw has been renamed as skb->network_header in 2007, in
commit b0e380b1d8 ("[SK_BUFF]: unions of just one member don't get
                      anything done, kill them")

So here we change it to the new name.

Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-19 16:40:48 -07:00
Andrew Lunn
0f06b855a9 net: dsa: wire up devlink info get
Allow the DSA drivers to implement the devlink call to get info info,
e.g. driver name, firmware version, ASIC ID, etc.

v2:
Combine declaration and the assignment on a single line.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 18:18:30 -07:00
Andrew Lunn
97c82c2313 net: dsa: Add devlink regions support to DSA
Allow DSA drivers to make use of devlink regions, via simple wrappers.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 18:17:45 -07:00
Andrew Lunn
ccc3e6b019 net: dsa: Add helper to convert from devlink to ds
Given a devlink instance, return the dsa switch it is associated to.

Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 18:17:45 -07:00
Andrew Lunn
d4602a9f47 net: devlink: region: Pass the region ops to the snapshot function
Pass the region to be snapshotted to the function performing the
snapshot. This allows one function to operate on numerous regions.

v4:
Add missing kerneldoc for ICE

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 18:17:45 -07:00
Necip Fazil Yildiran
db7cd91a4b net: ipv6: fix kconfig dependency warning for IPV6_SEG6_HMAC
When IPV6_SEG6_HMAC is enabled and CRYPTO is disabled, it results in the
following Kbuild warning:

WARNING: unmet direct dependencies detected for CRYPTO_HMAC
  Depends on [n]: CRYPTO [=n]
  Selected by [y]:
  - IPV6_SEG6_HMAC [=y] && NET [=y] && INET [=y] && IPV6 [=y]

WARNING: unmet direct dependencies detected for CRYPTO_SHA1
  Depends on [n]: CRYPTO [=n]
  Selected by [y]:
  - IPV6_SEG6_HMAC [=y] && NET [=y] && INET [=y] && IPV6 [=y]

WARNING: unmet direct dependencies detected for CRYPTO_SHA256
  Depends on [n]: CRYPTO [=n]
  Selected by [y]:
  - IPV6_SEG6_HMAC [=y] && NET [=y] && INET [=y] && IPV6 [=y]

The reason is that IPV6_SEG6_HMAC selects CRYPTO_HMAC, CRYPTO_SHA1, and
CRYPTO_SHA256 without depending on or selecting CRYPTO while those configs
are subordinate to CRYPTO.

Honor the kconfig menu hierarchy to remove kconfig dependency warnings.

Fixes: bf355b8d2c ("ipv6: sr: add core files for SR HMAC support")
Signed-off-by: Necip Fazil Yildiran <fazilyildiran@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 17:45:04 -07:00
David S. Miller
8f623a10c3 Here are some batman-adv bugfixes:
- fix wrong type use in backbone_gw hash, by Linus Luessing
 
  - disable TT re-routing for multicast packets, by Linus Luessing
 
  - Add missing include for in_interrupt(), by Sven Eckelmann
 
  - fix BLA/multicast issues for packets sent via unicast,
    by Linus Luessing (3 patches)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl9ks7UWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoePDEACQyp9GN3N2VvqfwZI8GQUPuFTY
 L7JjyzIDAL9B2zx25YGicznKmnjnIuM37kX9YEWWc+PPJTrc0d3CV/jEzqwIFxBo
 2stdB+x7idVgxI2drmOsFou5fxNknMxduW1CxQmr/V9c2PAkF/qkiZYLdHjSddVN
 WwMVnmWIP95cfkBZ1TsmxUdAh/vftjBGF0InbveX5ETsuPCXOVnApokwg6PqYGA8
 fZ0YorZx5h16JIlhuv8//yNb12w677AuC2ze1MXfFU165Xh8A4seUTLtFUZ7muTj
 0i5Bjc+yKzKUYjXOy6lIu6vWS3ogsvDedNTIWVcJ4K6hmniz6gzQggSBtiN8qt+O
 jwp8m3Rcci9gZo4UfBlOclJic2ywI6SecUFYxEdjHtGdl3vHzitp/tgVkTTjbI+v
 vCXThn1iC7g5TrheroYpWfdFOyChL5i3RmfE4ypOGNeND6ZcSYBrmAI1JbsPcEv4
 07qF5o61NReNWspCXMKjOjRNEiTeV3/lNOjIdzTCFajC208hyKptEZK7CH9FZPqr
 d3QHzr/xDce5kzVk2xiH/LTXcpZT4BjuKbCIaobsjuQCn3i2D4LGeE4L5Ek/WIX8
 yW4G8hXW2AgXeReGIn96WhCPLm8NtviAsmN9DFvN0tBmMQmbpyYFPzzGdtCGId00
 S87ikl+G+OD+Q/BCMA==
 =JjK9
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20200918' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are some batman-adv bugfixes:

 - fix wrong type use in backbone_gw hash, by Linus Luessing

 - disable TT re-routing for multicast packets, by Linus Luessing

 - Add missing include for in_interrupt(), by Sven Eckelmann

 - fix BLA/multicast issues for packets sent via unicast,
   by Linus Luessing (3 patches)
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 14:57:59 -07:00
Wang Hai
5f3666e839 net: tipc: Supply missing udp_media.h include file
If the header file containing a function's prototype isn't included by
the sourcefile containing the associated function, the build system
complains of missing prototypes.

Fixes the following W=1 kernel build warning(s):

net/tipc/udp_media.c:446:5: warning: no previous prototype for ‘tipc_udp_nl_dump_remoteip’ [-Wmissing-prototypes]
net/tipc/udp_media.c:532:5: warning: no previous prototype for ‘tipc_udp_nl_add_bearer_data’ [-Wmissing-prototypes]
net/tipc/udp_media.c:614:5: warning: no previous prototype for ‘tipc_udp_nl_bearer_add’ [-Wmissing-prototypes]

Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 14:56:44 -07:00
YueHaibing
7eae7f72ed tipc: Remove unused macro CF_SERVER
It is no used any more, so can remove it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 14:56:25 -07:00
Tom Parkin
f52e4b27d1 l2tp: fix up inconsistent rx/tx statistics
Historically L2TP core statistics count the L2TP header in the
per-session and per-tunnel byte counts tracked for transmission and
receipt.

Now that l2tp_xmit_skb updates tx stats, it is necessary for
l2tp_xmit_core to pass out the length of the transmitted packet so that
the statistics can be updated correctly.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 14:36:54 -07:00
Zeng Tao
7b066d173b net: openswitch: reuse the helper variable to improve the code readablity
In the function ovs_ct_limit_exit, there is already a helper vaibale
which could be reused to improve the readability, so i fix it in this
patch.

Signed-off-by: Zeng Tao <prime.zeng@hisilicon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 14:24:08 -07:00
Randy Dunlap
4bbd026cb9 net: bridge: delete duplicated words
Drop repeated words in net/bridge/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Roopa Prabhu <roopa@nvidia.com>
Cc: Nikolay Aleksandrov <nikolay@nvidia.com>
Cc: bridge@lists.linux-foundation.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 14:12:43 -07:00
Randy Dunlap
563f63e396 net: atm: delete duplicated words
Drop repeated words in net/atm/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Chas Williams <3chas3@gmail.com>
Cc: linux-atm-general@lists.sourceforge.net
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 14:12:43 -07:00
Randy Dunlap
6046219116 net: tipc: delete duplicated words
Drop repeated words in net/tipc/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jon Maloy <jmaloy@redhat.com>
Cc: Ying Xue <ying.xue@windriver.com>
Cc: tipc-discussion@lists.sourceforge.net
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 14:12:43 -07:00
Randy Dunlap
bb6d6895e2 net: bluetooth: delete duplicated words
Drop repeated words in net/bluetooth/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru>
Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 14:12:43 -07:00
Randy Dunlap
634a63e73f net: ipv6: delete duplicated words
Drop repeated words in net/ipv6/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 14:12:43 -07:00
Randy Dunlap
d936b1d537 net: rds: delete duplicated words
Drop repeated words in net/rds/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Cc: linux-rdma@vger.kernel.org
Cc: rds-devel@oss.oracle.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 14:12:43 -07:00
Randy Dunlap
4250b75b40 net: core: delete duplicated words
Drop repeated words in net/core/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 14:12:43 -07:00
Tuong Lien
23700da29b tipc: add automatic rekeying for encryption key
Rekeying is required for security since a key is less secure when using
for a long time. Also, key will be detached when its nonce value (or
seqno ...) is exhausted. We now make the rekeying process automatic and
configurable by user.

Basically, TIPC will at a specific interval generate a new key by using
the kernel 'Random Number Generator' cipher, then attach it as the node
TX key and securely distribute to others in the cluster as RX keys (-
the key exchange). The automatic key switching will then take over, and
make the new key active shortly. Afterwards, the traffic from this node
will be encrypted with the new session key. The same can happen in peer
nodes but not necessarily at the same time.

For simplicity, the automatically generated key will be initiated as a
per node key. It is not too hard to also support a cluster key rekeying
(e.g. a given node will generate a unique cluster key and update to the
others in the cluster...), but that doesn't bring much benefit, while a
per-node key is even more secure.

We also enable user to force a rekeying or change the rekeying interval
via netlink, the new 'set key' command option: 'TIPC_NLA_NODE_REKEYING'
is added for these purposes as follows:
- A value >= 1 will be set as the rekeying interval (in minutes);
- A value of 0 will disable the rekeying;
- A value of 'TIPC_REKEYING_NOW' (~0) will force an immediate rekeying;

The default rekeying interval is (60 * 24) minutes i.e. done every day.
There isn't any restriction for the value but user shouldn't set it too
small or too large which results in an "ineffective" rekeying (thats ok
for testing though).

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 13:58:37 -07:00
Tuong Lien
1ef6f7c939 tipc: add automatic session key exchange
With support from the master key option in the previous commit, it
becomes easy to make frequent updates/exchanges of session keys between
authenticated cluster nodes.
Basically, there are two situations where the key exchange will take in
place:

- When a new node joins the cluster (with the master key), it will need
  to get its peer's TX key, so that be able to decrypt further messages
  from that peer.

- When a new session key is generated (by either user manual setting or
  later automatic rekeying feature), the key will be distributed to all
  peer nodes in the cluster.

A key to be exchanged is encapsulated in the data part of a 'MSG_CRYPTO
/KEY_DISTR_MSG' TIPC v2 message, then xmit-ed as usual and encrypted by
using the master key before sending out. Upon receipt of the message it
will be decrypted in the same way as regular messages, then attached as
the sender's RX key in the receiver node.

In this way, the key exchange is reliable by the link layer, as well as
security, integrity and authenticity by the crypto layer.

Also, the forward security will be easily achieved by user changing the
master key actively but this should not be required very frequently.

The key exchange feature is independent on the presence of a master key
Note however that the master key still is needed for new nodes to be
able to join the cluster. It is also optional, and can be turned off/on
via the sysfs: 'net/tipc/key_exchange_enabled' [default 1: enabled].

Backward compatibility is guaranteed because for nodes that do not have
master key support, key exchange using master key ie. tx_key = 0 if any
will be shortly discarded at the message validation step. In other
words, the key exchange feature will be automatically disabled to those
nodes.

v2: fix the "implicit declaration of function 'tipc_crypto_key_flush'"
error in node.c. The function only exists when built with the TIPC
"CONFIG_TIPC_CRYPTO" option.

v3: use 'info->extack' for a message emitted due to netlink operations
instead (- David's comment).

Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 13:58:37 -07:00
Tuong Lien
daef1ee379 tipc: introduce encryption master key
In addition to the supported cluster & per-node encryption keys for the
en/decryption of TIPC messages, we now introduce one option for user to
set a cluster key as 'master key', which is simply a symmetric key like
the former but has a longer life cycle. It has two purposes:

- Authentication of new member nodes in the cluster. New nodes, having
  no knowledge of current session keys in the cluster will still be
  able to join the cluster as long as they know the master key. This is
  because all neighbor discovery (LINK_CONFIG) messages must be
  encrypted with this key.

- Encryption of session encryption keys during automatic exchange and
  update of those.This is a feature we will introduce in a later commit
  in this series.

We insert the new key into the currently unused slot 0 in the key array
and start using it immediately once the user has set it.
After joining, a node only knowing the master key should be fully
communicable to existing nodes in the cluster, although those nodes may
have their own session keys activated (i.e. not the master one). To
support this, we define a 'grace period', starting from the time a node
itself reports having no RX keys, so the existing nodes will use the
master key for encryption instead. The grace period can be extended but
will automatically stop after e.g. 5 seconds without a new report. This
is also the basis for later key exchanging feature as the new node will
be impossible to decrypt anything without the support from master key.

For user to set a master key, we define a new netlink flag -
'TIPC_NLA_NODE_KEY_MASTER', so it can be added to the current 'set key'
netlink command to specify the setting key to be a master key.

Above all, the traditional cluster/per-node key mechanism is guaranteed
to work when user comes not to use this master key option. This is also
compatible to legacy nodes without the feature supported.

Even this master key can be updated without any interruption of cluster
connectivity but is so is needed, this has to be coordinated and set by
the user.

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 13:58:37 -07:00
Tuong Lien
f779bf7922 tipc: optimize key switching time and logic
We reduce the lasting time for a pending TX key to be active as well as
for a passive RX key to be freed which generally helps speed up the key
switching. It is not expected to be too fast but should not be too slow
either. Also the key handling logic is simplified that a pending RX key
will be removed automatically if it is found not working after a number
of times; the probing for a pending TX key is now carried on a specific
message user ('LINK_PROTOCOL' or 'LINK_CONFIG') which is more efficient
than using a timer on broadcast messages, the timer is reserved for use
later as needed.

The kernel logs or 'pr***()' are now made as clear as possible to user.
Some prints are added, removed or changed to the debug-level. The
'TIPC_CRYPTO_DEBUG' definition is removed, and the 'pr_debug()' is used
instead which will be much helpful in runtime.

Besides we also optimize the code in some other places as a preparation
for later commits.

v2: silent more kernel logs, also use 'info->extack' for a message
emitted due to netlink operations instead (- David's comments).

Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 13:58:37 -07:00
Shannon Nelson
6700acc5f1 devlink: collect flash notify params into a struct
The dev flash status notify function parameter lists are getting
rather long, so add a struct to be filled and passed rather than
continuously changing the function signatures.

Signed-off-by: Shannon Nelson <snelson@pensando.io>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 13:54:23 -07:00
Shannon Nelson
f92970c694 devlink: add timeout information to status_notify
Add a timeout element to the DEVLINK_CMD_FLASH_UPDATE_STATUS
netlink message for use by a userland utility to show that
a particular firmware flash activity may take a long but
bounded time to finish.  Also add a handy helper for drivers
to make use of the new timeout value.

UI usage hints:
 - if non-zero, add timeout display to the end of the status line
 	[component] status_msg  ( Xm Ys : Am Bs )
     using the timeout value for Am Bs and updating the Xm Ys
     every second
 - if the timeout expires while awaiting the next update,
   display something like
 	[component] status_msg  ( timeout reached : Am Bs )
 - if new status notify messages are received, remove
   the timeout and start over

Signed-off-by: Shannon Nelson <snelson@pensando.io>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 13:54:22 -07:00
Vladimir Oltean
6565243c06 net: mscc: ocelot: add locking for the port TX timestamp ID
The ocelot_port->ts_id is used to:
(a) populate skb->cb[0] for matching the TX timestamp in the PTP IRQ
    with an skb.
(b) populate the REW_OP from the injection header of the ongoing skb.
Only then is ocelot_port->ts_id incremented.

This is a problem because, at least theoretically, another timestampable
skb might use the same ocelot_port->ts_id before that is incremented.
Normally all transmit calls are serialized by the netdev transmit
spinlock, but in this case, ocelot_port_add_txtstamp_skb() is also
called by DSA, which has started declaring the NETIF_F_LLTX feature
since commit 2b86cb8299 ("net: dsa: declare lockless TX feature for
slave ports").  So the logic of using and incrementing the timestamp id
should be atomic per port.

The solution is to use the global ocelot_port->ts_id only while
protected by the associated ocelot_port->ts_id_lock. That's where we
populate skb->cb[0]. Note that for ocelot, ocelot_port_add_txtstamp_skb
is called for the actual skb, but for felix, it is called for the skb's
clone. That is something which will also be changed in the future.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 13:52:33 -07:00
Francesco Ruggeri
0e4be9e57e net: use exponential backoff in netdev_wait_allrefs
The combination of aca_free_rcu, introduced in commit 2384d02520
("net/ipv6: Add anycast addresses to a global hashtable"), and
fib6_info_destroy_rcu, introduced in commit 9b0a8da8c4 ("net/ipv6:
respect rcu grace period before freeing fib6_info"), can result in
an extra rcu grace period being needed when deleting an interface,
with the result that netdev_wait_allrefs ends up hitting the msleep(250),
which is considerably longer than the required grace period.
This can result in long delays when deleting a large number of interfaces,
and it can be observed with this script:

ns=dummy-ns
NIFS=100

ip netns add $ns
ip netns exec $ns ip link set lo up
ip netns exec $ns sysctl net.ipv6.conf.default.disable_ipv6=0
ip netns exec $ns sysctl net.ipv6.conf.default.forwarding=1

for ((i=0; i<$NIFS; i++))
do
        if=eth$i
        ip netns exec $ns ip link add $if type dummy
        ip netns exec $ns ip link set $if up
        ip netns exec $ns ip -6 addr add 2021:$i::1/120 dev $if
done

for ((i=0; i<$NIFS; i++))
do
        if=eth$i
        ip netns exec $ns ip link del $if
done

ip netns del $ns

Instead of using a fixed msleep(250), this patch tries an extra
rcu_barrier() followed by an exponential backoff.

Time with this patch on a 5.4 kernel:

real	0m7.704s
user	0m0.385s
sys	0m1.230s

Time without this patch:

real    0m31.522s
user    0m0.438s
sys     0m1.156s

v2: use exponential backoff instead of trying to wake up
    netdev_wait_allrefs.
v3: preserve reverse christmas tree ordering of local variables
v4: try an extra rcu_barrier before the backoff, plus some
    cosmetic changes.

Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-18 13:47:31 -07:00
Aloka Dixit
632189a018 mac80211: Unsolicited broadcast probe response support
This patch adds mac80211 support to configure unsolicited
broadcast probe response transmission for in-band discovery in 6GHz.

Changes include functions to store and retrieve probe response template,
and packet interval (0 - 20 TUs).
Setting interval to 0 disables the unsolicited broadcast probe response
transmission.

Signed-off-by: Aloka Dixit <alokad@codeaurora.org>
Link: https://lore.kernel.org/r/010101747a946b35-ad25858a-1f1f-48df-909e-dc7bf26d9169-000000@us-west-2.amazonses.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 14:06:20 +02:00
Aloka Dixit
7443dcd1f1 nl80211: Unsolicited broadcast probe response support
This patch adds new attributes to support unsolicited broadcast
probe response transmission used for in-band
discovery in 6GHz band (IEEE P802.11ax/D6.0 26.17.2.3.2, AP behavior for
fast passive scanning).
The new attribute, NL80211_ATTR_UNSOL_BCAST_PROBE_RESP, is nested which
supports following parameters:
(1) NL80211_UNSOL_BCAST_PROBE_RESP_ATTR_INT - Packet interval
(2) NL80211_UNSOL_BCAST_PROBE_RESP_ATTR_TMPL - Template data

Signed-off-by: Aloka Dixit <alokad@codeaurora.org>
Link: https://lore.kernel.org/r/010101747a946698-aac263ae-2ed3-4dab-9590-0bc7131214e1-000000@us-west-2.amazonses.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 14:06:20 +02:00
Aloka Dixit
295b02c4be mac80211: Add FILS discovery support
This patch adds mac80211 support to configure FILS discovery
transmission.
Changes include functions to store and retrieve FILS discovery
template, minimum and maximum packet intervals.

Signed-off-by: Aloka Dixit <alokad@codeaurora.org>
Link: https://lore.kernel.org/r/20200805011838.28166-3-alokad@codeaurora.org
[remove SUPPORTS_FILS_DISCOVERY, driver can just set wiphy info]
Link: https://lore.kernel.org/r/010101747a7b3cbb-6edaa89c-436d-4391-8765-61456d7f5f4e-000000@us-west-2.amazonses.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 14:06:20 +02:00
Aloka Dixit
291c49ded2 nl80211: Add FILS discovery support
FILS discovery attribute, NL80211_ATTR_FILS_DISCOVERY, is nested which
supports following parameters as given in IEEE Std 802.11ai-2016,
Annex C.3 MIB detail:
(1) NL80211_FILS_DISCOVERY_ATTR_INT_MIN - Minimum packet interval
(2) NL80211_FILS_DISCOVERY_ATTR_INT_MAX - Maximum packet interval
(3) NL80211_FILS_DISCOVERY_ATTR_TMPL - Template data

Signed-off-by: Aloka Dixit <alokad@codeaurora.org>
Link: https://lore.kernel.org/r/20200805011838.28166-2-alokad@codeaurora.org
[fix attribute and other names, use NLA_RANGE(), use policy only once]
Link: https://lore.kernel.org/r/010101747a7b38a8-306f06b2-9061-4baf-81c1-054a42a18e22-000000@us-west-2.amazonses.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 14:06:19 +02:00
John Crispin
75bcbd6913 mac80211: fix 80 MHz association to 160/80+80 AP on 6 GHz
When trying to associate to an AP support 180 or 80+80 MHz on 6 GHz with a
STA that only has 80 Mhz support the cf2 field inside the chandef will get
set causing the association to fail when trying to validate the chandef.
Fix this by checking the support flags prior to setting cf2.

Fixes: 57fa5e85d5 ("mac80211: determine chandef from HE 6 GHz operation")
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20200918115304.1135693-1-john@phrozen.org
[reword commit message a bit]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 14:01:24 +02:00
Felix Fietkau
5595870f77 mac80211: allow bigger A-MSDU sizes in VHT, even if HT is limited
Some APs (e.g. Asus RT-AC88U) have been observed to report an HT MSDU size
limit of 3839 and a VHT limit of 7991. These APs can handle bigger frames
than 3839 bytes just fine, so we should remove the VHT limit based on the
HT capabilities. This improves tx throughput.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200916164611.8022-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 13:06:38 +02:00
Thomas Pedersen
11b34737b1 nl80211: support setting S1G channels
S1G channels have a single width defined per frequency, so
derive it from the channel flags with
ieee80211_s1g_channel_width().

Also support setting an S1G channel where control frequency may
differ from operating, and add some basic validation to
ensure the control channel is with the operating.

Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200908190323.15814-6-thomas@adapt-ip.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 12:31:17 +02:00
Thomas Pedersen
1d47f1198d nl80211: correctly validate S1G beacon head
The S1G beacon has a different header size than regular
beacons, so adjust the beacon head validator.

Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200908190323.15814-5-thomas@adapt-ip.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 12:30:57 +02:00
Thomas Pedersen
68dbad8c65 cfg80211: regulatory: handle S1G channels
S1G channels have a minimum bandwidth of 1Mhz, and there
is a 1:1 mapping of allowed bandwidth to channel number.

Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200908190323.15814-4-thomas@adapt-ip.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 12:29:23 +02:00
Thomas Pedersen
d65a977087 nl80211: advertise supported channel width in S1G
S1G supports 5 channel widths: 1, 2, 4, 8, and 16. One
channel width is allowed per frequency in each operating
class, so it makes more sense to advertise the specific
channel width allowed.

Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200908190323.15814-3-thomas@adapt-ip.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 12:28:06 +02:00
Felix Fietkau
4b7afb52c8 mac80211: reorganize code to remove a forward declaration
Remove the newly added ieee80211_set_vif_encap_ops declaration.
No further code changes.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200908123702.88454-15-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 12:25:21 +02:00
Felix Fietkau
f02dff93e2 mac80211: extend ieee80211_tx_status_ext to support bulk free
Store processed skbs ready to be freed in a list so the driver bulk free them

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200908123702.88454-13-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 12:24:25 +02:00
Felix Fietkau
23e9f1ef16 mac80211: support using ieee80211_tx_status_ext to free skbs without status info
For encap-offloaded packets, ieee80211_free_txskb cannot be used, since it
does not have the vif pointer.
Using ieee80211_tx_status_ext for this purpose has the advantage of being able
avoid an extra station lookup for AQL

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200908123702.88454-12-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 12:22:56 +02:00
Felix Fietkau
a260e49525 mac80211: unify 802.3 (offload) and 802.11 tx status codepath
Make ieee80211_tx_status_8023 call ieee80211_tx_status_ext, similar to
ieee80211_tx_status.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200908123702.88454-11-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 12:20:49 +02:00
Felix Fietkau
9abf4e4983 mac80211: optimize station connection monitor
Calling mod_timer for every rx/tx packet can be quite expensive.
Instead of constantly updating the timer, we can simply let it run out
and check the timestamp of the last ACK or rx packet to re-arm it.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200908123702.88454-9-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 12:19:04 +02:00
Felix Fietkau
1ff4e8f2de mac80211: notify the driver when a sta uses 4-address mode
This is needed for encapsulation offload of 4-address mode packets

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200908123702.88454-14-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 12:16:16 +02:00
Felix Fietkau
cc20ff2c6b mac80211: swap NEED_TXPROCESSING and HW_80211_ENCAP tx flags
In order to unify the tx status path, the hw 802.11 encapsulation flag
needs to survive the trip to the tx status call.
Since we don't have any free bits in info->flags, we need to move one.
IEEE80211_TX_INTFL_NEED_TXPROCESSING is only used internally in mac80211,
and only before the call into the driver.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200908123702.88454-10-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 12:13:06 +02:00
Felix Fietkau
c74114d7d5 mac80211: remove tx status call to ieee80211_sta_register_airtime
All drivers using airtime fairness are calling ieee80211_sta_register_airtime
directly, now they must. Document this as well.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200908123702.88454-8-nbd@nbd.name
[johannes: update the documentation to suit]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 12:09:25 +02:00
Felix Fietkau
3318111cf6 mac80211: reduce duplication in tx status functions
Move redundant functionality from __ieee80211_tx_status into
ieee80211_tx_status_ext. Preparation for unifying with the 802.3 tx status
codepath.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200908123702.88454-7-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 12:06:38 +02:00
Felix Fietkau
6aea26ce5a mac80211: rework tx encapsulation offload API
The current API (which lets the driver turn on/off per vif directly) has a
number of limitations:
- it does not deal with AP_VLAN
- conditions for enabling (no tkip, no monitor) are only checked at
  add_interface time
- no way to indicate 4-addr support

In order to address this, store offload flags in struct ieee80211_vif
(easy to extend for decap offload later). mac80211 initially sets the enable
flag, but gives the driver a chance to modify it before its settings are
applied. In addition to the .add_interface op, a .update_vif_offload op is
introduced, which can be used for runtime changes.

If a driver can't disable encap offload at runtime, or if it has some extra
limitations, it can simply override the flags within those ops.

Support for encap offload with 4-address mode interfaces can be enabled
by setting a flag from .add_interface or .update_vif_offload.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200908123702.88454-6-nbd@nbd.name
[resolved conflict with commit aa2092a9ba ("ath11k: add raw mode and
software crypto support")]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 12:02:57 +02:00
Felix Fietkau
ae04515276 mac80211: set info->control.hw_key for encap offload packets
This is needed for drivers that don't do the key lookup themselves

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200908123702.88454-5-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 11:59:47 +02:00
Felix Fietkau
aea6a3f031 mac80211: skip encap offload for tx multicast/control packets
This simplifies the checks in the encap offload tx handler and allows using
it in cases where software crypto is used for multicast packets, e.g. when
using an AP_VLAN.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200908123702.88454-4-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 11:58:21 +02:00
Felix Fietkau
96ae9cd02c mac80211: check and refresh aggregation session in encap offload tx
Update the last_tx timestamp to avoid tearing down the aggregation session
early. Fall back to the slow path if the session setup is still running

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200908123702.88454-3-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 11:55:05 +02:00
Felix Fietkau
5f8d69eaab mac80211: add missing queue/hash initialization to 802.3 xmit
Fixes AQL for encap-offloaded tx

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200908123702.88454-2-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 11:54:44 +02:00
YueHaibing
9e0b3bfe5f lib80211: Remove unused macro DRV_NAME
There is no caller in tree any more.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Link: https://lore.kernel.org/r/20200829115506.17828-1-yuehaibing@huawei.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 11:53:00 +02:00
Felix Fietkau
3bd5c7a28a mac80211: do not allow bigger VHT MPDUs than the hardware supports
Limit maximum VHT MPDU size by local capability.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200917125031.45009-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 11:44:26 +02:00
Johannes Berg
c0de8776af cfg80211: fix 6 GHz channel conversion
We shouldn't accept any channels bigger than 233, fix that.

Reported-by: Amar <asinghal@codeaurora.org>
Fixes: d1a1646c0d ("cfg80211: adapt to new channelization of the 6GHz band")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200917115222.312ba6f1d461.I3a8c8fbcc3cc019814fd9cd0aced7eb591626136@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 11:43:46 +02:00
Wen Gong
780a8c9efc mac80211: do not disable HE if HT is missing on 2.4 GHz
VHT is not supported on 2.4 GHz, but HE is; don't disable HE if HT
is missing there, do that only on 5 GHz (6 GHz is only HE).

Fixes: 57fa5e85d5 ("mac80211: determine chandef from HE 6 GHz operation")
Signed-off-by: Wen Gong <wgong@codeaurora.org>
Link: https://lore.kernel.org/r/010101747cb617f2-593c5410-1648-4a42-97a0-f3646a5a6dd1-000000@us-west-2.amazonses.com
[rewrite the commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 11:41:50 +02:00
Aloka Dixit
412a84b571 mac80211: Fix radiotap header channel flag for 6GHz band
Radiotap header field 'Channel flags' has '2 GHz spectrum' set to
'true' for 6GHz packet.
Change it to 5GHz as there isn't a separate option available for 6GHz.

Signed-off-by: Aloka Dixit <alokad@codeaurora.org>
Link: https://lore.kernel.org/r/010101747ab7b703-1d7c9851-1594-43bf-81f7-f79ce7a67cc6-000000@us-west-2.amazonses.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 11:40:41 +02:00
Necip Fazil Yildiran
b959ba9f46 lib80211: fix unmet direct dependendices config warning when !CRYPTO
When LIB80211_CRYPT_CCMP is enabled and CRYPTO is disabled, it results in unmet
direct dependencies config warning. The reason is that LIB80211_CRYPT_CCMP
selects CRYPTO_AES and CRYPTO_CCM, which are subordinate to CRYPTO. This is
reproducible with CRYPTO disabled and R8188EU enabled, where R8188EU selects
LIB80211_CRYPT_CCMP but does not select or depend on CRYPTO.

Honor the kconfig menu hierarchy to remove kconfig dependency warnings.

Fixes: a11e2f8548 ("lib80211: use crypto API ccm(aes) transform for CCMP processing")
Signed-off-by: Necip Fazil Yildiran <fazilyildiran@gmail.com>
Link: https://lore.kernel.org/r/20200909095452.3080-1-fazilyildiran@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 11:37:24 +02:00
Felix Fietkau
8e280369b9 mac80211: add AQL support for VHT160 tx rates
When converting from struct ieee80211_tx_rate to ieee80211_rx_status,
there was one check missing to fill in the bandwidth for 160 MHz

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20200915085945.3782-2-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 11:36:03 +02:00
Felix Fietkau
1f38b8c564 mac80211: extend AQL aggregation estimation to HE and fix unit mismatch
The unit of the return value of ieee80211_get_rate_duration is nanoseconds, not
microseconds. Adjust the duration checks to account for that.
For higher data rates, allow larger estimated aggregation sizes, and add some
values for HE as well, which can use much larger aggregates.
Since small packets with high data rates can now lead to duration values too
small for info->tx_time_est, return a minimum of 4us.

Fixes: f01cfbaf9b ("mac80211: improve AQL aggregation estimation for low data rates")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20200915085945.3782-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-09-18 11:35:53 +02:00
Paolo Abeni
1d39cd8cf7 mptcp: fix integer overflow in mptcp_subflow_discard_data()
Christoph reported an infinite loop in the subflow receive path
under stress condition.

If there are multiple subflows, each of them using a large send
buffer, the delta between the sequence number used by
MPTCP-level retransmission can and the current msk->ack_seq
can be greater than MAX_INT.

In the above scenario, when calling mptcp_subflow_discard_data(),
such delta will be truncated to int, and could result in a negative
number: no bytes will be dropped, and subflow_check_data_avail()
will try again to process the same packet, looping forever.

This change addresses the issue by expanding the 'limit' size to 64
bits, so that overflows are not possible anymore.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/87
Fixes: 6719331c2f ("mptcp: trigger msk processing even for OoO data")
Reported-and-tested-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17 18:04:48 -07:00
Ursula Braun
ac679364b9 net/smc: fix double kfree in smc_listen_work()
If smc_listen_rmda_finish() returns with an error, the storage
addressed by 'buf' is freed a second time.
Consolidate freeing under a common label and jump to that label.

Fixes: 6bb14e48ee ("net/smc: dynamic allocation of CLC proposal buffer")
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17 18:03:56 -07:00
Qinglang Miao
2170ff0819 net: hsr: Convert to DEFINE_SHOW_ATTRIBUTE
Use DEFINE_SHOW_ATTRIBUTE macro to simplify the code.

Signed-off-by: Qinglang Miao <miaoqinglang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17 17:05:10 -07:00
Yang Yingliang
4d11af5d00 netlink: add spaces around '&' in netlink_recv/sendmsg()
It's hard to read the code without spaces around '&',
for better reading, add spaces around '&'.

Signed-off-by: Yang Yingliang <yangyingliang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17 16:53:47 -07:00
Ye Bin
c2ec6bc010 mptcp: Fix unsigned 'max_seq' compared with zero in mptcp_data_queue_ofo
Fixes coccicheck warnig:
net/mptcp/protocol.c:164:11-18: WARNING: Unsigned expression compared with zero: max_seq > 0

Fixes: ab174ad8ef ("mptcp: move ooo skbs into msk out of order queue")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Ye Bin <yebin10@huawei.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17 16:45:05 -07:00
Michal Kubecek
19a83d36f9 ethtool: add and use message type for tunnel info reply
Tunnel offload info code uses ETHTOOL_MSG_TUNNEL_INFO_GET message type (cmd
field in genetlink header) for replies to tunnel info netlink request, i.e.
the same value as the request have. This is a problem because we are using
two separate enums for userspace to kernel and kernel to userspace message
types so that this ETHTOOL_MSG_TUNNEL_INFO_GET (28) collides with
ETHTOOL_MSG_CABLE_TEST_TDR_NTF which is what message type 28 means for
kernel to userspace messages.

As the tunnel info request reached mainline in 5.9 merge window, we should
still be able to fix the reply message type without breaking backward
compatibility.

Fixes: c7d759eb7b ("ethtool: add tunnel info interface")
Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17 16:43:20 -07:00
Xie He
b79a80bd6d net/packet: Fix a comment about mac_header
1. Change all "dev->hard_header" to "dev->header_ops"

2. On receiving incoming frames when header_ops == NULL:

The comment only says what is wrong, but doesn't say what is right.
This patch changes the comment to make it clear what is right.

3. On transmitting and receiving outgoing frames when header_ops == NULL:

The comment explains that the LL header will be later added by the driver.

However, I think it's better to simply say that the LL header is invisible
to us. This phrasing is better from a software engineering perspective,
because this makes it clear that what happens in the driver should be
hidden from us and we should not care about what happens internally in the
driver.

4. On resuming the LL header (for RAW frames) when header_ops == NULL:

The comment says we are "unlikely" to restore the LL header.

However, we should say that we are "unable" to restore it.
It's not possible (rather than not likely) to restore it, because:

1) There is no way for us to restore because the LL header internally
processed by the driver should be invisible to us.

2) In function packet_rcv and tpacket_rcv, the code only tries to restore
the LL header when header_ops != NULL.

Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-17 16:24:45 -07:00
Karsten Graul
ddcc9b7feb net/smc: check variable before dereferencing in smc_close.c
smc->clcsock and smc->clcsock->sk are used before the check if they can
be dereferenced. Fix this by checking the variables first.

Fixes: a60a2b1e0a ("net/smc: reduce active tcp_listen workers")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-16 17:30:16 -07:00
Nikolay Aleksandrov
d5bf31ddd8 net: bridge: mcast: don't ignore return value of __grp_src_toex_excl
When we're handling TO_EXCLUDE report in EXCLUDE filter mode we should
not ignore the return value of __grp_src_toex_excl() as we'll miss
sending notifications about group changes.

Fixes: 5bf1e00b68 ("net: bridge: mcast: support for IGMPV3/MLDv2 CHANGE_TO_INCLUDE/EXCLUDE report")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-16 17:13:25 -07:00
Magnus Karlsson
642e450b6b xsk: Do not discard packet when NETDEV_TX_BUSY
In the skb Tx path, transmission of a packet is performed with
dev_direct_xmit(). When NETDEV_TX_BUSY is set in the drivers, it
signifies that it was not possible to send the packet right now,
please try later. Unfortunately, the xsk transmit code discarded the
packet and returned EBUSY to the application. Fix this unnecessary
packet loss, by not discarding the packet in the Tx ring and return
EAGAIN. As EAGAIN is returned to the application, it can then retry
the send operation later and the packet will then likely be sent as
the driver will then likely have space/resources to send the packet.

In summary, EAGAIN tells the application that the packet was not
discarded from the Tx ring and that it needs to call send()
again. EBUSY, on the other hand, signifies that the packet was not
sent and discarded from the Tx ring. The application needs to put
the packet on the Tx ring again if it wants it to be sent.

Fixes: 35fcde7f8d ("xsk: support for Tx")
Reported-by: Arkadiusz Zema <A.Zema@falconvsystems.com>
Suggested-by: Arkadiusz Zema <A.Zema@falconvsystems.com>
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Link: https://lore.kernel.org/bpf/1600257625-2353-1-git-send-email-magnus.karlsson@gmail.com
2020-09-16 23:36:58 +02:00
David S. Miller
d5d325eae7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-09-15

The following pull-request contains BPF updates for your *net* tree.

We've added 12 non-merge commits during the last 19 day(s) which contain
a total of 10 files changed, 47 insertions(+), 38 deletions(-).

The main changes are:

1) docs/bpf fixes, from Andrii.

2) ld_abs fix, from Daniel.

3) socket casting helpers fix, from Martin.

4) hash iterator fixes, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-15 19:26:21 -07:00
YiFei Zhu
984fe94f94 bpf: Mutex protect used_maps array and count
To support modifying the used_maps array, we use a mutex to protect
the use of the counter and the array. The mutex is initialized right
after the prog aux is allocated, and destroyed right before prog
aux is freed. This way we guarantee it's initialized for both cBPF
and eBPF.

Signed-off-by: YiFei Zhu <zhuyifei@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Cc: YiFei Zhu <zhuyifei1999@gmail.com>
Link: https://lore.kernel.org/bpf/20200915234543.3220146-2-sdf@google.com
2020-09-15 18:28:27 -07:00
Martin KaFai Lau
8c33dadc3e bpf: Bpf_skc_to_* casting helpers require a NULL check on sk
The bpf_skc_to_* type casting helpers are available to
BPF_PROG_TYPE_TRACING.  The traced PTR_TO_BTF_ID may be NULL.
For example, the skb->sk may be NULL.  Thus, these casting helpers
need to check "!sk" also and this patch fixes them.

Fixes: 0d4fad3e57 ("bpf: Add bpf_skc_to_udp6_sock() helper")
Fixes: 478cfbdf5f ("bpf: Add bpf_skc_to_{tcp, tcp_timewait, tcp_request}_sock() helpers")
Fixes: af7ec13833 ("bpf: Add bpf_skc_to_tcp6_sock() helper")
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200915182959.241101-1-kafai@fb.com
2020-09-15 18:09:43 -07:00
Ido Schimmel
0695564bb4 nexthop: Only emit a notification when nexthop is actually deleted
Currently, the in-kernel delete notification is emitted from the error
path of nexthop_add() and replace_nexthop(), which can be confusing to
in-kernel listeners as they are not familiar with the nexthop.

Instead, only emit the notification when the nexthop is actually
deleted. The following sub-cases are covered:

1. User space deletes the nexthop
2. The nexthop is deleted by the kernel due to a netdev event (e.g.,
   nexthop device going down)
3. A group is deleted because its last nexthop is being deleted
4. The network namespace of the nexthop device is deleted

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-15 16:31:25 -07:00
Ido Schimmel
80690ec6b5 nexthop: Convert to blocking notification chain
Currently, the only listener of the nexthop notification chain is the
VXLAN driver. Subsequent patches will add more listeners (e.g., device
drivers such as netdevsim) that need to be able to block when processing
notifications.

Therefore, convert the notification chain to a blocking one. This is
safe as notifications are always emitted from process context.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-15 16:31:17 -07:00
Jiri Pirko
e2ce94dc1d devlink: introduce the health reporter test command
Introduce a test command for health reporters. User might use this
command to trigger test event on a reporter if the reporter supports it.

Signed-off-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-15 15:57:16 -07:00
David Ahern
2fbc6e89b2 ipv4: Update exception handling for multipath routes via same device
Kfir reported that pmtu exceptions are not created properly for
deployments where multipath routes use the same device.

After some digging I see 2 compounding problems:
1. ip_route_output_key_hash_rcu is updating the flowi4_oif *after*
   the route lookup. This is the second use case where this has
   been a problem (the first is related to use of vti devices with
   VRF). I can not find any reason for the oif to be changed after the
   lookup; the code goes back to the start of git. It does not seem
   logical so remove it.

2. fib_lookups for exceptions do not call fib_select_path to handle
   multipath route selection based on the hash.

The end result is that the fib_lookup used to add the exception
always creates it based using the first leg of the route.

An example topology showing the problem:

                 |  host1
             +------+
             | eth0 |  .209
             +------+
                 |
             +------+
     switch  | br0  |
             +------+
                 |
       +---------+---------+
       | host2             |  host3
   +------+             +------+
   | eth0 | .250        | eth0 | 192.168.252.252
   +------+             +------+

   +-----+             +-----+
   | vti | .2          | vti | 192.168.247.3
   +-----+             +-----+
       \                  /
 =================================
 tunnels
         192.168.247.1/24

for h in host1 host2 host3; do
        ip netns add ${h}
        ip -netns ${h} link set lo up
        ip netns exec ${h} sysctl -wq net.ipv4.ip_forward=1
done

ip netns add switch
ip -netns switch li set lo up
ip -netns switch link add br0 type bridge stp 0
ip -netns switch link set br0 up

for n in 1 2 3; do
        ip -netns switch link add eth-sw type veth peer name eth-h${n}
        ip -netns switch li set eth-h${n} master br0 up
        ip -netns switch li set eth-sw netns host${n} name eth0
done

ip -netns host1 addr add 192.168.252.209/24 dev eth0
ip -netns host1 link set dev eth0 up
ip -netns host1 route add 192.168.247.0/24 \
        nexthop via 192.168.252.250 dev eth0 nexthop via 192.168.252.252 dev eth0

ip -netns host2 addr add 192.168.252.250/24 dev eth0
ip -netns host2 link set dev eth0 up

ip -netns host2 addr add 192.168.252.252/24 dev eth0
ip -netns host3 link set dev eth0 up

ip netns add tunnel
ip -netns tunnel li set lo up
ip -netns tunnel li add br0 type bridge
ip -netns tunnel li set br0 up
for n in $(seq 11 20); do
        ip -netns tunnel addr add dev br0 192.168.247.${n}/24
done

for n in 2 3
do
        ip -netns tunnel link add vti${n} type veth peer name eth${n}
        ip -netns tunnel link set eth${n} mtu 1360 master br0 up
        ip -netns tunnel link set vti${n} netns host${n} mtu 1360 up
        ip -netns host${n} addr add dev vti${n} 192.168.247.${n}/24
done
ip -netns tunnel ro add default nexthop via 192.168.247.2 nexthop via 192.168.247.3

ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.11
ip netns exec host1 ping -M do -s 1400 -c3 -I 192.168.252.209 192.168.247.15
ip -netns host1 ro ls cache

Before this patch the cache always shows exceptions against the first
leg in the multipath route; 192.168.252.250 per this example. Since the
hash has an initial random seed, you may need to vary the final octet
more than what is listed. In my tests, using addresses between 11 and 19
usually found 1 that used both legs.

With this patch, the cache will have exceptions for both legs.

Fixes: 4895c771c7 ("ipv4: Add FIB nexthop exceptions")
Reported-by: Kfir Itzhak <mastertheknife@gmail.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-15 15:44:09 -07:00
Lu Wei
2e5117ba9f net: tipc: kerneldoc fixes
Fix parameter description of tipc_link_bc_create()

Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: 16ad3f4022 ("tipc: introduce variable window congestion control")
Signed-off-by: Lu Wei <luwei32@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-15 13:33:04 -07:00
Jakub Kicinski
9a27a33027 ethtool: add standard pause stats
Currently drivers have to report their pause frames statistics
via ethtool -S, and there is a wide variety of names used for
these statistics.

Add the two statistics defined in IEEE 802.3x to the standard
API. Create a new ethtool request header flag for including
statistics in the response to GET commands.

Always create the ETHTOOL_A_PAUSE_STATS nest in replies when
flag is set. Testing if driver declares the op is not a reliable
way of checking if any stats will actually be included and therefore
we don't want to give the impression that presence of
ETHTOOL_A_PAUSE_STATS indicates driver support.

Note that this patch does not include PFC counters, which may fit
better in dcbnl? But mostly I don't need them/have a setup to test
them so I haven't looked deeply into exposing them :)

v3:
 - add a helper for "uninitializing" stats, rather than a cryptic
   memset() (Andrew)

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-15 13:26:28 -07:00
Alexandra Winter
d05e8e68b0 bridge: Add SWITCHDEV_FDB_FLUSH_TO_BRIDGE notifier
so the switchdev can notifiy the bridge to flush non-permanent fdb entries
for this port. This is useful whenever the hardware fdb of the switchdev
is reset, but the netdev and the bridgeport are not deleted.

Note that this has the same effect as the IFLA_BRPORT_FLUSH attribute.

CC: Jiri Pirko <jiri@resnulli.us>
CC: Ivan Vecera <ivecera@redhat.com>
CC: Roopa Prabhu <roopa@nvidia.com>
CC: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: Alexandra Winter <wintera@linux.ibm.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-15 13:21:47 -07:00
Linus Lüssing
2369e82704 batman-adv: mcast: fix duplicate mcast packets from BLA backbone to mesh
Scenario:
* Multicast frame send from BLA backbone gateways (multiple nodes
  with their bat0 bridged together, with BLA enabled) sharing the same
  LAN to nodes in the mesh

Issue:
* Nodes receive the frame multiple times on bat0 from the mesh,
  once from each foreign BLA backbone gateway which shares the same LAN
  with another

For multicast frames via batman-adv broadcast packets coming from the
same BLA backbone but from different backbone gateways duplicates are
currently detected via a CRC history of previously received packets.

However this CRC so far was not performed for multicast frames received
via batman-adv unicast packets. Fixing this by appyling the same check
for such packets, too.

Room for improvements in the future: Ideally we would introduce the
possibility to not only claim a client, but a complete originator, too.
This would allow us to only send a multicast-in-unicast packet from a BLA
backbone gateway claiming the node and by that avoid potential redundant
transmissions in the first place.

Fixes: 279e89b228 ("batman-adv: add broadcast duplicate check")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-09-15 10:05:24 +02:00
Linus Lüssing
74c09b7275 batman-adv: mcast: fix duplicate mcast packets in BLA backbone from mesh
Scenario:
* Multicast frame send from mesh to a BLA backbone (multiple nodes
  with their bat0 bridged together, with BLA enabled)

Issue:
* BLA backbone nodes receive the frame multiple times on bat0,
  once from mesh->bat0 and once from each backbone_gw from LAN

For unicast, a node will send only to the best backbone gateway
according to the TQ. However for multicast we currently cannot determine
if multiple destination nodes share the same backbone if they don't share
the same backbone with us. So we need to keep sending the unicasts to
all backbone gateways and let the backbone gateways decide which one
will forward the frame. We can use the CLAIM mechanism to make this
decision.

One catch: The batman-adv gateway feature for DHCP packets potentially
sends multicast packets in the same batman-adv unicast header as the
multicast optimizations code. And we are not allowed to drop those even
if we did not claim the source address of the sender, as for such
packets there is only this one multicast-in-unicast packet.

How can we distinguish the two cases?

The gateway feature uses a batman-adv unicast 4 address header. While
the multicast-to-unicasts feature uses a simple, 3 address batman-adv
unicast header. So let's use this to distinguish.

Fixes: fe2da6ff27 ("batman-adv: check incoming packet type for bla")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-09-15 10:05:24 +02:00
Linus Lüssing
3236d215ad batman-adv: mcast: fix duplicate mcast packets in BLA backbone from LAN
Scenario:
* Multicast frame send from a BLA backbone (multiple nodes with
  their bat0 bridged together, with BLA enabled)

Issue:
* BLA backbone nodes receive the frame multiple times on bat0

For multicast frames received via batman-adv broadcast packets the
originator of the broadcast packet is checked before decapsulating and
forwarding the frame to bat0 (batadv_bla_is_backbone_gw()->
batadv_recv_bcast_packet()). If it came from a node which shares the
same BLA backbone with us then it is not forwarded to bat0 to avoid a
loop.

When sending a multicast frame in a non-4-address batman-adv unicast
packet we are currently missing this check - and cannot do so because
the batman-adv unicast packet has no originator address field.

However, we can simply fix this on the sender side by only sending the
multicast frame via unicasts to interested nodes which do not share the
same BLA backbone with us. This also nicely avoids some unnecessary
transmissions on mesh side.

Note that no infinite loop was observed, probably because of dropping
via batadv_interface_tx()->batadv_bla_tx(). However the duplicates still
utterly confuse switches/bridges, ICMPv6 duplicate address detection and
neighbor discovery and therefore leads to long delays before being able
to establish TCP connections, for instance. And it also leads to the Linux
bridge printing messages like:
"br-lan: received packet on eth1 with own address as source address ..."

Fixes: 2d3f6ccc4e ("batman-adv: Modified forwarding behaviour for multicast packets")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-09-15 10:05:24 +02:00
Magnus Karlsson
bf74a370eb xsk: Fix refcount warning in xp_dma_map
Fix a potential refcount warning that a zero value is increased to one
in xp_dma_map, by initializing the refcount to one to start with,
instead of zero plus a refcount_inc().

Fixes: 921b68692a ("xsk: Enable sharing of dma mappings")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/1600095036-23868-1-git-send-email-magnus.karlsson@gmail.com
2020-09-14 18:43:25 -07:00
Björn Töpel
2b1667e54c xsk: Fix number of pinned pages/umem size discrepancy
For AF_XDP sockets, there was a discrepancy between the number of of
pinned pages and the size of the umem region.

The size of the umem region is used to validate the AF_XDP descriptor
addresses. The logic that pinned the pages covered by the region only
took whole pages into consideration, creating a mismatch between the
size and pinned pages. A user could then pass AF_XDP addresses outside
the range of pinned pages, but still within the size of the region,
crashing the kernel.

This change correctly calculates the number of pages to be
pinned. Further, the size check for the aligned mode is
simplified. Now the code simply checks if the size is divisible by the
chunk size.

Fixes: bbff2f321a ("xsk: new descriptor addressing scheme")
Reported-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Ciara Loftus <ciara.loftus@intel.com>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200910075609.7904-1-bjorn.topel@gmail.com
2020-09-14 18:35:09 -07:00
Soheil Hassas Yeganeh
afb83012cc tcp: schedule EPOLLOUT after a partial sendmsg
For EPOLLET, applications must call sendmsg until they get EAGAIN.
Otherwise, there is no guarantee that EPOLLOUT is sent if there was
a failure upon memory allocation.

As a result on high-speed NICs, userspace observes multiple small
sendmsgs after a partial sendmsg until EAGAIN, since TCP can send
1-2 TSOs in between two sendmsg syscalls:

// One large partial send due to memory allocation failure.
sendmsg(20MB)   = 2MB
// Many small sends until EAGAIN.
sendmsg(18MB)   = 64KB
sendmsg(17.9MB) = 128KB
sendmsg(17.8MB) = 64KB
...
sendmsg(...)    = EAGAIN
// At this point, userspace can assume an EPOLLOUT.

To fix this, set the SOCK_NOSPACE on all partial sendmsg scenarios
to guarantee that we send EPOLLOUT after partial sendmsg.

After this commit userspace can assume that it will receive an EPOLLOUT
after the first partial sendmsg. This EPOLLOUT will benefit from
sk_stream_write_space() logic delaying the EPOLLOUT until significant
space is available in write queue.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 16:58:24 -07:00
Soheil Hassas Yeganeh
8ba3c9d1c6 tcp: return EPOLLOUT from tcp_poll only when notsent_bytes is half the limit
If there was any event available on the TCP socket, tcp_poll()
will be called to retrieve all the events.  In tcp_poll(), we call
sk_stream_is_writeable() which returns true as long as we are at least
one byte below notsent_lowat.  This will result in quite a few
spurious EPLLOUT and frequent tiny sendmsg() calls as a result.

Similar to sk_stream_write_space(), use __sk_stream_is_writeable
with a wake value of 1, so that we set EPOLLOUT only if half the
space is available for write.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 16:58:24 -07:00
Xin Long
8e1b3ac478 net: sched: initialize with 0 before setting erspan md->u
In fl_set_erspan_opt(), all bits of erspan md was set 1, as this
function is also used to set opt MASK. However, when setting for
md->u.index for opt VALUE, the rest bits of the union md->u will
be left 1. It would cause to fail the match of the whole md when
version is 1 and only index is set.

This patch is to fix by initializing with 0 before setting erspan
md->u.

Reported-by: Shuang Li <shuali@redhat.com>
Fixes: 79b1011cb3 ("net: sched: allow flower to match erspan options")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 16:53:38 -07:00
Xin Long
681d2cfb79 lwtunnel: only keep the available bits when setting vxlan md->gbp
As we can see from vxlan_build/parse_gbp_hdr(), when processing metadata
on vxlan rx/tx path, only dont_learn/policy_applied/policy_id fields can
be set to or parse from the packet for vxlan gbp option.

So do the mask when set it in lwtunnel, as it does in act_tunnel_key and
cls_flower.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 16:49:39 -07:00
Xin Long
13e6ce98aa net: sched: only keep the available bits when setting vxlan md->gbp
As we can see from vxlan_build/parse_gbp_hdr(), when processing metadata
on vxlan rx/tx path, only dont_learn/policy_applied/policy_id fields can
be set to or parse from the packet for vxlan gbp option.

So we'd better do the mask when set it in act_tunnel_key and cls_flower.
Otherwise, when users don't know these bits, they may configure with a
value which can never be matched.

Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 16:49:39 -07:00
Xin Long
ff48b6222e tipc: use skb_unshare() instead in tipc_buf_append()
In tipc_buf_append() it may change skb's frag_list, and it causes
problems when this skb is cloned. skb_unclone() doesn't really
make this skb's flag_list available to change.

Shuang Li has reported an use-after-free issue because of this
when creating quite a few macvlan dev over the same dev, where
the broadcast packets will be cloned and go up to the stack:

 [ ] BUG: KASAN: use-after-free in pskb_expand_head+0x86d/0xea0
 [ ] Call Trace:
 [ ]  dump_stack+0x7c/0xb0
 [ ]  print_address_description.constprop.7+0x1a/0x220
 [ ]  kasan_report.cold.10+0x37/0x7c
 [ ]  check_memory_region+0x183/0x1e0
 [ ]  pskb_expand_head+0x86d/0xea0
 [ ]  process_backlog+0x1df/0x660
 [ ]  net_rx_action+0x3b4/0xc90
 [ ]
 [ ] Allocated by task 1786:
 [ ]  kmem_cache_alloc+0xbf/0x220
 [ ]  skb_clone+0x10a/0x300
 [ ]  macvlan_broadcast+0x2f6/0x590 [macvlan]
 [ ]  macvlan_process_broadcast+0x37c/0x516 [macvlan]
 [ ]  process_one_work+0x66a/0x1060
 [ ]  worker_thread+0x87/0xb10
 [ ]
 [ ] Freed by task 3253:
 [ ]  kmem_cache_free+0x82/0x2a0
 [ ]  skb_release_data+0x2c3/0x6e0
 [ ]  kfree_skb+0x78/0x1d0
 [ ]  tipc_recvmsg+0x3be/0xa40 [tipc]

So fix it by using skb_unshare() instead, which would create a new
skb for the cloned frag and it'll be safe to change its frag_list.
The similar things were also done in sctp_make_reassembled_event(),
which is using skb_copy().

Reported-by: Shuang Li <shuali@redhat.com>
Fixes: 37e22164a8 ("tipc: rename and move message reassembly function")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 16:39:58 -07:00
Peilin Ye
bb3a420d47 tipc: Fix memory leak in tipc_group_create_member()
tipc_group_add_to_tree() returns silently if `key` matches `nkey` of an
existing node, causing tipc_group_create_member() to leak memory. Let
tipc_group_add_to_tree() return an error in such a case, so that
tipc_group_create_member() can handle it properly.

Fixes: 75da2163db ("tipc: introduce communication groups")
Reported-and-tested-by: syzbot+f95d90c454864b3b5bc9@syzkaller.appspotmail.com
Cc: Hillf Danton <hdanton@sina.com>
Link: https://syzkaller.appspot.com/bug?id=048390604fe1b60df34150265479202f10e13aff
Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 16:36:20 -07:00
Vladimir Oltean
b14a9fc452 __netif_receive_skb_core: don't untag vlan from skb on DSA master
A DSA master interface has upper network devices, each representing an
Ethernet switch port attached to it. Demultiplexing the source ports and
setting skb->dev accordingly is done through the catch-all ETH_P_XDSA
packet_type handler. Catch-all because DSA vendors have various header
implementations, which can be placed anywhere in the frame: before the
DMAC, before the EtherType, before the FCS, etc. So, the ETH_P_XDSA
handler acts like an rx_handler more than anything.

It is unlikely for the DSA master interface to have any other upper than
the DSA switch interfaces themselves. Only maybe a bridge upper*, but it
is very likely that the DSA master will have no 8021q upper. So
__netif_receive_skb_core() will try to untag the VLAN, despite the fact
that the DSA switch interface might have an 8021q upper. So the skb will
never reach that.

So far, this hasn't been a problem because most of the possible
placements of the DSA switch header mentioned in the first paragraph
will displace the VLAN header when the DSA master receives the frame, so
__netif_receive_skb_core() will not actually execute any VLAN-specific
code for it. This only becomes a problem when the DSA switch header does
not displace the VLAN header (for example with a tail tag).

What the patch does is it bypasses the untagging of the skb when there
is a DSA switch attached to this net device. So, DSA is the only
packet_type handler which requires seeing the VLAN header. Once skb->dev
will be changed, __netif_receive_skb_core() will be invoked again and
untagging, or delivery to an 8021q upper, will happen in the RX of the
DSA switch interface itself.

*see commit 9eb8eff0cf ("net: bridge: allow enslaving some DSA master
network devices". This is actually the reason why I prefer keeping DSA
as a packet_type handler of ETH_P_XDSA rather than converting to an
rx_handler. Currently the rx_handler code doesn't support chaining, and
this is a problem because a DSA master might be bridged.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 16:34:18 -07:00
David Ahern
1869e226a7 ipv4: Initialize flowi4_multipath_hash in data path
flowi4_multipath_hash was added by the commit referenced below for
tunnels. Unfortunately, the patch did not initialize the new field
for several fast path lookups that do not initialize the entire flow
struct to 0. Fix those locations. Currently, flowi4_multipath_hash
is random garbage and affects the hash value computed by
fib_multipath_hash for multipath selection.

Fixes: 24ba14406c ("route: Add multipath_hash in flowi_common to make user-define hash")
Signed-off-by: David Ahern <dsahern@gmail.com>
Cc: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 14:54:56 -07:00
Paolo Abeni
2de79ee27f net: try to avoid unneeded backlog flush
flush_all_backlogs() may cause deadlock on systems
running processes with FIFO scheduling policy.

The above is critical in -RT scenarios, where user-space
specifically ensure no network activity is scheduled on
the CPU running the mentioned FIFO process, but still get
stuck.

This commit tries to address the problem checking the
backlog status on the remote CPUs before scheduling the
flush operation. If the backlog is empty, we can skip it.

v1 -> v2:
 - explicitly clear flushed cpu mask - Eric

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 14:39:00 -07:00
David S. Miller
e0d9ae699e RxRPC development fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl9fipcACgkQ+7dXa6fL
 C2sNIhAAjnqKckjbLtzy2ZhO3nyEMlABYtGcDi8a1x3H42Ncsqca5GiKjjY54n90
 rLe2iyX/5ncURjrkVVUJFTlkhQrha40dOp/DYHHbwj4ko9P625QrsPn0h5zo/Ben
 UUeOVqibyAOoqXWCqhRgLF1BhPmg/22TtHiqbcRul+nss9vcjuFcjOEIhNVZDUfu
 VPjeitxF9Tuz9FEH00UJs23LWONBCvNWDtCjAj/hf328Mk+TptSiFPTNVEuPrbje
 1IbBy3PjBzeL2CFtp0OQs3uibAz+7C9IY4i53tdBPQNE5uW1FE/Wm7ixK3Oseq8X
 hkAP3phNG669tZzE+49g0X1AfqHJr9F0dGbdIqOYC4seyC6NXROuvnzX3HdV7gYd
 MwCyIcjWxw2B6dhjk2sDncFjr7Tima6KRvWHsf8cEk645gbMEltvNxJi1KCK/sj/
 wpiiQrPZZ82e+RfIfQ5l5cuMEROceZ1LpUKRK5rc4Gc49xuFbanoOYh4iBChmABb
 ULKVRHb/HFRIY9Y8boxw+0iDzDYQugoH6IsEEBdH87UBonEfPaJpcRTljcFU4LVh
 ppNeOXFu0p+CQwDaLDhTILDVoFDjMfVAjOC42TMfiTLEarWz5cpRPu96tOerpSgk
 Ulmh6m2cGNYDOIuCdVyRJFf5F9+Mj3VIBygven4GuWUqkZ18ooc=
 =0qvR
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-next-20200914' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fixes for the connection manager rewrite

Here are some fixes for the connection manager rewrite:

 (1) Fix a goto to the wrong place in error handling.

 (2) Fix a missing NULL pointer check.

 (3) The stored allocation error needs to be stored signed.

 (4) Fix a leak of connection bundle when clearing connections due to
     net namespace exit.

 (5) Fix an overget of the bundle when setting up a new client conn.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 14:03:38 -07:00
Eric Dumazet
0cbe6a8f08 tcp: remove SOCK_QUEUE_SHRUNK
SOCK_QUEUE_SHRUNK is currently used by TCP as a temporary state
that remembers if some room has been made in the rtx queue
by an incoming ACK packet.

This is later used from tcp_check_space() before
considering to send EPOLLOUT.

Problem is: If we receive SACK packets, and no packet
is removed from RTX queue, we can send fresh packets, thus
moving them from write queue to rtx queue and eventually
empty the write queue.

This stall can happen if TCP_NOTSENT_LOWAT is used.

With this fix, we no longer risk stalling sends while holes
are repaired, and we can fully use socket sndbuf.

This also removes a cache line dirtying for typical RPC
workloads.

Fixes: c9bee3b7fd ("tcp: TCP_NOTSENT_LOWAT socket option")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:36:00 -07:00
Xie He
b4c5881446 net/packet: Fix a comment about hard_header_len and headroom allocation
This comment is outdated and no longer reflects the actual implementation
of af_packet.c.

Reasons for the new comment:

1.

In af_packet.c, the function packet_snd first reserves a headroom of
length (dev->hard_header_len + dev->needed_headroom).
Then if the socket is a SOCK_DGRAM socket, it calls dev_hard_header,
which calls dev->header_ops->create, to create the link layer header.
If the socket is a SOCK_RAW socket, it "un-reserves" a headroom of
length (dev->hard_header_len), and checks if the user has provided a
header sized between (dev->min_header_len) and (dev->hard_header_len)
(in dev_validate_header).
This shows the developers of af_packet.c expect hard_header_len to
be consistent with header_ops.

2.

In af_packet.c, the function packet_sendmsg_spkt has a FIXME comment.
That comment states that prepending an LL header internally in a driver
is considered a bug. I believe this bug can be fixed by setting
hard_header_len to 0, making the internal header completely invisible
to af_packet.c (and requesting the headroom in needed_headroom instead).

3.

There is a commit for a WiFi driver:
commit 9454f7a895 ("mwifiex: set needed_headroom, not hard_header_len")
According to the discussion about it at:
  https://patchwork.kernel.org/patch/11407493/
The author tried to set the WiFi driver's hard_header_len to the Ethernet
header length, and request additional header space internally needed by
setting needed_headroom.
This means this usage is already adopted by driver developers.

Cc: Willem de Bruijn <willemdebruijn.kernel@gmail.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Brian Norris <briannorris@chromium.org>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Xie He <xie.he.0141@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:34:39 -07:00
Paolo Abeni
c76c695656 mptcp: call tcp_cleanup_rbuf on subflows
That is needed to let the subflows announce promptly when new
space is available in the receive buffer.

tcp_cleanup_rbuf() is currently a static function, drop the
scope modifier and add a declaration in the TCP header.

Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni
d5f49190de mptcp: allow picking different xmit subflows
Update the scheduler to less trivial heuristic: cache
the last used subflow, and try to send on it a reasonably
long burst of data.

When the burst or the subflow send space is exhausted, pick
the subflow with the lower ratio between write space and
send buffer - that is, the subflow with the greater relative
amount of free space.

v1 -> v2:
 - fix 32 bit build breakage due to 64bits div
 - fix checkpath issues (uint64_t -> u64)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni
4596a2c1b7 mptcp: allow creating non-backup subflows
Currently the 'backup' attribute of local endpoint
is ignored. Let's use it for the MP_JOIN handshake

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni
ef0da3b8a2 mptcp: move address attribute into mptcp_addr_info
So that can be accessed easily from the subflow creation
helper. No functional change intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni
06242e44b9 mptcp: add OoO related mibs
Add a bunch of MPTCP mibs related to MPTCP OoO data
processing.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni
04e4cd4f7c mptcp: cleanup mptcp_subflow_discard_data()
There is no need to use the tcp_read_sock(), we can
simply drop the skb. Additionally try to look at the
next buffer for in order data.

This both simplifies the code and avoid unneeded indirect
calls.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni
ab174ad8ef mptcp: move ooo skbs into msk out of order queue.
Add an RB-tree to cope with OoO (at MPTCP level) data.
__mptcp_move_skb() insert into the RB tree "future"
data, eventually coalescing skb as allowed by the
MPTCP DSN.

To simplify sequence accounting, move the DSN inside
the cb.

After successfully enqueuing in sequence data, check
if we can use any data from the RB tree.

Additionally move the data_fin check after spooling
data from the OoO tree, otherwise we could miss shutdown
events.

The RB tree code is copied as verbatim as possible
from tcp_data_queue_ofo(), with a few simplifications
due to the fact that MPTCP doesn't need to cope with
sacks. All bugs here are added by me.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni
8268ed4c9d mptcp: introduce and use mptcp_try_coalesce()
Factor-out existing code, will be re-used by the
next patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni
da51aef5fe mptcp: basic sndbuf autotuning
Let the msk sendbuf track the size of the larger subflow's
send window, so that we ensure mptcp_sendmsg() does not
exceed MPTCP-level send window.

The update is performed just before try to send any data.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni
6719331c2f mptcp: trigger msk processing even for OoO data
This is a prerequisite to allow receiving data from multiple
subflows without re-injection.

Instead of dropping the OoO - "future" data in
subflow_check_data_avail(), call into __mptcp_move_skbs()
and let the msk drop that.

To avoid code duplication factor out the mptcp_subflow_discard_data()
helper.

Note that __mptcp_move_skbs() can now find multiple subflows
with data avail (comprising to-be-discarded data), so must
update the byte counter incrementally.

v1 -> v2:
 - fix checkpatch issues (unsigned -> unsigned int)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni
47bebdf365 mptcp: set data_ready status bit in subflow_check_data_avail()
This simplify mptcp_subflow_data_available() and will
made follow-up patches simpler.

Additionally remove the unneeded checks on subflow copied_seq:
we always whole skbs out of subflows.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Paolo Abeni
63561a403c mptcp: rethink 'is writable' conditional
Currently, when checking for the 'msk is writable' condition, we
look at the individual subflows write space.
That works well while we send data via a single subflow, but will
not as soon as we will enable concurrent xmit on multiple subflows.

With this change msk becomes writable when the following conditions
hold:
- the socket has some free write space
- there is at least a subflow with write free space

Additionally we need to set the NOSPACE bit on all subflows
before blocking.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-14 13:28:02 -07:00
Sven Eckelmann
4bba9dab86 batman-adv: Add missing include for in_interrupt()
The fix for receiving (internally generated) bla packets outside the
interrupt context introduced the usage of in_interrupt(). But this
functionality is only defined in linux/preempt.h which was not included
with the same patch.

Fixes: 279e89b228 ("batman-adv: bla: use netif_rx_ni when not in interrupt context")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-09-14 22:23:24 +02:00
Daniel Winkler
53274477ac Bluetooth: pause/resume advertising around suspend
Currently, the controller will continue advertising when the system
enters suspend. This patch makes sure that all advertising instances are
paused when entering suspend, and resumed when suspend exits.

The Advertising and Suspend/Resume test suites were both run on this
change on 4.19 kernel with both hardware offloaded multi-advertising and
software rotated multi-advertising. In addition, a new test was added
that performs the following steps:
* Register 3 advertisements via bluez RegisterAdvertisement
* Verify reception of all advertisements by remote peer
* Enter suspend on DUT
* Verify failure to receive all advertisements by remote peer
* Exit suspend on DUT
* Verify reception of all advertisements by remote peer

Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-09-14 20:22:40 +02:00
David Howells
f1b4497487 rxrpc: Fix an overget of the conn bundle when setting up a client conn
When setting up a client connection, a second ref is accidentally obtained
on the connection bundle (we get one when allocating the conn and a second
one when adding the conn to the bundle).

Fix it to only use the ref obtained by rxrpc_alloc_client_connection() and
not to add a second when adding the candidate conn to the bundle.

Fixes: 245500d853 ("rxrpc: Rewrite the client connection manager")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-09-14 16:18:59 +01:00
David Howells
546a42410b rxrpc: Fix conn bundle leak in net-namespace exit
When the network namespace exits, rxrpc_clean_up_local_conns() needs to
unbundle each client connection it evicts.  Fix it to do this.

kernel BUG at net/rxrpc/conn_object.c:481!
RIP: 0010:rxrpc_destroy_all_connections.cold+0x11/0x13 net/rxrpc/conn_object.c:481
Call Trace:
 rxrpc_exit_net+0x1a4/0x2e0 net/rxrpc/net_ns.c:119
 ops_exit_list+0xb0/0x160 net/core/net_namespace.c:186
 cleanup_net+0x4ea/0xa00 net/core/net_namespace.c:603
 process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
 worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
 kthread+0x3b5/0x4a0 kernel/kthread.c:292
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:294

Fixes: 245500d853 ("rxrpc: Rewrite the client connection manager")
Reported-by: syzbot+52071f826a617b9c76ed@syzkaller.appspotmail.com
Signed-off-by: David Howells <dhowells@redhat.com>
2020-09-14 16:18:59 +01:00
David Howells
8806245a3e rxrpc: Fix rxrpc_bundle::alloc_error to be signed
The alloc_error field in the rxrpc_bundle struct should be signed as it has
negative error codes assigned to it.  Checks directly on it may then fail,
and may produce a warning like this:

	net/rxrpc/conn_client.c:662 rxrpc_wait_for_channel()
	warn: 'bundle->alloc_error' is unsigned

Fixes: 245500d853 ("rxrpc: Rewrite the client connection manager")
Reported-by Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-09-14 16:18:59 +01:00
David Howells
456b2f2dc7 rxrpc: Fix an error goto in rxrpc_connect_call()
Fix an error-handling goto in rxrpc_connect_call() whereby it will jump to
free the bundle it failed to allocate.

Fixes: 245500d853 ("rxrpc: Rewrite the client connection manager")
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2020-09-14 12:58:17 +01:00
Abhishek Pandit-Subedi
2f20216c1d Bluetooth: Emit controller suspend and resume events
Emit controller suspend and resume events when we are ready for suspend
and we've resumed from suspend.

The controller suspend event will report whatever suspend state was
successfully entered. The controller resume event will check the first
HCI event that was received after we finished preparing for suspend and,
if it was a connection event, store the address of the peer that caused
the event. If it was not a connection event, we mark the wake reason as
an unexpected event.

Here is a sample btmon trace with these events:

@ MGMT Event: Controller Suspended (0x002d) plen 1
        Suspend state: Page scanning and/or passive scanning (2)

@ MGMT Event: Controller Resumed (0x002e) plen 8
        Wake reason: Remote wake due to peer device connection (2)
        LE Address: CD:F3:CD:13:C5:9A (OUI CD-F3-CD)

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-09-13 09:55:26 +02:00
Abhishek Pandit-Subedi
f0cfc486f7 Bluetooth: Add suspend reason for device disconnect
Update device disconnect event with reason 0x5 to indicate that device
disconnected because the controller is suspending.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-09-13 09:54:28 +02:00
Abhishek Pandit-Subedi
346ce5b7d6 Bluetooth: Add mgmt suspend and resume events
Add the controller suspend and resume events, which will signal when
Bluetooth has completed preparing for suspend and when it's ready for
resume.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Miao-chen Chou <mcchou@chromium.org>
Reviewed-by: Sonny Sasaka <sonnysasaka@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-09-13 09:54:00 +02:00
Abhishek Pandit-Subedi
6baf8a6afc Bluetooth: Set ext scan response only when it exists
Only set extended scan response only when it exists. Otherwise, clear
the scan response data.

Per the core spec v5.2, Vol 4, Part E, 7.8.55

If the advertising set is non-scannable and the Host uses this command
other than to discard existing data, the Controller shall return the
error code Invalid HCI Command Parameters (0x12).

On WCN3991, the controller correctly responds with Invalid Parameters
when this is sent.  That error causes __hci_req_hci_power_on to fail
with -EINVAL and LE devices can't connect because background scanning
isn't configured.

Here is an hci trace of where this issue occurs during power on:

< HCI Command: LE Set Extended Advertising Parameters (0x08|0x0036) plen 25
        Handle: 0x00
        Properties: 0x0010
          Use legacy advertising PDUs: ADV_NONCONN_IND
        Min advertising interval: 181.250 msec (0x0122)
        Max advertising interval: 181.250 msec (0x0122)
        Channel map: 37, 38, 39 (0x07)
        Own address type: Random (0x01)
        Peer address type: Public (0x00)
        Peer address: 00:00:00:00:00:00 (OUI 00-00-00)
        Filter policy: Allow Scan Request from Any, Allow Connect...
        TX power: 127 dbm (0x7f)
        Primary PHY: LE 1M (0x01)
        Secondary max skip: 0x00
        Secondary PHY: LE 1M (0x01)
        SID: 0x00
        Scan request notifications: Disabled (0x00)
> HCI Event: Command Complete (0x0e) plen 5
      LE Set Extended Advertising Parameters (0x08|0x0036) ncmd 1
        Status: Success (0x00)
        TX power (selected): 9 dbm (0x09)
< HCI Command: LE Set Advertising Set Random Address (0x08|0x0035) plen 7
        Advertising handle: 0x00
        Advertising random address: 08:FD:55:ED:22:28 (OUI 08-FD-55)
> HCI Event: Command Complete (0x0e) plen 4
      LE Set Advertising Set Random Address (0x08|0x0035) ncmd
        Status: Success (0x00)
< HCI Command: LE Set Extended Scan Response Data (0x08|0x0038) plen 35
        Handle: 0x00
        Operation: Complete scan response data (0x03)
        Fragment preference: Minimize fragmentation (0x01)
        Data length: 0x0d
        Name (short): Chromebook
> HCI Event: Command Complete (0x0e) plen 4
      LE Set Extended Scan Response Data (0x08|0x0038) ncmd 1
        Status: Invalid HCI Command Parameters (0x12)

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-09-13 09:47:38 +02:00
Abhishek Pandit-Subedi
20ae4089d0 Bluetooth: Only mark socket zapped after unlocking
Since l2cap_sock_teardown_cb doesn't acquire the channel lock before
setting the socket as zapped, it could potentially race with
l2cap_sock_release which frees the socket. Thus, wait until the cleanup
is complete before marking the socket as zapped.

This race was reproduced on a JBL GO speaker after the remote device
rejected L2CAP connection due to resource unavailability.

Here is a dmesg log with debug logs from a repro of this bug:
[ 3465.424086] Bluetooth: hci_core.c:hci_acldata_packet() hci0 len 16 handle 0x0003 flags 0x0002
[ 3465.424090] Bluetooth: hci_conn.c:hci_conn_enter_active_mode() hcon 00000000cfedd07d mode 0
[ 3465.424094] Bluetooth: l2cap_core.c:l2cap_recv_acldata() conn 000000007eae8952 len 16 flags 0x2
[ 3465.424098] Bluetooth: l2cap_core.c:l2cap_recv_frame() len 12, cid 0x0001
[ 3465.424102] Bluetooth: l2cap_core.c:l2cap_raw_recv() conn 000000007eae8952
[ 3465.424175] Bluetooth: l2cap_core.c:l2cap_sig_channel() code 0x03 len 8 id 0x0c
[ 3465.424180] Bluetooth: l2cap_core.c:l2cap_connect_create_rsp() dcid 0x0045 scid 0x0000 result 0x02 status 0x00
[ 3465.424189] Bluetooth: l2cap_core.c:l2cap_chan_put() chan 000000006acf9bff orig refcnt 4
[ 3465.424196] Bluetooth: l2cap_core.c:l2cap_chan_del() chan 000000006acf9bff, conn 000000007eae8952, err 111, state BT_CONNECT
[ 3465.424203] Bluetooth: l2cap_sock.c:l2cap_sock_teardown_cb() chan 000000006acf9bff state BT_CONNECT
[ 3465.424221] Bluetooth: l2cap_core.c:l2cap_chan_put() chan 000000006acf9bff orig refcnt 3
[ 3465.424226] Bluetooth: hci_core.h:hci_conn_drop() hcon 00000000cfedd07d orig refcnt 6
[ 3465.424234] BUG: spinlock bad magic on CPU#2, kworker/u17:0/159
[ 3465.425626] Bluetooth: hci_sock.c:hci_sock_sendmsg() sock 000000002bb0cb64 sk 00000000a7964053
[ 3465.430330]  lock: 0xffffff804410aac0, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
[ 3465.430332] Causing a watchdog bite!

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reported-by: Balakrishna Godavarthi <bgodavar@codeaurora.org>
Reviewed-by: Manish Mandlik <mmandlik@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-09-13 09:46:00 +02:00
Vladimir Oltean
88236591ec Revert "net: dsa: Add more convenient functions for installing port VLANs"
This reverts commit 314f76d7a6.

Citing that commit message, the call graph was:

    dsa_slave_vlan_rx_add_vid   dsa_port_setup_8021q_tagging
                |                        |
                |                        |
                |          +-------------+
                |          |
                v          v
               dsa_port_vid_add      dsa_slave_port_obj_add
                      |                         |
                      +-------+         +-------+
                              |         |
                              v         v
                           dsa_port_vlan_add

Now that tag_8021q has its own ops structure, it no longer relies on
dsa_port_vid_add, and therefore on the dsa_switch_ops to install its
VLANs.

So dsa_port_vid_add now only has one single caller. So we can simplify
the call graph to what it was before, aka:

        dsa_slave_vlan_rx_add_vid     dsa_slave_port_obj_add
                      |                         |
                      +-------+         +-------+
                              |         |
                              v         v
                           dsa_port_vlan_add

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-11 17:30:43 -07:00
Vladimir Oltean
5899ee367a net: dsa: tag_8021q: add a context structure
While working on another tag_8021q driver implementation, some things
became apparent:

- It is not mandatory for a DSA driver to offload the tag_8021q VLANs by
  using the VLAN table per se. For example, it can add custom TCAM rules
  that simply encapsulate RX traffic, and redirect & decapsulate rules
  for TX traffic. For such a driver, it makes no sense to receive the
  tag_8021q configuration through the same callback as it receives the
  VLAN configuration from the bridge and the 8021q modules.

- Currently, sja1105 (the only tag_8021q user) sets a
  priv->expect_dsa_8021q variable to distinguish between the bridge
  calling, and tag_8021q calling. That can be improved, to say the
  least.

- The crosschip bridging operations are, in fact, stateful already. The
  list of crosschip_links must be kept by the caller and passed to the
  relevant tag_8021q functions.

So it would be nice if the tag_8021q configuration was more
self-contained. This patch attempts to do that.

Create a struct dsa_8021q_context which encapsulates a struct
dsa_switch, and has 2 function pointers for adding and deleting a VLAN.
These will replace the previous channel to the driver, which was through
the .port_vlan_add and .port_vlan_del callbacks of dsa_switch_ops.

Also put the list of crosschip_links into this dsa_8021q_context.
Drivers that don't support cross-chip bridging can simply omit to
initialize this list, as long as they dont call any cross-chip function.

The sja1105_vlan_add and sja1105_vlan_del functions are refactored into
a smaller sja1105_vlan_add_one, which now has 2 entry points:
- sja1105_vlan_add, from struct dsa_switch_ops
- sja1105_dsa_8021q_vlan_add, from the tag_8021q ops
But even this change is fairly trivial. It just reflects the fact that
for sja1105, the VLANs from these 2 channels end up in the same hardware
table. However that is not necessarily true in the general sense (and
that's the reason for making this change).

The rest of the patch is mostly plain refactoring of "ds" -> "ctx". The
dsa_8021q_context structure needs to be propagated because adding a VLAN
is now done through the ops function pointers inside of it.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-11 17:30:43 -07:00
Vladimir Oltean
7e092af2f3 net: dsa: tag_8021q: setup tagging via a single function call
There is no point in calling dsa_port_setup_8021q_tagging for each
individual port. Additionally, it will become more difficult to do that
when we'll have a context structure to tag_8021q (next patch). So
refactor this now.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-11 17:30:43 -07:00
Vinicius Costa Gomes
b5b73b26b3 taprio: Fix allowing too small intervals
It's possible that the user specifies an interval that couldn't allow
any packet to be transmitted. This also avoids the issue of the
hrtimer handler starving the other threads because it's running too
often.

The solution is to reject interval sizes that according to the current
link speed wouldn't allow any packet to be transmitted.

Reported-by: syzbot+8267241609ae8c23b248@syzkaller.appspotmail.com
Fixes: 5a781ccbd1 ("tc: Add support for configuring the taprio scheduler")
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-11 17:21:51 -07:00
Ido Schimmel
12913f7459 bridge: mcast: Fix incomplete MDB dump
Each MDB entry is encoded in a nested netlink attribute called
'MDBA_MDB_ENTRY'. In turn, this attribute contains another nested
attributed called 'MDBA_MDB_ENTRY_INFO', which encodes a single port
group entry within the MDB entry.

The cited commit added the ability to restart a dump from a specific
port group entry. However, on failure to add a port group entry to the
dump the entire MDB entry (stored in 'nest2') is removed, resulting in
missing port group entries.

Fix this by finalizing the MDB entry with the partial list of already
encoded port group entries.

Fixes: 5205e919c9 ("net: bridge: mcast: add support for src list and filter mode dumping")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Acked-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-11 14:49:47 -07:00
Colin Ian King
2291267e48 ipv6: remove redundant assignment to variable err
The variable err is being initialized with a value that is never read and
it is being updated later with a new value. The initialization is redundant
and can be removed.  Also re-order variable declarations in reverse
Christmas tree ordering.

Addresses-Coverity: ("Unused value")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-11 14:48:25 -07:00
Daniel Winkler
d5ea32da87 Bluetooth: Add MGMT capability flags for tx power and ext advertising
For new advertising features, it will be important for userspace to
know the capabilities of the controller and kernel. If the controller
and kernel support extended advertising, we include flags indicating
hardware offloading support and support for setting tx power of adv
instances.

In the future, vendor-specific commands may allow the setting of tx
power in advertising instances, but for now this feature is only
marked available if extended advertising is supported.

This change is manually verified in userspace by ensuring the
advertising manager's supported_flags field is updated with new flags on
hatch chromebook (ext advertising supported).

Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-09-11 09:17:26 +02:00
Daniel Winkler
87597482c6 Bluetooth: Report num supported adv instances for hw offloading
Here we make sure we properly report the number of supported
advertising slots when we are using hardware offloading. If no
hardware offloading is available, we default this value to
HCI_MAX_ADV_INSTANCES for use in software rotation as before.

This change has been tested on kukui (no ext adv) and hatch (ext adv)
chromebooks by verifying "SupportedInstances" shows 5 (the default) and
6 (slots supported by controller), respectively.

Signed-off-by: Daniel Winkler <danielwinkler@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-09-11 09:17:25 +02:00
Joseph Hwang
0fc1a726f8 Bluetooth: sco: new getsockopt options BT_SNDMTU/BT_RCVMTU
This patch defines new getsockopt options BT_SNDMTU/BT_RCVMTU
for SCO socket to be compatible with other bluetooth sockets.
These new options return the same value as option SCO_OPTIONS
which is already present on existing kernels.

Signed-off-by: Joseph Hwang <josephsih@chromium.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Reviewed-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Pali Rohár <pali@kernel.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-09-11 09:07:56 +02:00
Abhishek Pandit-Subedi
3eec158d5e Bluetooth: Re-order clearing suspend tasks
Unregister_pm_notifier is a blocking call so suspend tasks should be
cleared beforehand. Otherwise, the notifier will wait for completion
before returning (and we encounter a 2s timeout on resume).

Fixes: 0e9952804e (Bluetooth: Clear suspend tasks on unregister)
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-09-11 09:05:08 +02:00
Peilin Ye
cafd472a10 Bluetooth: Fix memory leak in read_adv_mon_features()
read_adv_mon_features() is leaking memory. Free `rp` before returning.

Fixes: e5e1e7fd47 ("Bluetooth: Add handler of MGMT_OP_READ_ADV_MONITOR_FEATURES")
Reported-and-tested-by: syzbot+f7f6e564f4202d8601c6@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=f7f6e564f4202d8601c6
Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-09-11 09:00:53 +02:00
Ard Biesheuvel
e33d2a7b30 SUNRPC: remove RC4-HMAC-MD5 support from KerberosV
The RC4-HMAC-MD5 KerberosV algorithm is based on RFC 4757 [0], which
was specifically issued for interoperability with Windows 2000, but was
never intended to receive the same level of support. The RFC says

  The IETF Kerberos community supports publishing this specification as
  an informational document in order to describe this widely
  implemented technology.  However, while these encryption types
  provide the operations necessary to implement the base Kerberos
  specification [RFC4120], they do not provide all the required
  operations in the Kerberos cryptography framework [RFC3961].  As a
  result, it is not generally possible to implement potential
  extensions to Kerberos using these encryption types.  The Kerberos
  encryption type negotiation mechanism [RFC4537] provides one approach
  for using such extensions even when a Kerberos infrastructure uses
  long-term RC4 keys.  Because this specification does not implement
  operations required by RFC 3961 and because of security concerns with
  the use of RC4 and MD4 discussed in Section 8, this specification is
  not appropriate for publication on the standards track.

  The RC4-HMAC encryption types are used to ease upgrade of existing
  Windows NT environments, provide strong cryptography (128-bit key
  lengths), and provide exportable (meet United States government
  export restriction requirements) encryption.  This document describes
  the implementation of those encryption types.

Furthermore, this RFC was re-classified as 'historic' by RFC 8429 [1] in
2018, stating that 'none of the encryption types it specifies should be
used'

Note that other outdated algorithms are left in place (some of which are
guarded by CONFIG_SUNRPC_DISABLE_INSECURE_ENCTYPES), so this should only
adversely affect interoperability with Windows NT/2000 systems that have
not received any updates since 2008 (but are connected to a network
nonetheless)

[0] https://tools.ietf.org/html/rfc4757
[1] https://tools.ietf.org/html/rfc8429

Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
Acked-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-09-11 14:39:15 +10:00
Neal Cardwell
5050bef873 tcp: Simplify tcp_set_congestion_control() load=false case
Simplify tcp_set_congestion_control() by removing the initialization
code path for the !load case.

There are only two call sites for tcp_set_congestion_control(). The
EBPF call site is the only one that passes load=false; it also passes
cap_net_admin=true. Because of that, the exact same behavior can be
achieved by removing the special if (!load) branch of the logic. Both
before and after this commit, the EBPF case will call
bpf_try_module_get(), and if that succeeds then call
tcp_reinit_congestion_control() or if that fails then return EBUSY.

Note that this returns the logic to a structure very similar to the
structure before:
  commit 91b5b21c7c ("bpf: Add support for changing congestion control")
except that the CAP_NET_ADMIN status is passed in as a function
argument.

This clean-up was suggested by Martin KaFai Lau.

Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Cc: Lawrence Brakmo <brakmo@fb.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Kevin Yang <yyd@google.com>
2020-09-10 20:53:01 -07:00
Neal Cardwell
5cdc744caa tcp: simplify _bpf_setsockopt(): Remove flags argument
Now that the previous patches have removed the code that uses the
flags argument to _bpf_setsockopt(), we can remove that argument.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Kevin Yang <yyd@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
2020-09-10 20:53:01 -07:00
Neal Cardwell
29a949325c tcp: simplify tcp_set_congestion_control(): Always reinitialize
Now that the previous patches ensure that all call sites for
tcp_set_congestion_control() want to initialize congestion control, we
can simplify tcp_set_congestion_control() by removing the reinit
argument and the code to support it.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Kevin Yang <yyd@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
2020-09-10 20:53:01 -07:00
Neal Cardwell
e7b10a4dd1 tcp: Simplify EBPF TCP_CONGESTION to always init CC
Now that the previous patch ensures we don't initialize the congestion
control twice, when EBPF sets the congestion control algorithm at
connection establishment we can simplify the code by simply
initializing the congestion control module at that time.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Kevin Yang <yyd@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
2020-09-10 20:53:01 -07:00
Neal Cardwell
8919a9b31e tcp: Only init congestion control if not initialized already
Change tcp_init_transfer() to only initialize congestion control if it
has not been initialized already.

With this new approach, we can arrange things so that if the EBPF code
sets the congestion control by calling setsockopt(TCP_CONGESTION) then
tcp_init_transfer() will not re-initialize the CC module.

This is an approach that has the following beneficial properties:

(1) This allows CC module customizations made by the EBPF called in
    tcp_init_transfer() to persist, and not be wiped out by a later
    call to tcp_init_congestion_control() in tcp_init_transfer().

(2) Does not flip the order of EBPF and CC init, to avoid causing bugs
    for existing code upstream that depends on the current order.

(3) Does not cause 2 initializations for for CC in the case where the
    EBPF called in tcp_init_transfer() wants to set the CC to a new CC
    algorithm.

(4) Allows follow-on simplifications to the code in net/core/filter.c
    and net/ipv4/tcp_cong.c, which currently both have some complexity
    to special-case CC initialization to avoid double CC
    initialization if EBPF sets the CC.

Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Kevin Yang <yyd@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
2020-09-10 20:53:01 -07:00
Karsten Graul
22ef473dbd net/smc: use separate work queues for different worker types
There are 6 types of workers which exist per smc connection. 3 of them
are used for listen and handshake processing, another 2 are used for
close and abort processing and 1 is the tx worker that moves calls to
sleeping functions into a worker.
To prevent flooding of the system work queue when many connections are
opened or closed at the same time (some pattern uperf implements), move
those workers to one of 3 smc-specific work queues. Two work queues are
module-global and used for handshake and close workers. The third work
queue is defined per link group and used by the tx workers that may
sleep waiting for resources of this link group.
And in smc_llc_enqueue() queue the llc_event_work work to the system
prio work queue because its critical that this work is started fast.

Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 15:24:27 -07:00
Guvenc Gulce
8418cb4065 net/smc: use the retry mechanism for netlink messages
When the netlink messages to be sent to the userspace
are too big for a single netlink message, send them in
chunks using the netlink_dump infrastructure. Modify the
smc diag dump code so that it can signal to the netlink_dump
infrastructure that it needs to send more data.

Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 15:24:27 -07:00
Ursula Braun
f9aab6f2ce net/smc: immediate freeing in smc_lgr_cleanup_early()
smc_lgr_cleanup_early() schedules the free worker with delay. DMB
unregistering occurs in this delayed worker increasing the risk
to reach the SMCD SBA limit without need. Terminate the
linkgroup immediately, since termination means early DMB unregistering.

For SMCD the global smc_server_lgr_pending lock is given up early.
A linkgroup to be given up with smc_lgr_cleanup_early() may already
contain more than one connection. Using __smc_lgr_terminate() in
smc_lgr_cleanup_early() covers this.

And consolidate smc_ism_put_vlan() and smc_put_device() into smc_lgr_free()
only.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 15:24:26 -07:00
Ursula Braun
0c881ada3d net/smc: reduce smc_listen_decline() calls
smc_listen_work() contains already an smc_listen_decline() exit.
Use this exit for smc_listen_rdma_finish() problems as well.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 15:24:26 -07:00
Ursula Braun
7b2977d083 net/smc: improve server ISM device determination
Move check whether peer can be reached into smc_pnet_find_ism_by_pnetid().
Thus searching continues for another ism device, if check fails.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 15:24:26 -07:00
Ursula Braun
3d9725a6a1 net/smc: common routine for CLC accept and confirm
smc_clc_send_accept() and smc_clc_send_confirm() are quite similar.
Move common code into a separate function smc_clc_send_confirm_accept().
And introduce separate SMCD and SMCR struct definitions for CLC accept
resp. confirm.
No functional change.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 15:24:26 -07:00
Ursula Braun
6bb14e48ee net/smc: dynamic allocation of CLC proposal buffer
Reduce stack size for smc_listen_work() and smc_clc_send_proposal()
by dynamic allocation of the CLC buffer to be received or sent.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 15:24:26 -07:00
Ursula Braun
5ac54d8768 net/smc: introduce better field names
Field names "srv_first_contact" and "cln_first_contact" are misleading,
since they apply to both, server and client. Rename them to
"first_contact_peer" and "first_contact_local".
Rename "ism_gid" by the more precise name "ism_peer_gid".
Rename version constant "SMC_CLC_V1" into "SMC_V1".
No functional change.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 15:24:26 -07:00
Ursula Braun
a60a2b1e0a net/smc: reduce active tcp_listen workers
SMC starts a separate tcp_listen worker for every SMC socket in
state SMC_LISTEN, and can accept an incoming connection request only,
if this worker is really running and waiting in kernel_accept(). But
the number of running workers is limited.
This patch reworks the listening SMC code and starts a tcp_listen worker
after the SYN-ACK handshake on the internal clc-socket only.

Suggested-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 15:24:26 -07:00
Petr Machata
297e77e53e net: DCB: Validate DCB_ATTR_DCB_BUFFER argument
The parameter passed via DCB_ATTR_DCB_BUFFER is a struct dcbnl_buffer. The
field prio2buffer is an array of IEEE_8021Q_MAX_PRIORITIES bytes, where
each value is a number of a buffer to direct that priority's traffic to.
That value is however never validated to lie within the bounds set by
DCBX_MAX_BUFFERS. The only driver that currently implements the callback is
mlx5 (maintainers CCd), and that does not do any validation either, in
particual allowing incorrect configuration if the prio2buffer value does
not fit into 4 bits.

Instead of offloading the need to validate the buffer index to drivers, do
it right there in core, and bounce the request if the value is too large.

CC: Parav Pandit <parav@nvidia.com>
CC: Saeed Mahameed <saeedm@nvidia.com>
Fixes: e549f6f9c0 ("net/dcb: Add dcbnl buffer attribute")
Signed-off-by: Petr Machata <petrm@nvidia.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 15:09:08 -07:00
Ido Schimmel
e1b9efe6ba net: Fix bridge enslavement failure
When a netdev is enslaved to a bridge, its parent identifier is queried.
This is done so that packets that were already forwarded in hardware
will not be forwarded again by the bridge device between netdevs
belonging to the same hardware instance.

The operation fails when the netdev is an upper of netdevs with
different parent identifiers.

Instead of failing the enslavement, have dev_get_port_parent_id() return
'-EOPNOTSUPP' which will signal the bridge to skip the query operation.
Other callers of the function are not affected by this change.

Fixes: 7e1146e8c1 ("net: devlink: introduce devlink_compat_switch_id_get() helper")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reported-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Nikolay Aleksandrov <nikolay@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 15:06:48 -07:00
Miaohe Lin
1be107de2e net: Correct the comment of dst_dev_put()
Since commit 8d7017fd62 ("blackhole_netdev: use blackhole_netdev to
invalidate dst entries"), we use blackhole_netdev to invalidate dst entries
instead of loopback device anymore.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:28:57 -07:00
Wei Wang
ac8f1710c1 tcp: reflect tos value received in SYN to the socket
This commit adds a new TCP feature to reflect the tos value received in
SYN, and send it out on the SYN-ACK, and eventually set the tos value of
the established socket with this reflected tos value. This provides a
way to set the traffic class/QoS level for all traffic in the same
connection to be the same as the incoming SYN request. It could be
useful in data centers to provide equivalent QoS according to the
incoming request.
This feature is guarded by /proc/sys/net/ipv4/tcp_reflect_tos, and is by
default turned off.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:15:40 -07:00
Wei Wang
de033b7d15 ip: pass tos into ip_build_and_send_pkt()
This commit adds tos as a new passed in parameter to
ip_build_and_send_pkt() which will be used in the later commit.
This is a pure restructure and does not have any functional change.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:15:40 -07:00
Wei Wang
e9b12edc13 tcp: record received TOS value in the request socket
A new field is added to the request sock to record the TOS value
received on the listening socket during 3WHS:
When not under syn flood, it is recording the TOS value sent in SYN.
When under syn flood, it is recording the TOS value sent in the ACK.
This is a preparation patch in order to do TOS reflection in the later
commit.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:15:40 -07:00
Jakub Kicinski
5251ef8299 net: make sure napi_list is safe for RCU traversal
netpoll needs to traverse dev->napi_list under RCU, make
sure it uses the right iterator and that removal from this
list is handled safely.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:08:46 -07:00
Jakub Kicinski
4d092dd204 net: manage napi add/del idempotence explicitly
To RCUify napi->dev_list we need to replace list_del_init()
with list_del_rcu(). There is no _init() version for RCU for
obvious reasons. Up until now netif_napi_del() was idempotent
so to make sure it remains such add a bit which is set when
NAPI is listed, and cleared when it removed. Since we don't
expect multiple calls to netif_napi_add() to be correct,
add a warning on that side.

Now that napi_hash_add / napi_hash_del are only called by
napi_add / del we can actually steal its bit. We just need
to make sure hash node is initialized correctly.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:08:46 -07:00
Jakub Kicinski
5198d545db net: remove napi_hash_del() from driver-facing API
We allow drivers to call napi_hash_del() before calling
netif_napi_del() to batch RCU grace periods. This makes
the API asymmetric and leaks internal implementation details.
Soon we will want the grace period to protect more than just
the NAPI hash table.

Restructure the API and have drivers call a new function -
__netif_napi_del() if they want to take care of RCU waits.

Note that only core was checking the return status from
napi_hash_del() so the new helper does not report if the
NAPI was actually deleted.

Some notes on driver oddness:
 - veth observed the grace period before calling netif_napi_del()
   but that should not matter
 - myri10ge observed normal RCU flavor
 - bnx2x and enic did not actually observe the grace period
   (unless they did so implicitly)
 - virtio_net and enic only unhashed Rx NAPIs

The last two points seem to indicate that the calls to
napi_hash_del() were a left over rather than an optimization.
Regardless, it's easy enough to correct them.

This patch may introduce extra synchronize_net() calls for
interfaces which set NAPI_STATE_NO_BUSY_POLL and depend on
free_netdev() to call netif_napi_del(). This seems inevitable
since we want to use RCU for netpoll dev->napi_list traversal,
and almost no drivers set IFF_DISABLE_NETPOLL.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 13:08:46 -07:00
Jakub Kicinski
3ea87ca772 devlink: don't crash if netdev is NULL
Following change will add support for a corner case where
we may not have a netdev to pass to devlink_port_type_eth_set()
but we still want to set port type.

This is definitely a corner case, and drivers should not normally
pass NULL netdev - print a warning message when this happens.

Sadly for other port types (ib) switches don't have a device
reference, the way we always do for Ethernet, so we can't put
the warning in __devlink_port_type_set().

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 12:49:00 -07:00
Yunsheng Lin
2fb541c862 net: sch_generic: aviod concurrent reset and enqueue op for lockless qdisc
Currently there is concurrent reset and enqueue operation for the
same lockless qdisc when there is no lock to synchronize the
q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in
qdisc_deactivate() called by dev_deactivate_queue(), which may cause
out-of-bounds access for priv->ring[] in hns3 driver if user has
requested a smaller queue num when __dev_xmit_skb() still enqueue a
skb with a larger queue_mapping after the corresponding qdisc is
reset, and call hns3_nic_net_xmit() with that skb later.

Reused the existing synchronize_net() in dev_deactivate_many() to
make sure skb with larger queue_mapping enqueued to old qdisc(which
is saved in dev_queue->qdisc_sleeping) will always be reset when
dev_reset_queue() is called.

Fixes: 6b3ba9146f ("net: sched: allow qdiscs to handle locking")
Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 12:38:26 -07:00
Lorenz Bauer
0365351524 net: Allow iterating sockmap and sockhash
Add bpf_iter support for sockmap / sockhash, based on the bpf_sk_storage and
hashtable implementation. sockmap and sockhash share the same iteration
context: a pointer to an arbitrary key and a pointer to a socket. Both
pointers may be NULL, and so BPF has to perform a NULL check before accessing
them. Technically it's not possible for sockhash iteration to yield a NULL
socket, but we ignore this to be able to use a single iteration point.

Iteration will visit all keys that remain unmodified during the lifetime of
the iterator. It may or may not visit newly added ones.

Switch from using rcu_dereference_raw to plain rcu_dereference, so we gain
another guard rail if CONFIG_PROVE_RCU is enabled.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200909162712.221874-3-lmb@cloudflare.com
2020-09-10 12:31:55 -07:00
Lorenz Bauer
654785a1af net: sockmap: Remove unnecessary sk_fullsock checks
The lookup paths for sockmap and sockhash currently include a check
that returns NULL if the socket we just found is not a full socket.
However, this check is not necessary. On insertion we ensure that
we have a full socket (caveat around sock_ops), so request sockets
are not a problem. Time-wait sockets are allocated separate from
the original socket and then fed into the hashdance. They don't
affect the sockets already stored in the sockmap.

Suggested-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200909162712.221874-2-lmb@cloudflare.com
2020-09-10 12:31:55 -07:00
Geliang Tang
f612eb76f3 mptcp: fix kmalloc flag in mptcp_pm_nl_get_local_id
mptcp_pm_nl_get_local_id may be called in interrupt context, so we need to
use GFP_ATOMIC flag to allocate memory to avoid sleeping in atomic context.

[  280.209809] BUG: sleeping function called from invalid context at mm/slab.h:498
[  280.209812] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 1680, name: kworker/1:3
[  280.209814] INFO: lockdep is turned off.
[  280.209816] CPU: 1 PID: 1680 Comm: kworker/1:3 Tainted: G        W         5.9.0-rc3-mptcp+ #146
[  280.209818] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  280.209820] Workqueue: events mptcp_worker
[  280.209822] Call Trace:
[  280.209824]  <IRQ>
[  280.209826]  dump_stack+0x77/0xa0
[  280.209829]  ___might_sleep.cold+0xa6/0xb6
[  280.209832]  kmem_cache_alloc_trace+0x1d1/0x290
[  280.209835]  mptcp_pm_nl_get_local_id+0x23c/0x410
[  280.209840]  subflow_init_req+0x1e9/0x2ea
[  280.209843]  ? inet_reqsk_alloc+0x1c/0x120
[  280.209845]  ? kmem_cache_alloc+0x264/0x290
[  280.209849]  tcp_conn_request+0x303/0xae0
[  280.209854]  ? printk+0x53/0x6a
[  280.209857]  ? tcp_rcv_state_process+0x28f/0x1374
[  280.209859]  tcp_rcv_state_process+0x28f/0x1374
[  280.209864]  ? tcp_v4_do_rcv+0xb3/0x1f0
[  280.209866]  tcp_v4_do_rcv+0xb3/0x1f0
[  280.209869]  tcp_v4_rcv+0xed6/0xfa0
[  280.209873]  ip_protocol_deliver_rcu+0x28/0x270
[  280.209875]  ip_local_deliver_finish+0x89/0x120
[  280.209877]  ip_local_deliver+0x180/0x220
[  280.209881]  ip_rcv+0x166/0x210
[  280.209885]  __netif_receive_skb_one_core+0x82/0x90
[  280.209888]  process_backlog+0xd6/0x230
[  280.209891]  net_rx_action+0x13a/0x410
[  280.209895]  __do_softirq+0xcf/0x468
[  280.209899]  asm_call_on_stack+0x12/0x20
[  280.209901]  </IRQ>
[  280.209903]  ? ip_finish_output2+0x240/0x9a0
[  280.209906]  do_softirq_own_stack+0x4d/0x60
[  280.209908]  do_softirq.part.0+0x2b/0x60
[  280.209911]  __local_bh_enable_ip+0x9a/0xa0
[  280.209913]  ip_finish_output2+0x264/0x9a0
[  280.209916]  ? rcu_read_lock_held+0x4d/0x60
[  280.209920]  ? ip_output+0x7a/0x250
[  280.209922]  ip_output+0x7a/0x250
[  280.209925]  ? __ip_finish_output+0x330/0x330
[  280.209928]  __ip_queue_xmit+0x1dc/0x5a0
[  280.209931]  __tcp_transmit_skb+0xa0f/0xc70
[  280.209937]  tcp_connect+0xb03/0xff0
[  280.209939]  ? lockdep_hardirqs_on_prepare+0xe7/0x190
[  280.209942]  ? ktime_get_with_offset+0x125/0x150
[  280.209944]  ? trace_hardirqs_on+0x1c/0xe0
[  280.209948]  tcp_v4_connect+0x449/0x550
[  280.209953]  __inet_stream_connect+0xbb/0x320
[  280.209955]  ? mark_held_locks+0x49/0x70
[  280.209958]  ? lockdep_hardirqs_on_prepare+0xe7/0x190
[  280.209960]  ? __local_bh_enable_ip+0x6b/0xa0
[  280.209963]  inet_stream_connect+0x32/0x50
[  280.209966]  __mptcp_subflow_connect+0x1fd/0x242
[  280.209972]  mptcp_pm_create_subflow_or_signal_addr+0x2db/0x600
[  280.209975]  mptcp_worker+0x543/0x7a0
[  280.209980]  process_one_work+0x26d/0x5b0
[  280.209984]  ? process_one_work+0x5b0/0x5b0
[  280.209987]  worker_thread+0x48/0x3d0
[  280.209990]  ? process_one_work+0x5b0/0x5b0
[  280.209993]  kthread+0x117/0x150
[  280.209996]  ? kthread_park+0x80/0x80
[  280.209998]  ret_from_fork+0x22/0x30

Fixes: 01cacb00b3 ("mptcp: add netlink-based PM")
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 12:30:03 -07:00
Geliang Tang
2ff0e566fa mptcp: fix subflow's remote_id issues
This patch set the init remote_id to zero, otherwise it will be a random
number.

Then it added the missing subflow's remote_id setting code both in
__mptcp_subflow_connect and in subflow_ulp_clone.

Fixes: 01cacb00b3 ("mptcp: add netlink-based PM")
Fixes: ec3edaa7ca ("mptcp: Add handling of outgoing MP_JOIN requests")
Fixes: f296234c98 ("mptcp: Add handling of incoming MP_JOIN requests")
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 12:29:15 -07:00
Geliang Tang
57025817ea mptcp: fix subflow's local_id issues
In mptcp_pm_nl_get_local_id, skc_local is the same as msk_local, so it
always return 0. Thus every subflow's local_id is 0. It's incorrect.

This patch fixed this issue.

Also, we need to ignore the zero address here, like 0.0.0.0 in IPv4. When
we use the zero address as a local address, it means that we can use any
one of the local addresses. The zero address is not a new address, we don't
need to add it to PM, so this patch added a new function address_zero to
check whether an address is the zero address, if it is, we ignore this
address.

Fixes: 01cacb00b3 ("mptcp: add netlink-based PM")
Signed-off-by: Geliang Tang <geliangtang@gmail.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 12:29:15 -07:00
Paul Davey
bb82067c57 ipmr: Use full VIF ID in netlink cache reports
Insert the full 16 bit VIF ID into ipmr Netlink cache reports.

The VIF_ID attribute has 32 bits of space so can store the full VIF ID
extracted from the high and low byte fields in the igmpmsg.

Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 12:25:51 -07:00
Paul Davey
c8715a8e9f ipmr: Add high byte of VIF ID to igmpmsg
Use the unused3 byte in struct igmpmsg to hold the high 8 bits of the
VIF ID.

If using more than 255 IPv4 multicast interfaces it is necessary to have
access to a VIF ID for cache reports that is wider than 8 bits, the VIF
ID present in the igmpmsg reports sent to mroute_sk was only 8 bits wide
in the igmpmsg header.  Adding the high 8 bits of the 16 bit VIF ID in
the unused byte allows use of more than 255 IPv4 multicast interfaces.

Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 12:25:51 -07:00
Paul Davey
501cb00890 ipmr: Add route table ID to netlink cache reports
Insert the multicast route table ID as a Netlink attribute to Netlink
cache report notifications.

When multiple route tables are in use it is necessary to have a way to
determine which route table a given cache report belongs to when
receiving the cache report.

Signed-off-by: Paul Davey <paul.davey@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 12:25:51 -07:00
Tetsuo Handa
a4b5cc9e10 tipc: fix shutdown() of connection oriented socket
I confirmed that the problem fixed by commit 2a63866c8b ("tipc: fix
shutdown() of connectionless socket") also applies to stream socket.

----------
#include <sys/socket.h>
#include <unistd.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
        int fds[2] = { -1, -1 };
        socketpair(PF_TIPC, SOCK_STREAM /* or SOCK_DGRAM */, 0, fds);
        if (fork() == 0)
                _exit(read(fds[0], NULL, 1));
        shutdown(fds[0], SHUT_RDWR); /* This must make read() return. */
        wait(NULL); /* To be woken up by _exit(). */
        return 0;
}
----------

Since shutdown(SHUT_RDWR) should affect all processes sharing that socket,
unconditionally setting sk->sk_shutdown to SHUTDOWN_MASK will be the right
behavior.

Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-10 12:21:39 -07:00
Chen Zhou
e9091bb77f bpf: Remove duplicate headers
Remove duplicate headers which are included twice.

Signed-off-by: Chen Zhou <chenzhou10@huawei.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200908132201.184005-1-chenzhou10@huawei.com
2020-09-10 10:53:14 -07:00
Parav Pandit
66b17082d1 devlink: Use controller while building phys_port_name
Now that controller number attribute is available, use it when
building phsy_port_name for external controller ports.

An example devlink port and representor netdev name consist of controller
annotation for external controller with controller number = 1,
for a VF 1 of PF 0:

$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev ens2f0c1pf0vf1 flavour pcivf controller 1 pfnum 0 vfnum 1 external true splittable false
  function:
    hw_addr 00:00:00:00:00:00

$ devlink port show pci/0000:06:00.0/2 -jp
{
    "port": {
        "pci/0000:06:00.0/2": {
            "type": "eth",
            "netdev": "ens2f0c1pf0vf1",
            "flavour": "pcivf",
            "controller": 1,
            "pfnum": 0,
            "vfnum": 1,
            "external": true,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:00:00"
            }
        }
    }
}

Controller number annotation is skipped for non external controllers to
maintain backward compatibility.

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-09 14:19:55 -07:00
Parav Pandit
3a2d9588c4 devlink: Introduce controller number
A devlink port may be for a controller consist of PCI device.
A devlink instance holds ports of two types of controllers.
(1) controller discovered on same system where eswitch resides
This is the case where PCI PF/VF of a controller and devlink eswitch
instance both are located on a single system.
(2) controller located on external host system.
This is the case where a controller is located in one system and its
devlink eswitch ports are located in a different system.

When a devlink eswitch instance serves the devlink ports of both
controllers together, PCI PF/VF numbers may overlap.
Due to this a unique phys_port_name cannot be constructed.

For example in below such system controller-0 and controller-1, each has
PCI PF pf0 whose eswitch ports can be present in controller-0.
These results in phys_port_name as "pf0" for both.
Similar problem exists for VFs and upcoming Sub functions.

An example view of two controller systems:

             ---------------------------------------------------------
             |                                                       |
             |           --------- ---------         ------- ------- |
-----------  |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
| server  |  | -------   ----/---- ---/----- ------- ---/--- ---/--- |
| pci rc  |=== | pf0 |______/________/       | pf1 |___/_______/     |
| connect |  | -------                       -------                 |
-----------  |     | controller_num=1 (no eswitch)                   |
             ------|--------------------------------------------------
             (internal wire)
                   |
             ---------------------------------------------------------
             | devlink eswitch ports and reps                        |
             | ----------------------------------------------------- |
             | |ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 | ctrl-0 |ctrl-0 | |
             | |pf0    | pf0vfN | pf0sfN | pf1    | pf1vfN |pf1sfN | |
             | ----------------------------------------------------- |
             | |ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 | ctrl-1 |ctrl-1 | |
             | |pf1    | pf1vfN | pf1sfN | pf1    | pf1vfN |pf0sfN | |
             | ----------------------------------------------------- |
             |                                                       |
             |                                                       |
             |           --------- ---------         ------- ------- |
             |           | vf(s) | | sf(s) |         |vf(s)| |sf(s)| |
             | -------   ----/---- ---/----- ------- ---/--- ---/--- |
             | | pf0 |______/________/       | pf1 |___/_______/     |
             | -------                       -------                 |
             |                                                       |
             |  local controller_num=0 (eswitch)                     |
             ---------------------------------------------------------

An example devlink port for external controller with controller
number = 1 for a VF 1 of PF 0:

$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev ens2f0pf0vf1 flavour pcivf controller 1 pfnum 0 vfnum 1 external true splittable false
  function:
    hw_addr 00:00:00:00:00:00

$ devlink port show pci/0000:06:00.0/2 -jp
{
    "port": {
        "pci/0000:06:00.0/2": {
            "type": "eth",
            "netdev": "ens2f0pf0vf1",
            "flavour": "pcivf",
            "controller": 1,
            "pfnum": 0,
            "vfnum": 1,
            "external": true,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:00:00"
            }
        }
    }
}

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-09 14:19:55 -07:00
Parav Pandit
05b595e9c4 devlink: Introduce external controller flag
A devlink eswitch port may represent PCI PF/VF ports of a controller.

A controller either located on same system or it can be an external
controller located in host where such NIC is plugged in.

Add the ability for driver to specify if a port is for external
controller.

Use such flag in the mlx5_core driver.

An example of an external controller having VF1 of PF0 belong to
controller 1.

$ devlink port show pci/0000:06:00.0/2
pci/0000:06:00.0/2: type eth netdev ens2f0pf0vf1 flavour pcivf pfnum 0 vfnum 1 external true splittable false
  function:
    hw_addr 00:00:00:00:00:00
$ devlink port show pci/0000:06:00.0/2 -jp
{
    "port": {
        "pci/0000:06:00.0/2": {
            "type": "eth",
            "netdev": "ens2f0pf0vf1",
            "flavour": "pcivf",
            "pfnum": 0,
            "vfnum": 1,
            "external": true,
            "splittable": false,
            "function": {
                "hw_addr": "00:00:00:00:00:00"
            }
        }
    }
}

Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-09 14:19:55 -07:00
David S. Miller
d85427e3c8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Rewrite inner header IPv6 in ICMPv6 messages in ip6t_NPT,
   from Michael Zhou.

2) do_ip_vs_set_ctl() dereferences uninitialized value,
   from Peilin Ye.

3) Support for userdata in tables, from Jose M. Guisado.

4) Do not increment ct error and invalid stats at the same time,
   from Florian Westphal.

5) Remove ct ignore stats, also from Florian.

6) Add ct stats for clash resolution, from Florian Westphal.

7) Bump reference counter bump on ct clash resolution only,
   this is safe because bucket lock is held, again from Florian.

8) Use ip_is_fragment() in xt_HMARK, from YueHaibing.

9) Add wildcard support for nft_socket, from Balazs Scheidler.

10) Remove superfluous IPVS dependency on iptables, from
    Yaroslav Bolyukin.

11) Remove unused definition in ebt_stp, from Wang Hai.

12) Replace CONFIG_NFT_CHAIN_NAT_{IPV4,IPV6} by CONFIG_NFT_NAT
    in selftests/net, from Fabian Frederick.

13) Add userdata support for nft_object, from Jose M. Guisado.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-09 11:21:19 -07:00
Ye Bin
b87f9fe1ac hsr: avoid newline at end of message in NL_SET_ERR_MSG_MOD
clean follow coccicheck warning:
net//hsr/hsr_netlink.c:94:8-42: WARNING avoid newline at end of message
in NL_SET_ERR_MSG_MOD
net//hsr/hsr_netlink.c:87:30-57: WARNING avoid newline at end of message
in NL_SET_ERR_MSG_MOD
net//hsr/hsr_netlink.c:79:29-53: WARNING avoid newline at end of message
in NL_SET_ERR_MSG_MOD

Signed-off-by: Ye Bin <yebin10@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-09 11:15:26 -07:00
Linus Torvalds
ab29a807a7 NFS client bugfixes for Linux 5.9
Highlights include:
 
 Bugfixes:
 - Fix an NFS/RDMA resource leak
 - Fix the error handling during delegation recall
 - NFSv4.0 needs to return the delegation on a zero-stateid SETATTR
 - Stop printk reading past end of string
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl9ZFYAACgkQZwvnipYK
 APLg+RAArQ0J54M4vTg7avKhUEwIrAlPCFjHvZ5jtlXiY8JDT7Cy2lEo9W/pC9x2
 BiV02H6seKXq6vKUHIBgzVq0BdZBKeWQcOpoO/dfvWSPs9u+lxKlOEwcdsaXwdXz
 31u5HS4xHYg2SlYj+BcKGfVexcWVEVyPqqPvflGBZIlKfzQLHo9YY390deUHMC6o
 HrRXWADvpYXC1sJb3mtNtCojqr9a5A8Ty4clT19YvdwQL7cUt3HjjsOvJfbmB9S+
 fW5/u3sdWJ1nYoz8AxC+utIMNmtXFBUhW0Sg+TPWMJj8yG9rclAgTxbobhXyzGph
 j2ZamPhUtpcSYXBlwiQCm7GbUIItnzHgU6MSCs/nq8AeDc3WEx4qVONVqNvNr/sY
 1T3znylZpXCHvxLmDWzDGsW8XvZT1r86Lm6zrJCmjWm+eoSKBzeoENcXGsGGYuJu
 6NGz7pgQbYMb9t7VfOEFSxxt5w0wt7nRyhV1R7taBhm5B9XjF+BOmJBI0epQ1S7i
 XRIr7WqxT00wijWyunNCQZxi1aDMHVYZXPwaqkEHTwJqeDzCtmir+ajAnZQUgUId
 1MNiv8BDoN5YlPmj/gt+E3kbyj0Pu7M+09NvVEKqG7j8W80ltf6eb85XGrq+vp1E
 Y0lmDXElBdNo3AA+dBOmk+peoVv4bfoog5PymElaRiwRM25VCOM=
 =3fw2
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:

 - Fix an NFS/RDMA resource leak

 - Fix the error handling during delegation recall

 - NFSv4.0 needs to return the delegation on a zero-stateid SETATTR

 - Stop printk reading past end of string

* tag 'nfs-for-5.9-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  SUNRPC: stop printk reading past end of string
  NFS: Zero-stateid SETATTR should first return delegation
  NFSv4.1 handle ERR_DELAY error reclaiming locking state on delegation recall
  xprtrdma: Release in-flight MRs on disconnect
2020-09-09 11:14:20 -07:00
Eric Dumazet
3ca1a42a52 net: qrtr: check skb_put_padto() return value
If skb_put_padto() returns an error, skb has been freed.
Better not touch it anymore, as reported by syzbot [1]

Note to qrtr maintainers : this suggests qrtr_sendmsg()
should adjust sock_alloc_send_skb() second parameter
to account for the potential added alignment to avoid
reallocation.

[1]

BUG: KASAN: use-after-free in __skb_insert include/linux/skbuff.h:1907 [inline]
BUG: KASAN: use-after-free in __skb_queue_before include/linux/skbuff.h:2016 [inline]
BUG: KASAN: use-after-free in __skb_queue_tail include/linux/skbuff.h:2049 [inline]
BUG: KASAN: use-after-free in skb_queue_tail+0x6b/0x120 net/core/skbuff.c:3146
Write of size 8 at addr ffff88804d8ab3c0 by task syz-executor.4/4316

CPU: 1 PID: 4316 Comm: syz-executor.4 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1d6/0x29e lib/dump_stack.c:118
 print_address_description+0x66/0x620 mm/kasan/report.c:383
 __kasan_report mm/kasan/report.c:513 [inline]
 kasan_report+0x132/0x1d0 mm/kasan/report.c:530
 __skb_insert include/linux/skbuff.h:1907 [inline]
 __skb_queue_before include/linux/skbuff.h:2016 [inline]
 __skb_queue_tail include/linux/skbuff.h:2049 [inline]
 skb_queue_tail+0x6b/0x120 net/core/skbuff.c:3146
 qrtr_tun_send+0x1a/0x40 net/qrtr/tun.c:23
 qrtr_node_enqueue+0x44f/0xc00 net/qrtr/qrtr.c:364
 qrtr_bcast_enqueue+0xbe/0x140 net/qrtr/qrtr.c:861
 qrtr_sendmsg+0x680/0x9c0 net/qrtr/qrtr.c:960
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg net/socket.c:671 [inline]
 sock_write_iter+0x317/0x470 net/socket.c:998
 call_write_iter include/linux/fs.h:1882 [inline]
 new_sync_write fs/read_write.c:503 [inline]
 vfs_write+0xa96/0xd10 fs/read_write.c:578
 ksys_write+0x11b/0x220 fs/read_write.c:631
 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45d5b9
Code: 5d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 2b b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f84b5b81c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000038b40 RCX: 000000000045d5b9
RDX: 0000000000000055 RSI: 0000000020001240 RDI: 0000000000000003
RBP: 00007f84b5b81ca0 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000000000f
R13: 00007ffcbbf86daf R14: 00007f84b5b829c0 R15: 000000000118cf4c

Allocated by task 4316:
 kasan_save_stack mm/kasan/common.c:48 [inline]
 kasan_set_track mm/kasan/common.c:56 [inline]
 __kasan_kmalloc+0x100/0x130 mm/kasan/common.c:461
 slab_post_alloc_hook+0x3e/0x290 mm/slab.h:518
 slab_alloc mm/slab.c:3312 [inline]
 kmem_cache_alloc+0x1c1/0x2d0 mm/slab.c:3482
 skb_clone+0x1b2/0x370 net/core/skbuff.c:1449
 qrtr_bcast_enqueue+0x6d/0x140 net/qrtr/qrtr.c:857
 qrtr_sendmsg+0x680/0x9c0 net/qrtr/qrtr.c:960
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg net/socket.c:671 [inline]
 sock_write_iter+0x317/0x470 net/socket.c:998
 call_write_iter include/linux/fs.h:1882 [inline]
 new_sync_write fs/read_write.c:503 [inline]
 vfs_write+0xa96/0xd10 fs/read_write.c:578
 ksys_write+0x11b/0x220 fs/read_write.c:631
 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 4316:
 kasan_save_stack mm/kasan/common.c:48 [inline]
 kasan_set_track+0x3d/0x70 mm/kasan/common.c:56
 kasan_set_free_info+0x17/0x30 mm/kasan/generic.c:355
 __kasan_slab_free+0xdd/0x110 mm/kasan/common.c:422
 __cache_free mm/slab.c:3418 [inline]
 kmem_cache_free+0x82/0xf0 mm/slab.c:3693
 __skb_pad+0x3f5/0x5a0 net/core/skbuff.c:1823
 __skb_put_padto include/linux/skbuff.h:3233 [inline]
 skb_put_padto include/linux/skbuff.h:3252 [inline]
 qrtr_node_enqueue+0x62f/0xc00 net/qrtr/qrtr.c:360
 qrtr_bcast_enqueue+0xbe/0x140 net/qrtr/qrtr.c:861
 qrtr_sendmsg+0x680/0x9c0 net/qrtr/qrtr.c:960
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg net/socket.c:671 [inline]
 sock_write_iter+0x317/0x470 net/socket.c:998
 call_write_iter include/linux/fs.h:1882 [inline]
 new_sync_write fs/read_write.c:503 [inline]
 vfs_write+0xa96/0xd10 fs/read_write.c:578
 ksys_write+0x11b/0x220 fs/read_write.c:631
 do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The buggy address belongs to the object at ffff88804d8ab3c0
 which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 0 bytes inside of
 224-byte region [ffff88804d8ab3c0, ffff88804d8ab4a0)
The buggy address belongs to the page:
page:00000000ea8cccfb refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88804d8abb40 pfn:0x4d8ab
flags: 0xfffe0000000200(slab)
raw: 00fffe0000000200 ffffea0002237ec8 ffffea00029b3388 ffff88821bb66800
raw: ffff88804d8abb40 ffff88804d8ab000 000000010000000b 0000000000000000
page dumped because: kasan: bad access detected

Fixes: ce57785bf9 ("net: qrtr: fix len of skb_put_padto in qrtr_node_enqueue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Carl Huang <cjhuang@codeaurora.org>
Cc: Wen Gong <wgong@codeaurora.org>
Cc: Bjorn Andersson <bjorn.andersson@linaro.org>
Cc: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Acked-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Reviewed-by: Bjorn Andersson <bjorn.andersson@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-09 11:04:39 -07:00
Wei Wang
e92dd77e6f ipv6: add tos reflection in TCP reset and ack
Currently, ipv6 stack does not do any TOS reflection. To make the
behavior consistent with v4 stack, this commit adds TOS reflection in
tcp_v6_reqsk_send_ack() and tcp_v6_send_reset(). We clear the lower
2-bit ECN value of the received TOS in compliance with RFC 3168 6.1.5
robustness principles.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08 20:20:55 -07:00
Wei Wang
ba9e04a7dd ip: fix tos reflection in ack and reset packets
Currently, in tcp_v4_reqsk_send_ack() and tcp_v4_send_reset(), we
echo the TOS value of the received packets in the response.
However, we do not want to echo the lower 2 ECN bits in accordance
with RFC 3168 6.1.5 robustness principles.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08 20:19:08 -07:00
David S. Miller
56bbc22d83 RxRPC development
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl9X6OcACgkQ+7dXa6fL
 C2s3Dg/9FyGc4Na08Rc5q8K2TSVhsx1dqBscf/t3gIulCaShbbRRFAb4UecrxOTn
 GV/eekGVejh/a+cCCVjMGFNP2FxyW9WTFoxINaTFp+bdGURdo8MOJTCTc/1Sgflb
 M7s+2mubYHfHLw4/VUR8tCAhODPC0JUTiDnkZqXZxA8SizUPAkNZySH7OfgW9PjA
 PFy/lB5IbR9mU64FWMALU2Rs09Gav/KRA+/X8fyE80i1MgYAz82U7x/EQa0XCmdR
 B61VObK8z9/o7Juu8GzM01k8b5zoWraHph0e/BdFaEN1KY5phPdPyerGa67GcgwX
 imnvcW4N75QBlHhE8C34eKnbIaNIPUR9ZNJdEgnA29CPOtP87rSPaeNNn4ISyF8r
 TnGGpAeBHMahrY2dhKK0lIv9Gbwd1j+OAlrL9O+j4KTaKZc9CqnXLzKp3ugXDeex
 RzHptlsc0I0bQ8ZeCA1jS9OFmypaRtYabE9DSrPS2Epbb8SdcCE4ZTXabB3A/ytk
 HBle/MfGcQN9yFAJpDJ/0Bj6PmUZgohVS7qMrZi+JV99vkNPebxxxEvoiM5il2km
 3DXPrD3rv9/qg8F6xVHYkWXVxZk6FCMe8/sg9VKUfOhmwEiFBkXv2IwTP/s3nN0g
 0h46rXAzjWXY8YMZzsgBHVe/Vp2MfJGsx9hyO1JppRbxGb1PVeU=
 =f+/X
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-next-20200908' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Allow more calls to same peer

Here are some development patches for AF_RXRPC that allow more simultaneous
calls to be made to the same peer with the same security parameters.  The
current code allows a maximum of 4 simultaneous calls, which limits the afs
filesystem to that many simultaneous threads.  This increases the limit to
16.

To make this work, the way client connections are limited has to be changed
(incoming call/connection limits are unaffected) as the current code
depends on queuing calls on a connection and then pushing the connection
through a queue.  The limit is on the number of available connections.

This is changed such that there's a limit[*] on the total number of calls
systemwide across all namespaces, but the limit on the number of client
connections is removed.

Once a call is allowed to proceed, it finds a bundle of connections and
tries to grab a call slot.  If there's a spare call slot, fine, otherwise
it will wait.  If there's already a waiter, it will try to create another
connection in the bundle, unless the limit of 4 is reached (4 calls per
connection, giving 16).

A number of things throttle someone trying to set up endless connections:

 - Calls that fail immediately have their conns deleted immediately,

 - Calls that don't fail immediately have to wait for a timeout,

 - Connections normally get automatically reaped if they haven't been used
   for 2m, but this is sped up to 2s if the number of connections rises
   over 900.  This number is tunable by sysctl.

[*] Technically two limits - kernel sockets and userspace rxrpc sockets are
    accounted separately.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08 20:18:17 -07:00
David S. Miller
6fd40d32ef Merge tag 'ieee802154-for-davem-2020-09-08' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan
Stefan Schmidt says:

====================
pull-request: ieee802154 for net 2020-09-08

An update from ieee802154 for your *net* tree.

A potential memory leak fix for ca8210 from Liu Jian,
a check on the return for a register read in adf7242
and finally a user after free fix in the softmac tx
function from Eric found by syzkaller.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08 20:12:58 -07:00
Nikolay Aleksandrov
071445c605 net: bridge: mcast: fix unused br var when lockdep isn't defined
Stephen reported the following warning:
 net/bridge/br_multicast.c: In function 'br_multicast_find_port':
 net/bridge/br_multicast.c:1818:21: warning: unused variable 'br' [-Wunused-variable]
  1818 |  struct net_bridge *br = mp->br;
       |                     ^~

It happens due to bridge's mlock_dereference() when lockdep isn't defined.
Silence the warning by annotating the variable as __maybe_unused.

Fixes: 0436862e41 ("net: bridge: mcast: support for IGMPv3/MLDv2 ALLOW_NEW_SOURCES report")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08 20:11:57 -07:00
Brian Vazquez
923f614cdb fib: fix fib_rule_ops indirect call wrappers when CONFIG_IPV6=m
If CONFIG_IPV6=m, the IPV6 functions won't be found by the linker:

ld: net/core/fib_rules.o: in function `fib_rules_lookup':
fib_rules.c:(.text+0x606): undefined reference to `fib6_rule_match'
ld: fib_rules.c:(.text+0x611): undefined reference to `fib6_rule_match'
ld: fib_rules.c:(.text+0x68c): undefined reference to `fib6_rule_action'
ld: fib_rules.c:(.text+0x693): undefined reference to `fib6_rule_action'
ld: fib_rules.c:(.text+0x6aa): undefined reference to `fib6_rule_suppress'
ld: fib_rules.c:(.text+0x6bc): undefined reference to `fib6_rule_suppress'
make: *** [Makefile:1166: vmlinux] Error 1

Reported-by: Sven Joachim <svenjoac@gmx.de>
Fixes: b9aaec8f0b ("fib: use indirect call wrappers in the most common fib_rules_ops")
Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested
Signed-off-by: Brian Vazquez <brianvv@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08 20:09:08 -07:00
David S. Miller
2650be2c2d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

===================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Allow conntrack entries with l3num == NFPROTO_IPV4 or == NFPROTO_IPV6
   only via ctnetlink, from Will McVicker.

2) Batch notifications to userspace to improve netlink socket receive
   utilization.

3) Restore mark based dump filtering via ctnetlink, from Martin Willi.

4) nf_conncount_init() fails with -EPROTO with CONFIG_IPV6, from
   Eelco Chaudron.

5) Containers fail to match on meta skuid and skgid, use socket user_ns
   to retrieve meta skuid and skgid.
===================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08 20:07:58 -07:00
Wang Hai
8c70b26817 netlabel: Fix some kernel-doc warnings
Fixes the following W=1 kernel build warning(s):

net/netlabel/netlabel_calipso.c:438: warning: Excess function parameter 'audit_secid' description in 'calipso_doi_remove'
net/netlabel/netlabel_calipso.c:605: warning: Excess function parameter 'reg' description in 'calipso_req_delattr'

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08 20:04:27 -07:00
Wang Hai
7edce63666 cipso: fix 'audit_secid' kernel-doc warning in cipso_ipv4.c
Fixes the following W=1 kernel build warning(s):

net/ipv4/cipso_ipv4.c:510: warning: Excess function parameter 'audit_secid' description in 'cipso_v4_doi_remove'

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08 20:03:36 -07:00
Eric Dumazet
843d926b00 ipv6: avoid lockdep issue in fib6_del()
syzbot reported twice a lockdep issue in fib6_del() [1]
which I think is caused by net->ipv6.fib6_null_entry
having a NULL fib6_table pointer.

fib6_del() already checks for fib6_null_entry special
case, we only need to return earlier.

Bug seems to occur very rarely, I have thus chosen
a 'bug origin' that makes backports not too complex.

[1]
WARNING: suspicious RCU usage
5.9.0-rc4-syzkaller #0 Not tainted
-----------------------------
net/ipv6/ip6_fib.c:1996 suspicious rcu_dereference_protected() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
4 locks held by syz-executor.5/8095:
 #0: ffffffff8a7ea708 (rtnl_mutex){+.+.}-{3:3}, at: ppp_release+0x178/0x240 drivers/net/ppp/ppp_generic.c:401
 #1: ffff88804c422dd8 (&net->ipv6.fib6_gc_lock){+.-.}-{2:2}, at: spin_trylock_bh include/linux/spinlock.h:414 [inline]
 #1: ffff88804c422dd8 (&net->ipv6.fib6_gc_lock){+.-.}-{2:2}, at: fib6_run_gc+0x21b/0x2d0 net/ipv6/ip6_fib.c:2312
 #2: ffffffff89bd6a40 (rcu_read_lock){....}-{1:2}, at: __fib6_clean_all+0x0/0x290 net/ipv6/ip6_fib.c:2613
 #3: ffff8880a82e6430 (&tb->tb6_lock){+.-.}-{2:2}, at: spin_lock_bh include/linux/spinlock.h:359 [inline]
 #3: ffff8880a82e6430 (&tb->tb6_lock){+.-.}-{2:2}, at: __fib6_clean_all+0x107/0x290 net/ipv6/ip6_fib.c:2245

stack backtrace:
CPU: 1 PID: 8095 Comm: syz-executor.5 Not tainted 5.9.0-rc4-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x198/0x1fd lib/dump_stack.c:118
 fib6_del+0x12b4/0x1630 net/ipv6/ip6_fib.c:1996
 fib6_clean_node+0x39b/0x570 net/ipv6/ip6_fib.c:2180
 fib6_walk_continue+0x4aa/0x8e0 net/ipv6/ip6_fib.c:2102
 fib6_walk+0x182/0x370 net/ipv6/ip6_fib.c:2150
 fib6_clean_tree+0xdb/0x120 net/ipv6/ip6_fib.c:2230
 __fib6_clean_all+0x120/0x290 net/ipv6/ip6_fib.c:2246
 fib6_clean_all net/ipv6/ip6_fib.c:2257 [inline]
 fib6_run_gc+0x113/0x2d0 net/ipv6/ip6_fib.c:2320
 ndisc_netdev_event+0x217/0x350 net/ipv6/ndisc.c:1805
 notifier_call_chain+0xb5/0x200 kernel/notifier.c:83
 call_netdevice_notifiers_info+0xb5/0x130 net/core/dev.c:2033
 call_netdevice_notifiers_extack net/core/dev.c:2045 [inline]
 call_netdevice_notifiers net/core/dev.c:2059 [inline]
 dev_close_many+0x30b/0x650 net/core/dev.c:1634
 rollback_registered_many+0x3a8/0x1210 net/core/dev.c:9261
 rollback_registered net/core/dev.c:9329 [inline]
 unregister_netdevice_queue+0x2dd/0x570 net/core/dev.c:10410
 unregister_netdevice include/linux/netdevice.h:2774 [inline]
 ppp_release+0x216/0x240 drivers/net/ppp/ppp_generic.c:403
 __fput+0x285/0x920 fs/file_table.c:281
 task_work_run+0xdd/0x190 kernel/task_work.c:141
 tracehook_notify_resume include/linux/tracehook.h:188 [inline]
 exit_to_user_mode_loop kernel/entry/common.c:163 [inline]
 exit_to_user_mode_prepare+0x1e1/0x200 kernel/entry/common.c:190
 syscall_exit_to_user_mode+0x7e/0x2e0 kernel/entry/common.c:265
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Fixes: 421842edea ("net/ipv6: Add fib6_null_entry")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08 19:56:25 -07:00
Vladimir Oltean
2f1e8ea726 net: dsa: link interfaces with the DSA master to get rid of lockdep warnings
Since commit 845e0ebb44 ("net: change addr_list_lock back to static
key"), cascaded DSA setups (DSA switch port as DSA master for another
DSA switch port) are emitting this lockdep warning:

============================================
WARNING: possible recursive locking detected
5.8.0-rc1-00133-g923e4b5032dd-dirty #208 Not tainted
--------------------------------------------
dhcpcd/323 is trying to acquire lock:
ffff000066dd4268 (&dsa_master_addr_list_lock_key/1){+...}-{2:2}, at: dev_mc_sync+0x44/0x90

but task is already holding lock:
ffff00006608c268 (&dsa_master_addr_list_lock_key/1){+...}-{2:2}, at: dev_mc_sync+0x44/0x90

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(&dsa_master_addr_list_lock_key/1);
  lock(&dsa_master_addr_list_lock_key/1);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by dhcpcd/323:
 #0: ffffdbd1381dda18 (rtnl_mutex){+.+.}-{3:3}, at: rtnl_lock+0x24/0x30
 #1: ffff00006614b268 (_xmit_ETHER){+...}-{2:2}, at: dev_set_rx_mode+0x28/0x48
 #2: ffff00006608c268 (&dsa_master_addr_list_lock_key/1){+...}-{2:2}, at: dev_mc_sync+0x44/0x90

stack backtrace:
Call trace:
 dump_backtrace+0x0/0x1e0
 show_stack+0x20/0x30
 dump_stack+0xec/0x158
 __lock_acquire+0xca0/0x2398
 lock_acquire+0xe8/0x440
 _raw_spin_lock_nested+0x64/0x90
 dev_mc_sync+0x44/0x90
 dsa_slave_set_rx_mode+0x34/0x50
 __dev_set_rx_mode+0x60/0xa0
 dev_mc_sync+0x84/0x90
 dsa_slave_set_rx_mode+0x34/0x50
 __dev_set_rx_mode+0x60/0xa0
 dev_set_rx_mode+0x30/0x48
 __dev_open+0x10c/0x180
 __dev_change_flags+0x170/0x1c8
 dev_change_flags+0x2c/0x70
 devinet_ioctl+0x774/0x878
 inet_ioctl+0x348/0x3b0
 sock_do_ioctl+0x50/0x310
 sock_ioctl+0x1f8/0x580
 ksys_ioctl+0xb0/0xf0
 __arm64_sys_ioctl+0x28/0x38
 el0_svc_common.constprop.0+0x7c/0x180
 do_el0_svc+0x2c/0x98
 el0_sync_handler+0x9c/0x1b8
 el0_sync+0x158/0x180

Since DSA never made use of the netdev API for describing links between
upper devices and lower devices, the dev->lower_level value of a DSA
switch interface would be 1, which would warn when it is a DSA master.

We can use netdev_upper_dev_link() to describe the relationship between
a DSA slave and a DSA master. To be precise, a DSA "slave" (switch port)
is an "upper" to a DSA "master" (host port). The relationship is "many
uppers to one lower", like in the case of VLAN. So, for that reason, we
use the same function as VLAN uses.

There might be a chance that somebody will try to take hold of this
interface and use it immediately after register_netdev() and before
netdev_upper_dev_link(). To avoid that, we do the registration and
linkage while holding the RTNL, and we use the RTNL-locked cousin of
register_netdev(), which is register_netdevice().

Since this warning was not there when lockdep was using dynamic keys for
addr_list_lock, we are blaming the lockdep patch itself. The network
stack _has_ been using static lockdep keys before, and it _is_ likely
that stacked DSA setups have been triggering these lockdep warnings
since forever, however I can't test very old kernels on this particular
stacked DSA setup, to ensure I'm not in fact introducing regressions.

Fixes: 845e0ebb44 ("net: change addr_list_lock back to static key")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08 19:40:09 -07:00
Tom Rix
c1f1f16c4d net: sched: skip an unnecessay check
Reviewing the error handling in tcf_action_init_1()
most of the early handling uses

err_out:
	if (cookie) {
		kfree(cookie->data);
		kfree(cookie);
	}

before cookie could ever be set.

So skip the unnecessay check.

Signed-off-by: Tom Rix <trix@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-08 19:34:36 -07:00
David Howells
288827d53e rxrpc: Allow multiple client connections to the same peer
Allow the number of parallel connections to a machine to be expanded from a
single connection to a maximum of four.  This allows up to 16 calls to be
in progress at the same time to any particular peer instead of 4.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-09-08 21:11:47 +01:00
David Howells
245500d853 rxrpc: Rewrite the client connection manager
Rewrite the rxrpc client connection manager so that it can support multiple
connections for a given security key to a peer.  The following changes are
made:

 (1) For each open socket, the code currently maintains an rbtree with the
     connections placed into it, keyed by communications parameters.  This
     is tricky to maintain as connections can be culled from the tree or
     replaced within it.  Connections can require replacement for a number
     of reasons, e.g. their IDs span too great a range for the IDR data
     type to represent efficiently, the call ID numbers on that conn would
     overflow or the conn got aborted.

     This is changed so that there's now a connection bundle object placed
     in the tree, keyed on the same parameters.  The bundle, however, does
     not need to be replaced.

 (2) An rxrpc_bundle object can now manage the available channels for a set
     of parallel connections.  The lock that manages this is moved there
     from the rxrpc_connection struct (channel_lock).

 (3) There'a a dummy bundle for all incoming connections to share so that
     they have a channel_lock too.  It might be better to give each
     incoming connection its own bundle.  This bundle is not needed to
     manage which channels incoming calls are made on because that's the
     solely at whim of the client.

 (4) The restrictions on how many client connections are around are
     removed.  Instead, a previous patch limits the number of client calls
     that can be allocated.  Ordinarily, client connections are reaped
     after 2 minutes on the idle queue, but when more than a certain number
     of connections are in existence, the reaper starts reaping them after
     2s of idleness instead to get the numbers back down.

     It could also be made such that new call allocations are forced to
     wait until the number of outstanding connections subsides.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-09-08 21:11:43 +01:00
David Howells
b7a7d67408 rxrpc: Impose a maximum number of client calls
Impose a maximum on the number of client rxrpc calls that are allowed
simultaneously.  This will be in lieu of a maximum number of client
connections as this is easier to administed as, unlike connections, calls
aren't reusable (to be changed in a subsequent patch)..

This doesn't affect the limits on service calls and connections.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-09-08 21:10:45 +01:00
Daniel Borkmann
e6a18d3611 bpf: Fix clobbering of r2 in bpf_gen_ld_abs
Bryce reported that he saw the following with:

  0:  r6 = r1
  1:  r1 = 12
  2:  r0 = *(u16 *)skb[r1]

The xlated sequence was incorrectly clobbering r2 with pointer
value of r6 ...

  0: (bf) r6 = r1
  1: (b7) r1 = 12
  2: (bf) r1 = r6
  3: (bf) r2 = r1
  4: (85) call bpf_skb_load_helper_16_no_cache#7692160

... and hence call to the load helper never succeeded given the
offset was too high. Fix it by reordering the load of r6 to r1.

Other than that the insn has similar calling convention than BPF
helpers, that is, r0 - r5 are scratch regs, so nothing else
affected after the insn.

Fixes: e0cea7ce98 ("bpf: implement ld_abs/ld_ind in native bpf")
Reported-by: Bryce Kahle <bryce.kahle@datadoghq.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/cace836e4d07bb63b1a53e49c5dfb238a040c298.1599512096.git.daniel@iogearbox.net
2020-09-08 09:16:12 -07:00
Jose M. Guisado Gomez
b131c96496 netfilter: nf_tables: add userdata support for nft_object
Enables storing userdata for nft_object. Initially this will store an
optional comment but can be extended in the future as needed.

Adds new attribute NFTA_OBJ_USERDATA to nft_object.

Signed-off-by: Jose M. Guisado Gomez <guigom@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-09-08 16:35:38 +02:00
Eric Dumazet
0ff4628f4c mac802154: tx: fix use-after-free
syzbot reported a bug in ieee802154_tx() [1]

A similar issue in ieee802154_xmit_worker() is also fixed in this patch.

[1]
BUG: KASAN: use-after-free in ieee802154_tx+0x3d2/0x480 net/mac802154/tx.c:88
Read of size 4 at addr ffff8880251a8c70 by task syz-executor.3/928

CPU: 0 PID: 928 Comm: syz-executor.3 Not tainted 5.9.0-rc3-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x198/0x1fd lib/dump_stack.c:118
 print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
 __kasan_report mm/kasan/report.c:513 [inline]
 kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
 ieee802154_tx+0x3d2/0x480 net/mac802154/tx.c:88
 ieee802154_subif_start_xmit+0xbe/0xe4 net/mac802154/tx.c:130
 __netdev_start_xmit include/linux/netdevice.h:4634 [inline]
 netdev_start_xmit include/linux/netdevice.h:4648 [inline]
 dev_direct_xmit+0x4e9/0x6e0 net/core/dev.c:4203
 packet_snd net/packet/af_packet.c:2989 [inline]
 packet_sendmsg+0x2413/0x5290 net/packet/af_packet.c:3014
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:671
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2353
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2407
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2440
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x45d5b9
Code: 5d b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 2b b4 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fc98e749c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 000000000002ccc0 RCX: 000000000045d5b9
RDX: 0000000000000000 RSI: 0000000020007780 RDI: 000000000000000b
RBP: 000000000118d020 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 000000000118cfec
R13: 00007fff690c720f R14: 00007fc98e74a9c0 R15: 000000000118cfec

Allocated by task 928:
 kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
 kasan_set_track mm/kasan/common.c:56 [inline]
 __kasan_kmalloc.constprop.0+0xbf/0xd0 mm/kasan/common.c:461
 slab_post_alloc_hook mm/slab.h:518 [inline]
 slab_alloc_node mm/slab.c:3254 [inline]
 kmem_cache_alloc_node+0x136/0x3e0 mm/slab.c:3574
 __alloc_skb+0x71/0x550 net/core/skbuff.c:198
 alloc_skb include/linux/skbuff.h:1094 [inline]
 alloc_skb_with_frags+0x92/0x570 net/core/skbuff.c:5771
 sock_alloc_send_pskb+0x72a/0x880 net/core/sock.c:2348
 packet_alloc_skb net/packet/af_packet.c:2837 [inline]
 packet_snd net/packet/af_packet.c:2932 [inline]
 packet_sendmsg+0x19fb/0x5290 net/packet/af_packet.c:3014
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:671
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2353
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2407
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2440
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Freed by task 928:
 kasan_save_stack+0x1b/0x40 mm/kasan/common.c:48
 kasan_set_track+0x1c/0x30 mm/kasan/common.c:56
 kasan_set_free_info+0x1b/0x30 mm/kasan/generic.c:355
 __kasan_slab_free+0xd8/0x120 mm/kasan/common.c:422
 __cache_free mm/slab.c:3418 [inline]
 kmem_cache_free.part.0+0x74/0x1e0 mm/slab.c:3693
 kfree_skbmem+0xef/0x1b0 net/core/skbuff.c:622
 __kfree_skb net/core/skbuff.c:679 [inline]
 consume_skb net/core/skbuff.c:838 [inline]
 consume_skb+0xcf/0x160 net/core/skbuff.c:832
 __dev_kfree_skb_any+0x9c/0xc0 net/core/dev.c:3107
 fakelb_hw_xmit+0x20e/0x2a0 drivers/net/ieee802154/fakelb.c:81
 drv_xmit_async net/mac802154/driver-ops.h:16 [inline]
 ieee802154_tx+0x282/0x480 net/mac802154/tx.c:81
 ieee802154_subif_start_xmit+0xbe/0xe4 net/mac802154/tx.c:130
 __netdev_start_xmit include/linux/netdevice.h:4634 [inline]
 netdev_start_xmit include/linux/netdevice.h:4648 [inline]
 dev_direct_xmit+0x4e9/0x6e0 net/core/dev.c:4203
 packet_snd net/packet/af_packet.c:2989 [inline]
 packet_sendmsg+0x2413/0x5290 net/packet/af_packet.c:3014
 sock_sendmsg_nosec net/socket.c:651 [inline]
 sock_sendmsg+0xcf/0x120 net/socket.c:671
 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2353
 ___sys_sendmsg+0xf3/0x170 net/socket.c:2407
 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2440
 do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

The buggy address belongs to the object at ffff8880251a8c00
 which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 112 bytes inside of
 224-byte region [ffff8880251a8c00, ffff8880251a8ce0)
The buggy address belongs to the page:
page:0000000062b6a4f1 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x251a8
flags: 0xfffe0000000200(slab)
raw: 00fffe0000000200 ffffea0000435c88 ffffea00028b6c08 ffff8880a9055d00
raw: 0000000000000000 ffff8880251a80c0 000000010000000c 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8880251a8b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8880251a8b80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc
>ffff8880251a8c00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                                             ^
 ffff8880251a8c80: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
 ffff8880251a8d00: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb

Fixes: 409c3b0c5f ("mac802154: tx: move stats tx increment")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Alexander Aring <alex.aring@gmail.com>
Cc: Stefan Schmidt <stefan@datenfreihafen.org>
Cc: linux-wpan@vger.kernel.org
Link: https://lore.kernel.org/r/20200908104025.4009085-1-edumazet@google.com
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2020-09-08 16:35:32 +02:00
Pablo Neira Ayuso
0c92411bb8 netfilter: nft_meta: use socket user_ns to retrieve skuid and skgid
... instead of using init_user_ns.

Fixes: 96518518cc ("netfilter: add nftables")
Tested-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-09-08 13:04:56 +02:00
Eelco Chaudron
526e81b990 netfilter: conntrack: nf_conncount_init is failing with IPv6 disabled
The openvswitch module fails initialization when used in a kernel
without IPv6 enabled. nf_conncount_init() fails because the ct code
unconditionally tries to initialize the netns IPv6 related bit,
regardless of the build option. The change below ignores the IPv6
part if not enabled.

Note that the corresponding _put() function already has this IPv6
configuration check.

Fixes: 11efd5cb04 ("openvswitch: Support conntrack zone limit")
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-09-08 13:04:54 +02:00
Martin Willi
6c0d95d123 netfilter: ctnetlink: fix mark based dump filtering regression
conntrack mark based dump filtering may falsely skip entries if a mask
is given: If the mask-based check does not filter out the entry, the
else-if check is always true and compares the mark without considering
the mask. The if/else-if logic seems wrong.

Given that the mask during filter setup is implicitly set to 0xffffffff
if not specified explicitly, the mark filtering flags seem to just
complicate things. Restore the previously used approach by always
matching against a zero mask is no filter mark is given.

Fixes: cb8aa9a3af ("netfilter: ctnetlink: add kernel side filtering for dump")
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-09-08 13:04:51 +02:00
Pablo Neira Ayuso
67cc570eda netfilter: nf_tables: coalesce multiple notifications into one skbuff
On x86_64, each notification results in one skbuff allocation which
consumes at least 768 bytes due to the skbuff overhead.

This patch coalesces several notifications into one single skbuff, so
each notification consumes at least ~211 bytes, that ~3.5 times less
memory consumption. As a result, this is reducing the chances to exhaust
the netlink socket receive buffer.

Rule of thumb is that each notification batch only contains netlink
messages whose report flag is the same, nfnetlink_send() requires this
to do appropriate delivery to userspace, either via unicast (echo
mode) or multicast (monitor mode).

The skbuff control buffer is used to annotate the report flag for later
handling at the new coalescing routine.

The batch skbuff notification size is NLMSG_GOODSIZE, using a larger
skbuff would allow for more socket receiver buffer savings (to amortize
the cost of the skbuff even more), however, going over that size might
break userspace applications, so let's be conservative and stick to
NLMSG_GOODSIZE.

Reported-by: Phil Sutter <phil@nwl.cc>
Acked-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-09-08 13:02:44 +02:00
Wang Hai
36c3be8a2c netfilter: ebt_stp: Remove unused macro BPDU_TYPE_TCN
BPDU_TYPE_TCN is never used after it was introduced.
So better to remove it.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-09-08 12:56:38 +02:00
Will McVicker
1cc5ef91d2 netfilter: ctnetlink: add a range check for l3/l4 protonum
The indexes to the nf_nat_l[34]protos arrays come from userspace. So
check the tuple's family, e.g. l3num, when creating the conntrack in
order to prevent an OOB memory access during setup.  Here is an example
kernel panic on 4.14.180 when userspace passes in an index greater than
NFPROTO_NUMPROTO.

Internal error: Oops - BUG: 0 [#1] PREEMPT SMP
Modules linked in:...
Process poc (pid: 5614, stack limit = 0x00000000a3933121)
CPU: 4 PID: 5614 Comm: poc Tainted: G S      W  O    4.14.180-g051355490483
Hardware name: Qualcomm Technologies, Inc. SM8150 V2 PM8150 Google Inc. MSM
task: 000000002a3dfffe task.stack: 00000000a3933121
pc : __cfi_check_fail+0x1c/0x24
lr : __cfi_check_fail+0x1c/0x24
...
Call trace:
__cfi_check_fail+0x1c/0x24
name_to_dev_t+0x0/0x468
nfnetlink_parse_nat_setup+0x234/0x258
ctnetlink_parse_nat_setup+0x4c/0x228
ctnetlink_new_conntrack+0x590/0xc40
nfnetlink_rcv_msg+0x31c/0x4d4
netlink_rcv_skb+0x100/0x184
nfnetlink_rcv+0xf4/0x180
netlink_unicast+0x360/0x770
netlink_sendmsg+0x5a0/0x6a4
___sys_sendmsg+0x314/0x46c
SyS_sendmsg+0xb4/0x108
el0_svc_naked+0x34/0x38

This crash is not happening since 5.4+, however, ctnetlink still
allows for creating entries with unsupported layer 3 protocol number.

Fixes: c1d10adb4a ("[NETFILTER]: Add ctnetlink port for nf_conntrack")
Signed-off-by: Will McVicker <willmcvicker@google.com>
[pablo@netfilter.org: rebased original patch on top of nf.git]
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-09-08 11:41:43 +02:00
Vladimir Oltean
4349abdb40 net: dsa: don't print non-fatal MTU error if not supported
Commit 72579e14a1 ("net: dsa: don't fail to probe if we couldn't set
the MTU") changed, for some reason, the "err && err != -EOPNOTSUPP"
check into a simple "err". This causes the MTU warning to be printed
even for drivers that don't have the MTU operations implemented.
Fix that.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 21:01:50 -07:00
Vladimir Oltean
c9ebf126f1 net: dsa: change PHY error message again
slave_dev->name is only populated at this stage if it was specified
through a label in the device tree. However that is not mandatory.
When it isn't, the error message looks like this:

[    5.037057] fsl_enetc 0000:00:00.2 eth2: error -19 setting up slave PHY for eth%d
[    5.044672] fsl_enetc 0000:00:00.2 eth2: error -19 setting up slave PHY for eth%d
[    5.052275] fsl_enetc 0000:00:00.2 eth2: error -19 setting up slave PHY for eth%d
[    5.059877] fsl_enetc 0000:00:00.2 eth2: error -19 setting up slave PHY for eth%d

which is especially confusing since the error gets printed on behalf of
the DSA master (fsl_enetc in this case).

Printing an error message that contains a valid reference to the DSA
port's name is difficult at this point in the initialization stage, so
at least we should print some info that is more reliable, even if less
user-friendly. That may be the driver name and the hardware port index.

After this change, the error is printed as:

[    6.051587] mscc_felix 0000:00:00.5: error -19 setting up PHY for tree 0, switch 0, port 0
[    6.061192] mscc_felix 0000:00:00.5: error -19 setting up PHY for tree 0, switch 0, port 1
[    6.070765] mscc_felix 0000:00:00.5: error -19 setting up PHY for tree 0, switch 0, port 2
[    6.080324] mscc_felix 0000:00:00.5: error -19 setting up PHY for tree 0, switch 0, port 3

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 21:00:53 -07:00
Wang Hai
81365af13a rxrpc: Remove unused macro rxrpc_min_rtt_wlen
rxrpc_min_rtt_wlen is never used after it was introduced.
So better to remove it.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 15:04:41 -07:00
Taehee Yoo
e1f469cd58 Revert "netns: don't disable BHs when locking "nsid_lock""
This reverts commit 8d7e5dee97.

To protect netns id, the nsid_lock is used when netns id is being
allocated and removed by peernet2id_alloc() and unhash_nsid().
The nsid_lock can be used in BH context but only spin_lock() is used
in this code.
Using spin_lock() instead of spin_lock_bh() can result in a deadlock in
the following scenario reported by the lockdep.
In order to avoid a deadlock, the spin_lock_bh() should be used instead
of spin_lock() to acquire nsid_lock.

Test commands:
    ip netns del nst
    ip netns add nst
    ip link add veth1 type veth peer name veth2
    ip link set veth1 netns nst
    ip netns exec nst ip link add name br1 type bridge vlan_filtering 1
    ip netns exec nst ip link set dev br1 up
    ip netns exec nst ip link set dev veth1 master br1
    ip netns exec nst ip link set dev veth1 up
    ip netns exec nst ip link add macvlan0 link br1 up type macvlan

Splat looks like:
[   33.615860][  T607] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[   33.617194][  T607] 5.9.0-rc1+ #665 Not tainted
[ ... ]
[   33.670615][  T607] Chain exists of:
[   33.670615][  T607]   &mc->mca_lock --> &bridge_netdev_addr_lock_key --> &net->nsid_lock
[   33.670615][  T607]
[   33.673118][  T607]  Possible interrupt unsafe locking scenario:
[   33.673118][  T607]
[   33.674599][  T607]        CPU0                    CPU1
[   33.675557][  T607]        ----                    ----
[   33.676516][  T607]   lock(&net->nsid_lock);
[   33.677306][  T607]                                local_irq_disable();
[   33.678517][  T607]                                lock(&mc->mca_lock);
[   33.679725][  T607]                                lock(&bridge_netdev_addr_lock_key);
[   33.681166][  T607]   <Interrupt>
[   33.681791][  T607]     lock(&mc->mca_lock);
[   33.682579][  T607]
[   33.682579][  T607]  *** DEADLOCK ***
[ ... ]
[   33.922046][  T607] stack backtrace:
[   33.922999][  T607] CPU: 3 PID: 607 Comm: ip Not tainted 5.9.0-rc1+ #665
[   33.924099][  T607] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[   33.925714][  T607] Call Trace:
[   33.926238][  T607]  dump_stack+0x78/0xab
[   33.926905][  T607]  check_irq_usage+0x70b/0x720
[   33.927708][  T607]  ? iterate_chain_key+0x60/0x60
[   33.928507][  T607]  ? check_path+0x22/0x40
[   33.929201][  T607]  ? check_noncircular+0xcf/0x180
[   33.930024][  T607]  ? __lock_acquire+0x1952/0x1f20
[   33.930860][  T607]  __lock_acquire+0x1952/0x1f20
[   33.931667][  T607]  lock_acquire+0xaf/0x3a0
[   33.932366][  T607]  ? peernet2id_alloc+0x3a/0x170
[   33.933147][  T607]  ? br_port_fill_attrs+0x54c/0x6b0 [bridge]
[   33.934140][  T607]  ? br_port_fill_attrs+0x5de/0x6b0 [bridge]
[   33.935113][  T607]  ? kvm_sched_clock_read+0x14/0x30
[   33.935974][  T607]  _raw_spin_lock+0x30/0x70
[   33.936728][  T607]  ? peernet2id_alloc+0x3a/0x170
[   33.937523][  T607]  peernet2id_alloc+0x3a/0x170
[   33.938313][  T607]  rtnl_fill_ifinfo+0xb5e/0x1400
[   33.939091][  T607]  rtmsg_ifinfo_build_skb+0x8a/0xf0
[   33.939953][  T607]  rtmsg_ifinfo_event.part.39+0x17/0x50
[   33.940863][  T607]  rtmsg_ifinfo+0x1f/0x30
[   33.941571][  T607]  __dev_notify_flags+0xa5/0xf0
[   33.942376][  T607]  ? __irq_work_queue_local+0x49/0x50
[   33.943249][  T607]  ? irq_work_queue+0x1d/0x30
[   33.943993][  T607]  ? __dev_set_promiscuity+0x7b/0x1a0
[   33.944878][  T607]  __dev_set_promiscuity+0x7b/0x1a0
[   33.945758][  T607]  dev_set_promiscuity+0x1e/0x50
[   33.946582][  T607]  br_port_set_promisc+0x1f/0x40 [bridge]
[   33.947487][  T607]  br_manage_promisc+0x8b/0xe0 [bridge]
[   33.948388][  T607]  __dev_set_promiscuity+0x123/0x1a0
[   33.949244][  T607]  __dev_set_rx_mode+0x68/0x90
[   33.950021][  T607]  dev_uc_add+0x50/0x60
[   33.950720][  T607]  macvlan_open+0x18e/0x1f0 [macvlan]
[   33.951601][  T607]  __dev_open+0xd6/0x170
[   33.952269][  T607]  __dev_change_flags+0x181/0x1d0
[   33.953056][  T607]  rtnl_configure_link+0x2f/0xa0
[   33.953884][  T607]  __rtnl_newlink+0x6b9/0x8e0
[   33.954665][  T607]  ? __lock_acquire+0x95d/0x1f20
[   33.955450][  T607]  ? lock_acquire+0xaf/0x3a0
[   33.956193][  T607]  ? is_bpf_text_address+0x5/0xe0
[   33.956999][  T607]  rtnl_newlink+0x47/0x70

Acked-by: Guillaume Nault <gnault@redhat.com>
Fixes: 8d7e5dee97 ("netns: don't disable BHs when locking "nsid_lock"")
Reported-by: syzbot+3f960c64a104eaa2c813@syzkaller.appspotmail.com
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 14:32:39 -07:00
Nikolay Aleksandrov
e12cec65b5 net: bridge: mcast: destroy all entries via gc
Since each entry type has timers that can be running simultaneously we need
to make sure that entries are not freed before their timers have finished.
In order to do that generalize the src gc work to mcast gc work and use a
callback to free the entries (mdb, port group or src).

v3: add IPv6 support
v2: force mcast gc on port del to make sure all port group timers have
    finished before freeing the bridge port

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 13:16:36 -07:00
Nikolay Aleksandrov
23550b8313 net: bridge: mcast: improve IGMPv3/MLDv2 query processing
When an IGMPv3/MLDv2 query is received and we're operating in such mode
then we need to avoid updating group timers if the suppress flag is set.
Also we should update only timers for groups in exclude mode.

v3: add IPv6/MLDv2 support

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 13:16:36 -07:00
Nikolay Aleksandrov
109865fe12 net: bridge: mcast: support for IGMPV3/MLDv2 BLOCK_OLD_SOURCES report
We already have all necessary helpers, so process IGMPV3/MLDv2
BLOCK_OLD_SOURCES as per the RFCs.

v3: add IPv6/MLDv2 support
v2: directly do flag bit operations

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 13:16:36 -07:00
Nikolay Aleksandrov
5bf1e00b68 net: bridge: mcast: support for IGMPV3/MLDv2 CHANGE_TO_INCLUDE/EXCLUDE report
In order to process IGMPV3/MLDv2 CHANGE_TO_INCLUDE/EXCLUDE report types we
need new helpers which allow us to mark entries based on their timer
state and to query only marked entries.

v3: add IPv6/MLDv2 support, fix other_query checks
v2: directly do flag bit operations

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 13:16:35 -07:00
Nikolay Aleksandrov
e6231bca6a net: bridge: mcast: support for IGMPV3/MLDv2 MODE_IS_INCLUDE/EXCLUDE report
In order to process IGMPV3/MLDv2_MODE_IS_INCLUDE/EXCLUDE report types we
need some new helpers which allow us to set/clear flags for all current
entries and later delete marked entries after the report sources have been
processed.

v3: add IPv6/MLDv2 support
v2: drop flag helpers and directly do flag bit operations

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 13:16:35 -07:00
Nikolay Aleksandrov
0436862e41 net: bridge: mcast: support for IGMPv3/MLDv2 ALLOW_NEW_SOURCES report
This patch adds handling for the ALLOW_NEW_SOURCES IGMPv3/MLDv2 report
types and limits them only when multicast_igmp_version == 3 or
multicast_mld_version == 2 respectively. Now that IGMPv3/MLDv2 handling
functions will be managing timers we need to delay their activation, thus
a new argument is added which controls if the timer should be updated.
We also disable host IGMPv3/MLDv2 handling as it's not yet implemented and
could cause inconsistent group state, the host can only join a group as
EXCLUDE {} or leave it.

v4: rename update_timer to igmpv2_mldv1 and use the passed value from
    br_multicast_add_group's callers
v3: Add IPv6/MLDv2 support

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 13:16:35 -07:00
Nikolay Aleksandrov
d6c33d67a8 net: bridge: mcast: delete expired port groups without srcs
If an expired port group is in EXCLUDE mode, then we have to turn it
into INCLUDE mode, remove all srcs with zero timer and finally remove
the group itself if there are no more srcs with an active timer.
For IGMPv2 use there would be no sources, so this will reduce to just
removing the group as before.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 13:16:35 -07:00
Nikolay Aleksandrov
81f1983852 net: bridge: mdb: use mdb and port entries in notifications
We have to use mdb and port entries when sending mdb notifications in
order to fill in all group attributes properly. Before this change we
would've used a fake br_mdb_entry struct to fill in only partial
information about the mdb. Now we can also reuse the mdb dump fill
function and thus have only a single central place which fills the mdb
attributes.

v3: add IPv6 support

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 13:16:35 -07:00
Nikolay Aleksandrov
79abc87505 net: bridge: mdb: push notifications in __br_mdb_add/del
This change is in preparation for using the mdb port group entries when
sending a notification, so their full state and additional attributes can
be filled in.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 13:16:35 -07:00
Nikolay Aleksandrov
42c11ccfe8 net: bridge: mcast: add support for group query retransmit
We need to be able to retransmit group-specific and group-and-source
specific queries. The new timer takes care of those.

v3: add IPv6 support

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 13:16:35 -07:00
Nikolay Aleksandrov
438ef2d027 net: bridge: mcast: add support for group-and-source specific queries
Allows br_multicast_alloc_query to build queries with the port group's
source lists and sends a query for sources over and under lmqt when
necessary as per RFCs 3376 and 3810 with the suppress flag set
appropriately.

v3: add IPv6 support

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 13:16:34 -07:00
Nikolay Aleksandrov
5205e919c9 net: bridge: mcast: add support for src list and filter mode dumping
Support per port group src list (address and timer) and filter mode
dumping. Protected by either multicast_lock or rcu.

v3: add IPv6 support
v2: require RCU or multicast_lock to traverse src groups

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 13:16:34 -07:00
Nikolay Aleksandrov
8b671779b7 net: bridge: mcast: add support for group source list
Initial functions for group source lists which are needed for IGMPv3
and MLDv2 include/exclude lists. Both IPv4 and IPv6 sources are supported.
User-added mdb entries are created with exclude filter mode, we can
extend that later to allow user-supplied mode. When group src entries
are deleted, they're freed from a workqueue to make sure their timers
are not still running. Source entries are protected by the multicast_lock
and rcu. The number of src groups per port group is limited to 32.

v4: use the new port group del function directly
    add igmpv2/mldv1 bool to denote if the entry was added in those
    modes, it will later replace the old update_timer bool
v3: add IPv6 support
v2: allow src groups to be traversed under rcu

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 13:16:34 -07:00
Nikolay Aleksandrov
681590bd4c net: bridge: mcast: factor out port group del
In order to avoid future errors and reduce code duplication we should
factor out the port group del sequence. This allows us to have one
function which takes care of all details when removing a port group.

v4: set pg's fast leave flag when deleting due to fast leave
    move the patch before adding source lists

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 13:16:34 -07:00
Nikolay Aleksandrov
6ec0d0ee66 net: bridge: mdb: arrange internal structs so fast-path fields are close
Before this patch we'd need 2 cache lines for fast-path, now all used
fields are in the first cache line.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 13:16:34 -07:00
Hoang Huu Le
d966ddcc38 tipc: fix a deadlock when flushing scheduled work
In the commit fdeba99b1e
("tipc: fix use-after-free in tipc_bcast_get_mode"), we're trying
to make sure the tipc_net_finalize_work work item finished if it
enqueued. But calling flush_scheduled_work() is not just affecting
above work item but either any scheduled work. This has turned out
to be overkill and caused to deadlock as syzbot reported:

======================================================
WARNING: possible circular locking dependency detected
5.9.0-rc2-next-20200828-syzkaller #0 Not tainted
------------------------------------------------------
kworker/u4:6/349 is trying to acquire lock:
ffff8880aa063d38 ((wq_completion)events){+.+.}-{0:0}, at: flush_workqueue+0xe1/0x13e0 kernel/workqueue.c:2777

but task is already holding lock:
ffffffff8a879430 (pernet_ops_rwsem){++++}-{3:3}, at: cleanup_net+0x9b/0xb10 net/core/net_namespace.c:565

[...]
 Possible unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  lock(pernet_ops_rwsem);
                               lock(&sb->s_type->i_mutex_key#13);
                               lock(pernet_ops_rwsem);
  lock((wq_completion)events);

 *** DEADLOCK ***
[...]

v1:
To fix the original issue, we replace above calling by introducing
a bit flag. When a namespace cleaned-up, bit flag is set to zero and:
- tipc_net_finalize functionial just does return immediately.
- tipc_net_finalize_work does not enqueue into the scheduled work queue.

v2:
Use cancel_work_sync() helper to make sure ONLY the
tipc_net_finalize_work() stopped before releasing bcbase object.

Reported-by: syzbot+d5aa7e0385f6a5d0f4fd@syzkaller.appspotmail.com
Fixes: fdeba99b1e ("tipc: fix use-after-free in tipc_bcast_get_mode")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-07 12:08:53 -07:00
Antony Antony
8366685b28 xfrm: clone whole liftime_cur structure in xfrm_do_migrate
When we clone state only add_time was cloned. It missed values like
bytes, packets.  Now clone the all members of the structure.

v1->v3:
 - use memcpy to copy the entire structure

Fixes: 80c9abaabf ("[XFRM]: Extension for dynamic update of endpoint address(es)")
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-09-07 12:45:22 +02:00
Antony Antony
7aa05d3047 xfrm: clone XFRMA_SEC_CTX in xfrm_do_migrate
XFRMA_SEC_CTX was not cloned from the old to the new.
Migrate this attribute during XFRMA_MSG_MIGRATE

v1->v2:
 - return -ENOMEM on error
v2->v3:
 - fix return type to int

Fixes: 80c9abaabf ("[XFRM]: Extension for dynamic update of endpoint address(es)")
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-09-07 12:45:22 +02:00
Antony Antony
545e5c5716 xfrm: clone XFRMA_SET_MARK in xfrm_do_migrate
XFRMA_SET_MARK and XFRMA_SET_MARK_MASK was not cloned from the old
to the new. Migrate these two attributes during XFRMA_MSG_MIGRATE

Fixes: 9b42c1f179 ("xfrm: Extend the output_mark to support input direction and masking.")
Signed-off-by: Antony Antony <antony.antony@secunet.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-09-07 12:45:22 +02:00
Jonathan Neuschäfer
ee1a4c84a7 net: Add a missing word
Signed-off-by: Jonathan Neuschäfer <j.neuschaefer@gmx.net>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-06 12:13:11 -07:00
Wang Hai
383e3f3ee8 net/packet: Remove unused macro BLOCK_PRIV
BLOCK_PRIV is never used after it was introduced.
So better to remove it.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-06 10:17:34 -07:00
Wang Hai
be239c4d5e NFC: digital: Remove two unused macroes
DIGITAL_NFC_DEP_REQ_RES_TAILROOM is never used after it was introduced.
DIGITAL_NFC_DEP_REQ_RES_HEADROOM is no more used after below
commit e8e7f42175 ("NFC: digital: Remove useless call to skb_reserve()")
Remove them.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-05 16:01:52 -07:00
Wang Hai
877c347402 caif: Remove duplicate macro SRVL_CTRL_PKT_SIZE
Remove SRVL_CTRL_PKT_SIZE which is defined more than once.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-05 15:57:05 -07:00
J. Bruce Fields
8c6b6c793e SUNRPC: stop printk reading past end of string
Since p points at raw xdr data, there's no guarantee that it's NULL
terminated, so we should give a length.  And probably escape any special
characters too.

Reported-by: Zhi Li <yieli@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-09-05 10:39:41 -04:00
Linus Lüssing
7dda5b3384 batman-adv: mcast/TT: fix wrongly dropped or rerouted packets
The unicast packet rerouting code makes several assumptions. For
instance it assumes that there is always exactly one destination in the
TT. This breaks for multicast frames in a unicast packets in several ways:

For one thing if there is actually no TT entry and the destination node
was selected due to the multicast tvlv flags it announced. Then an
intermediate node will wrongly drop the packet.

For another thing if there is a TT entry but the TTVN of this entry is
newer than the originally addressed destination node: Then the
intermediate node will wrongly redirect the packet, leading to
duplicated multicast packets at a multicast listener and missing
packets at other multicast listeners or multicast routers.

Fixing this by not applying the unicast packet rerouting to batman-adv
unicast packets with a multicast payload. We are not able to detect a
roaming multicast listener at the moment and will just continue to send
the multicast frame to both the new and old destination for a while in
case of such a roaming multicast listener.

Fixes: a73105b8d4 ("batman-adv: improved client announcement mechanism")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-09-05 08:45:46 +02:00
Cong Wang
cc8e58f832 act_ife: load meta modules before tcf_idr_check_alloc()
The following deadlock scenario is triggered by syzbot:

Thread A:				Thread B:
tcf_idr_check_alloc()
...
populate_metalist()
  rtnl_unlock()
					rtnl_lock()
					...
  request_module()			tcf_idr_check_alloc()
  rtnl_lock()

At this point, thread A is waiting for thread B to release RTNL
lock, while thread B is waiting for thread A to commit the IDR
change with tcf_idr_insert() later.

Break this deadlock situation by preloading ife modules earlier,
before tcf_idr_check_alloc(), this is fine because we only need
to load modules we need potentially.

Reported-and-tested-by: syzbot+80e32b5d1f9923f8ace6@syzkaller.appspotmail.com
Fixes: 0190c1d452 ("net: sched: atomically check-allocate action")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-04 22:08:59 -07:00
Jakub Kicinski
44a8c4f33c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
We got slightly different patches removing a double word
in a comment in net/ipv4/raw.c - picked the version from net.

Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached
values instead of VNIC login response buffer (following what
commit 507ebe6444 ("ibmvnic: Fix use-after-free of VNIC login
response buffer") did).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-09-04 21:28:59 -07:00
Or Cohen
acf69c9462 net/packet: fix overflow in tpacket_rcv
Using tp_reserve to calculate netoff can overflow as
tp_reserve is unsigned int and netoff is unsigned short.

This may lead to macoff receving a smaller value then
sizeof(struct virtio_net_hdr), and if po->has_vnet_hdr
is set, an out-of-bounds write will occur when
calling virtio_net_hdr_from_skb.

The bug is fixed by converting netoff to unsigned int
and checking if it exceeds USHRT_MAX.

This addresses CVE-2020-14386

Fixes: 8913336a7e ("packet: add PACKET_RESERVE sockopt")
Signed-off-by: Or Cohen <orcohen@paloaltonetworks.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-09-04 11:56:02 -07:00
Linus Torvalds
3e8d3bdc2a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Use netif_rx_ni() when necessary in batman-adv stack, from Jussi
    Kivilinna.

 2) Fix loss of RTT samples in rxrpc, from David Howells.

 3) Memory leak in hns_nic_dev_probe(), from Dignhao Liu.

 4) ravb module cannot be unloaded, fix from Yuusuke Ashizuka.

 5) We disable BH for too lokng in sctp_get_port_local(), add a
    cond_resched() here as well, from Xin Long.

 6) Fix memory leak in st95hf_in_send_cmd, from Dinghao Liu.

 7) Out of bound access in bpf_raw_tp_link_fill_link_info(), from
    Yonghong Song.

 8) Missing of_node_put() in mt7530 DSA driver, from Sumera
    Priyadarsini.

 9) Fix crash in bnxt_fw_reset_task(), from Michael Chan.

10) Fix geneve tunnel checksumming bug in hns3, from Yi Li.

11) Memory leak in rxkad_verify_response, from Dinghao Liu.

12) In tipc, don't use smp_processor_id() in preemptible context. From
    Tuong Lien.

13) Fix signedness issue in mlx4 memory allocation, from Shung-Hsi Yu.

14) Missing clk_disable_prepare() in gemini driver, from Dan Carpenter.

15) Fix ABI mismatch between driver and firmware in nfp, from Louis
    Peens.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (110 commits)
  net/smc: fix sock refcounting in case of termination
  net/smc: reset sndbuf_desc if freed
  net/smc: set rx_off for SMCR explicitly
  net/smc: fix toleration of fake add_link messages
  tg3: Fix soft lockup when tg3_reset_task() fails.
  doc: net: dsa: Fix typo in config code sample
  net: dp83867: Fix WoL SecureOn password
  nfp: flower: fix ABI mismatch between driver and firmware
  tipc: fix shutdown() of connectionless socket
  ipv6: Fix sysctl max for fib_multipath_hash_policy
  drivers/net/wan/hdlc: Change the default of hard_header_len to 0
  net: gemini: Fix another missing clk_disable_unprepare() in probe
  net: bcmgenet: fix mask check in bcmgenet_validate_flow()
  amd-xgbe: Add support for new port mode
  net: usb: dm9601: Add USB ID of Keenetic Plus DSL
  vhost: fix typo in error message
  net: ethernet: mlx4: Fix memory allocation in mlx4_buddy_init()
  pktgen: fix error message with wrong function name
  net: ethernet: ti: am65-cpsw: fix rmii 100Mbit link mode
  cxgb4: fix thermal zone device registration
  ...
2020-09-03 18:50:48 -07:00
Ursula Braun
5fb8642a17 net/smc: fix sock refcounting in case of termination
When an ISM device is removed, all its linkgroups are terminated,
i.e. all the corresponding connections are killed.
Connection killing invokes smc_close_active_abort(), which decreases
the sock refcount for certain states to simulate passive closing.
And it cancels the close worker and has to give up the sock lock for
this timeframe. This opens the door for a passive close worker or a
socket close to run in between. In this case smc_close_active_abort() and
passive close worker resp. smc_release() might do a sock_put for passive
closing. This causes:

[ 1323.315943] refcount_t: underflow; use-after-free.
[ 1323.316055] WARNING: CPU: 3 PID: 54469 at lib/refcount.c:28 refcount_warn_saturate+0xe8/0x130
[ 1323.316069] Kernel panic - not syncing: panic_on_warn set ...
[ 1323.316084] CPU: 3 PID: 54469 Comm: uperf Not tainted 5.9.0-20200826.rc2.git0.46328853ed20.300.fc32.s390x+debug #1
[ 1323.316096] Hardware name: IBM 2964 NC9 702 (z/VM 6.4.0)
[ 1323.316108] Call Trace:
[ 1323.316125]  [<00000000c0d4aae8>] show_stack+0x90/0xf8
[ 1323.316143]  [<00000000c15989b0>] dump_stack+0xa8/0xe8
[ 1323.316158]  [<00000000c0d8344e>] panic+0x11e/0x288
[ 1323.316173]  [<00000000c0d83144>] __warn+0xac/0x158
[ 1323.316187]  [<00000000c1597a7a>] report_bug+0xb2/0x130
[ 1323.316201]  [<00000000c0d36424>] monitor_event_exception+0x44/0xc0
[ 1323.316219]  [<00000000c195c716>] pgm_check_handler+0x1da/0x238
[ 1323.316234]  [<00000000c151844c>] refcount_warn_saturate+0xec/0x130
[ 1323.316280] ([<00000000c1518448>] refcount_warn_saturate+0xe8/0x130)
[ 1323.316310]  [<000003ff801f2e2a>] smc_release+0x192/0x1c8 [smc]
[ 1323.316323]  [<00000000c169f1fa>] __sock_release+0x5a/0xe0
[ 1323.316334]  [<00000000c169f2ac>] sock_close+0x2c/0x40
[ 1323.316350]  [<00000000c1086de0>] __fput+0xb8/0x278
[ 1323.316362]  [<00000000c0db1e0e>] task_work_run+0x76/0xb8
[ 1323.316393]  [<00000000c0d8ab84>] do_exit+0x26c/0x520
[ 1323.316408]  [<00000000c0d8af08>] do_group_exit+0x48/0xc0
[ 1323.316421]  [<00000000c0d8afa8>] __s390x_sys_exit_group+0x28/0x38
[ 1323.316433]  [<00000000c195c32c>] system_call+0xe0/0x2b4
[ 1323.316446] 1 lock held by uperf/54469:
[ 1323.316456]  #0: 0000000044125e60 (&sb->s_type->i_mutex_key#9){+.+.}-{3:3}, at: __sock_release+0x44/0xe0

The patch rechecks sock state in smc_close_active_abort() after
smc_close_cancel_work() to avoid duplicate decrease of sock
refcount for the same purpose.

Fixes: 611b63a127 ("net/smc: cancel tx worker in case of socket aborts")
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-03 16:52:33 -07:00
Ursula Braun
1d8df41d89 net/smc: reset sndbuf_desc if freed
When an SMC connection is created, and there is a problem to
create an RMB or DMB, the previously created send buffer is
thrown away as well including buffer descriptor freeing.
Make sure the connection no longer references the freed
buffer descriptor, otherwise bugs like this are possible:

[71556.835148] =============================================================================
[71556.835168] BUG kmalloc-128 (Tainted: G    B      OE    ): Poison overwritten
[71556.835172] -----------------------------------------------------------------------------

[71556.835179] INFO: 0x00000000d20894be-0x00000000aaef63e9 @offset=2724. First byte 0x0 instead of 0x6b
[71556.835215] INFO: Allocated in __smc_buf_create+0x184/0x578 [smc] age=0 cpu=5 pid=46726
[71556.835234]     ___slab_alloc+0x5a4/0x690
[71556.835239]     __slab_alloc.constprop.0+0x70/0xb0
[71556.835243]     kmem_cache_alloc_trace+0x38e/0x3f8
[71556.835250]     __smc_buf_create+0x184/0x578 [smc]
[71556.835257]     smc_buf_create+0x2e/0xe8 [smc]
[71556.835264]     smc_listen_work+0x516/0x6a0 [smc]
[71556.835275]     process_one_work+0x280/0x478
[71556.835280]     worker_thread+0x66/0x368
[71556.835287]     kthread+0x17a/0x1a0
[71556.835294]     ret_from_fork+0x28/0x2c
[71556.835301] INFO: Freed in smc_buf_create+0xd8/0xe8 [smc] age=0 cpu=5 pid=46726
[71556.835307]     __slab_free+0x246/0x560
[71556.835311]     kfree+0x398/0x3f8
[71556.835318]     smc_buf_create+0xd8/0xe8 [smc]
[71556.835324]     smc_listen_work+0x516/0x6a0 [smc]
[71556.835328]     process_one_work+0x280/0x478
[71556.835332]     worker_thread+0x66/0x368
[71556.835337]     kthread+0x17a/0x1a0
[71556.835344]     ret_from_fork+0x28/0x2c
[71556.835348] INFO: Slab 0x00000000a0744551 objects=51 used=51 fp=0x0000000000000000 flags=0x1ffff00000010200
[71556.835352] INFO: Object 0x00000000563480a1 @offset=2688 fp=0x00000000289567b2

[71556.835359] Redzone 000000006783cde2: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[71556.835363] Redzone 00000000e35b876e: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[71556.835367] Redzone 0000000023074562: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[71556.835372] Redzone 00000000b9564b8c: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[71556.835376] Redzone 00000000810c6362: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[71556.835380] Redzone 0000000065ef52c3: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[71556.835384] Redzone 00000000c5dd6984: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[71556.835388] Redzone 000000004c480f8f: bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb bb  ................
[71556.835392] Object 00000000563480a1: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[71556.835397] Object 000000009c479d06: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[71556.835401] Object 000000006e1dce92: 6b 6b 6b 6b 00 00 00 00 6b 6b 6b 6b 6b 6b 6b 6b  kkkk....kkkkkkkk
[71556.835405] Object 00000000227f7cf8: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[71556.835410] Object 000000009a701215: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[71556.835414] Object 000000003731ce76: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[71556.835418] Object 00000000f7085967: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
[71556.835422] Object 0000000007f99927: 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b a5  kkkkkkkkkkkkkkk.
[71556.835427] Redzone 00000000579c4913: bb bb bb bb bb bb bb bb                          ........
[71556.835431] Padding 00000000305aef82: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[71556.835435] Padding 00000000b1cdd722: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[71556.835438] Padding 00000000c7568199: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[71556.835442] Padding 00000000fad4c4d4: 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a 5a  ZZZZZZZZZZZZZZZZ
[71556.835451] CPU: 0 PID: 47939 Comm: kworker/0:15 Tainted: G    B      OE     5.9.0-rc1uschi+ #54
[71556.835456] Hardware name: IBM 3906 M03 703 (LPAR)
[71556.835464] Workqueue: events smc_listen_work [smc]
[71556.835470] Call Trace:
[71556.835478]  [<00000000d5eaeb10>] show_stack+0x90/0xf8
[71556.835493]  [<00000000d66fc0f8>] dump_stack+0xa8/0xe8
[71556.835499]  [<00000000d61a511c>] check_bytes_and_report+0x104/0x130
[71556.835504]  [<00000000d61a57b2>] check_object+0x26a/0x2e0
[71556.835509]  [<00000000d61a59bc>] alloc_debug_processing+0x194/0x238
[71556.835514]  [<00000000d61a8c14>] ___slab_alloc+0x5a4/0x690
[71556.835519]  [<00000000d61a9170>] __slab_alloc.constprop.0+0x70/0xb0
[71556.835524]  [<00000000d61aaf66>] kmem_cache_alloc_trace+0x38e/0x3f8
[71556.835530]  [<000003ff80549bbc>] __smc_buf_create+0x184/0x578 [smc]
[71556.835538]  [<000003ff8054a396>] smc_buf_create+0x2e/0xe8 [smc]
[71556.835545]  [<000003ff80540c16>] smc_listen_work+0x516/0x6a0 [smc]
[71556.835549]  [<00000000d5f0f448>] process_one_work+0x280/0x478
[71556.835554]  [<00000000d5f0f6a6>] worker_thread+0x66/0x368
[71556.835559]  [<00000000d5f18692>] kthread+0x17a/0x1a0
[71556.835563]  [<00000000d6abf3b8>] ret_from_fork+0x28/0x2c
[71556.835569] INFO: lockdep is turned off.
[71556.835573] FIX kmalloc-128: Restoring 0x00000000d20894be-0x00000000aaef63e9=0x6b

[71556.835577] FIX kmalloc-128: Marking all objects used

Fixes: fd7f3a7465 ("net/smc: remove freed buffer from list")
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-03 16:52:33 -07:00
Ursula Braun
2d2bfeb8c5 net/smc: set rx_off for SMCR explicitly
SMC tries to make use of SMCD first. If a problem shows up,
it tries to switch to SMCR. If the SMCD initializing problem shows
up after the SMCD connection has already been initialized, field
rx_off keeps the wrong SMCD value for SMCR, which results in corrupted
data at the receiver.
This patch adds an explicit (re-)setting of field rx_off to zero if the
connection uses SMCR.

Fixes: be244f28d2 ("net/smc: add SMC-D support in data transfer")
Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-03 16:52:33 -07:00
Karsten Graul
fffe83c8c4 net/smc: fix toleration of fake add_link messages
Older SMCR implementations had no link failover support and used one
link only. Because the handshake protocol requires to try the
establishment of a second link the old code sent a fake add_link message
and declined any server response afterwards.
The current code supports multiple links and inspects the received fake
add_link message more closely. To tolerate the fake add_link messages
smc_llc_is_local_add_link() needs an improved check of the message to
be able to separate between locally enqueued and fake add_link messages.
And smc_llc_cli_add_link() needs to check if the provided qp_mtu size is
invalid and reject the add_link request in that case.

Fixes: c48254fa48 ("net/smc: move add link processing for new device into llc layer")
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-03 16:52:33 -07:00
Wei Wang
c107761614 ip: expose inet sockopts through inet_diag
Expose all exisiting inet sockopt bits through inet_diag for debug purpose.
Corresponding changes in iproute2 ss will be submitted to output all
these values.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-03 15:17:28 -07:00
Edward Cree
2adc6edcae ethtool: fix error handling in ethtool_phys_id
If ops->set_phys_id() returned an error, previously we would only break
 out of the inner loop, which neither stopped the outer loop nor returned
 the error to the user (since 'rc' would be overwritten on the next pass
 through the loop).
Thus, rewrite it to use a single loop, so that the break does the right
 thing.  Use u64 for 'count' and 'i' to prevent overflow in case of
 (unreasonably) large values of id.data and n.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-03 14:59:51 -07:00
Tom Parkin
9d319a8e93 l2tp: avoid duplicated code in l2tp_tunnel_closeall
l2tp_tunnel_closeall is called as a part of tunnel shutdown in order to
close all the sessions held by the tunnel.  The code it uses to close a
session duplicates what l2tp_session_delete does.

Rather than duplicating the code, have l2tp_tunnel_closeall call
l2tp_session_delete instead.

This involves a very minor change to locking in l2tp_tunnel_closeall.
Previously, l2tp_tunnel_closeall checked the session "dead" flag while
holding tunnel->hlist_lock.  This allowed for the code to step to the
next session in the list without releasing the lock if the current
session happened to be in the process of closing already.

By calling l2tp_session_delete instead, l2tp_tunnel_closeall must now
drop and regain the hlist lock for each session in the tunnel list.
Given that the likelihood of a session being in the process of closing
when the tunnel is closed, it seems worth this very minor potential
loss of efficiency to avoid duplication of the session delete code.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-03 12:19:03 -07:00
Tom Parkin
45faeff11b l2tp: make magic feather checks more useful
The l2tp tunnel and session structures contain a "magic feather" field
which was originally intended to help trace lifetime bugs in the code.

Since the introduction of the shared kernel refcount code in refcount.h,
and l2tp's porting to those APIs, we are covered by the refcount code's
checks and warnings.  Duplicating those checks in the l2tp code isn't
useful.

However, magic feather checks are still useful to help to detect bugs
stemming from misuse/trampling of the sk_user_data pointer in struct
sock.  The l2tp code makes extensive use of sk_user_data to stash
pointers to the tunnel and session structures, and if another subsystem
overwrites sk_user_data it's important to detect this.

As such, rework l2tp's magic feather checks to focus on validating the
tunnel and session data structures when they're extracted from
sk_user_data.

 * Add a new accessor function l2tp_sk_to_tunnel which contains a magic
   feather check, and is used by l2tp_core and l2tp_ip[6]
 * Comment l2tp_udp_encap_recv which doesn't use this new accessor function
   because of the specific nature of the codepath it is called in
 * Drop l2tp_session_queue_purge's check on the session magic feather:
   it is called from code which is walking the tunnel session list, and
   hence doesn't need validation
 * Drop l2tp_session_free's check on the tunnel magic feather: the
   intention of this check is covered by refcount.h's reference count
   sanity checking
 * Add session magic validation in pppol2tp_ioctl.  On failure return
   -EBADF, which mirrors the approach in pppol2tp_[sg]etsockopt.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-03 12:19:03 -07:00
Tom Parkin
de68b039e9 l2tp: capture more tx errors in data plane stats
l2tp_xmit_skb has a number of failure paths which are not reflected in
the tunnel and session statistics because the stats are updated by
l2tp_xmit_core.  Hence any errors occurring before l2tp_xmit_core is
called are missed from the statistics.

Refactor the transmit path slightly to capture all error paths.

l2tp_xmit_skb now leaves all the actual work of transmission to
l2tp_xmit_core, and updates the statistics based on l2tp_xmit_core's
return code.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-03 12:19:03 -07:00
Tom Parkin
c9ccd4c63c l2tp: drop net argument from l2tp_tunnel_create
The argument is unused, so remove it.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-03 12:19:03 -07:00
Tom Parkin
039bca78cb l2tp: drop data_len argument from l2tp_xmit_core
The data_len argument passed to l2tp_xmit_core is no longer used, so
remove it.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-03 12:19:03 -07:00
Tom Parkin
efe0527882 l2tp: remove header length param from l2tp_xmit_skb
All callers pass the session structure's hdr_len field as the header
length parameter to l2tp_xmit_skb.

Since we're passing a pointer to the session structure to l2tp_xmit_skb
anyway, there's not much point breaking the header length out as a
separate argument.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-03 12:19:03 -07:00
Tetsuo Handa
2a63866c8b tipc: fix shutdown() of connectionless socket
syzbot is reporting hung task at nbd_ioctl() [1], for there are two
problems regarding TIPC's connectionless socket's shutdown() operation.

----------
#include <fcntl.h>
#include <sys/socket.h>
#include <sys/ioctl.h>
#include <linux/nbd.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        const int fd = open("/dev/nbd0", 3);
        alarm(5);
        ioctl(fd, NBD_SET_SOCK, socket(PF_TIPC, SOCK_DGRAM, 0));
        ioctl(fd, NBD_DO_IT, 0); /* To be interrupted by SIGALRM. */
        return 0;
}
----------

One problem is that wait_for_completion() from flush_workqueue() from
nbd_start_device_ioctl() from nbd_ioctl() cannot be completed when
nbd_start_device_ioctl() received a signal at wait_event_interruptible(),
for tipc_shutdown() from kernel_sock_shutdown(SHUT_RDWR) from
nbd_mark_nsock_dead() from sock_shutdown() from nbd_start_device_ioctl()
is failing to wake up a WQ thread sleeping at wait_woken() from
tipc_wait_for_rcvmsg() from sock_recvmsg() from sock_xmit() from
nbd_read_stat() from recv_work() scheduled by nbd_start_device() from
nbd_start_device_ioctl(). Fix this problem by always invoking
sk->sk_state_change() (like inet_shutdown() does) when tipc_shutdown() is
called.

The other problem is that tipc_wait_for_rcvmsg() cannot return when
tipc_shutdown() is called, for tipc_shutdown() sets sk->sk_shutdown to
SEND_SHUTDOWN (despite "how" is SHUT_RDWR) while tipc_wait_for_rcvmsg()
needs sk->sk_shutdown set to RCV_SHUTDOWN or SHUTDOWN_MASK. Fix this
problem by setting sk->sk_shutdown to SHUTDOWN_MASK (like inet_shutdown()
does) when the socket is connectionless.

[1] https://syzkaller.appspot.com/bug?id=3fe51d307c1f0a845485cf1798aa059d12bf18b2

Reported-by: syzbot <syzbot+e36f41d207137b5d12f7@syzkaller.appspotmail.com>
Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-02 15:49:30 -07:00
Ido Schimmel
05d4487197 ipv6: Fix sysctl max for fib_multipath_hash_policy
Cited commit added the possible value of '2', but it cannot be set. Fix
it by adjusting the maximum value to '2'. This is consistent with the
corresponding IPv4 sysctl.

Before:

# sysctl -w net.ipv6.fib_multipath_hash_policy=2
sysctl: setting key "net.ipv6.fib_multipath_hash_policy": Invalid argument
net.ipv6.fib_multipath_hash_policy = 2
# sysctl net.ipv6.fib_multipath_hash_policy
net.ipv6.fib_multipath_hash_policy = 0

After:

# sysctl -w net.ipv6.fib_multipath_hash_policy=2
net.ipv6.fib_multipath_hash_policy = 2
# sysctl net.ipv6.fib_multipath_hash_policy
net.ipv6.fib_multipath_hash_policy = 2

Fixes: d8f74f0975 ("ipv6: Support multipath hashing on inner IP pkts")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-02 15:44:53 -07:00
Magnus Karlsson
83cf5c68d6 xsk: Fix use-after-free in failed shared_umem bind
Fix use-after-free when a shared umem bind fails. The code incorrectly
tried to free the allocated buffer pool both in the bind code and then
later also when the socket was released. Fix this by setting the
buffer pool pointer to NULL after the bind code has freed the pool, so
that the socket release code will not try to free the pool. This is
the same solution as the regular, non-shared umem code path has. This
was missing from the shared umem path.

Fixes: b5aea28dca ("xsk: Add shared umem support between queue ids")
Reported-by: syzbot+5334f62e4d22804e646a@syzkaller.appspotmail.com
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/1599032164-25684-1-git-send-email-magnus.karlsson@intel.com
2020-09-02 23:37:19 +02:00
Gustavo A. R. Silva
1d6fd78a21 xsk: Fix null check on error return path
Currently, dma_map is being checked, when the right object identifier
to be null-checked is dma_map->dma_pages, instead.

Fix this by null-checking dma_map->dma_pages.

Fixes: 921b68692a ("xsk: Enable sharing of dma mappings")
Addresses-Coverity-ID: 1496811 ("Logically dead code")
Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/20200902150750.GA7257@embeddedor
2020-09-02 20:31:50 +02:00
Magnus Karlsson
968be23cea xsk: Fix possible segfault at xskmap entry insertion
Fix possible segfault when entry is inserted into xskmap. This can
happen if the socket is in a state where the umem has been set up, the
Rx ring created but it has yet to be bound to a device. In this case
the pool has not yet been created and we cannot reference it for the
existence of the fill ring. Fix this by removing the whole
xsk_is_setup_for_bpf_map function. Once upon a time, it was used to
make sure that the Rx and fill rings where set up before the driver
could call xsk_rcv, since there are no tests for the existence of
these rings in the data path. But these days, we have a state variable
that we test instead. When it is XSK_BOUND, everything has been set up
correctly and the socket has been bound. So no reason to have the
xsk_is_setup_for_bpf_map function anymore.

Fixes: 7361f9c3d7 ("xsk: Move fill and completion rings to buffer pool")
Reported-by: syzbot+febe51d44243fbc564ee@syzkaller.appspotmail.com
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/1599037569-26690-1-git-send-email-magnus.karlsson@intel.com
2020-09-02 16:52:59 +02:00
Magnus Karlsson
53ea2076d8 xsk: Fix possible segfault in xsk umem diagnostics
Fix possible segfault in the xsk diagnostics code when dumping
information about the umem. This can happen when a umem has been
created, but the socket has not been bound yet. In this case, the xsk
buffer pool does not exist yet and we cannot dump the information
that was moved from the umem to the buffer pool. Fix this by testing
for the existence of the buffer pool and if not there, do not dump any
of that information.

Fixes: c2d3d6a474 ("xsk: Move queue_id, dev and need_wakeup to buffer pool")
Fixes: 7361f9c3d7 ("xsk: Move fill and completion rings to buffer pool")
Reported-by: syzbot+3f04d36b7336f7868066@syzkaller.appspotmail.com
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/1599036743-26454-1-git-send-email-magnus.karlsson@intel.com
2020-09-02 16:49:40 +02:00
Eelco Chaudron
e0afe91443 net: openvswitch: fixes crash if nf_conncount_init() fails
If nf_conncount_init fails currently the dispatched work is not canceled,
causing problems when the timer fires. This change fixes this by not
scheduling the work until all initialization is successful.

Fixes: a65878d6f0 ("net: openvswitch: fixes potential deadlock in dp cleanup code")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Reviewed-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-01 13:23:23 -07:00
David S. Miller
150f29f5e6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-09-01

The following pull-request contains BPF updates for your *net-next* tree.

There are two small conflicts when pulling, resolve as follows:

1) Merge conflict in tools/lib/bpf/libbpf.c between 88a8212028 ("libbpf: Factor
   out common ELF operations and improve logging") in bpf-next and 1e891e513e
   ("libbpf: Fix map index used in error message") in net-next. Resolve by taking
   the hunk in bpf-next:

        [...]
        scn = elf_sec_by_idx(obj, obj->efile.btf_maps_shndx);
        data = elf_sec_data(obj, scn);
        if (!scn || !data) {
                pr_warn("elf: failed to get %s map definitions for %s\n",
                        MAPS_ELF_SEC, obj->path);
                return -EINVAL;
        }
        [...]

2) Merge conflict in drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c between
   9647c57b11 ("xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for
   better performance") in bpf-next and e20f0dbf20 ("net/mlx5e: RX, Add a prefetch
   command for small L1_CACHE_BYTES") in net-next. Resolve the two locations by retaining
   net_prefetch() and taking xsk_buff_dma_sync_for_cpu() from bpf-next. Should look like:

        [...]
        xdp_set_data_meta_invalid(xdp);
        xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
        net_prefetch(xdp->data);
        [...]

We've added 133 non-merge commits during the last 14 day(s) which contain
a total of 246 files changed, 13832 insertions(+), 3105 deletions(-).

The main changes are:

1) Initial support for sleepable BPF programs along with bpf_copy_from_user() helper
   for tracing to reliably access user memory, from Alexei Starovoitov.

2) Add BPF infra for writing and parsing TCP header options, from Martin KaFai Lau.

3) bpf_d_path() helper for returning full path for given 'struct path', from Jiri Olsa.

4) AF_XDP support for shared umems between devices and queues, from Magnus Karlsson.

5) Initial prep work for full BPF-to-BPF call support in libbpf, from Andrii Nakryiko.

6) Generalize bpf_sk_storage map & add local storage for inodes, from KP Singh.

7) Implement sockmap/hash updates from BPF context, from Lorenz Bauer.

8) BPF xor verification for scalar types & add BPF link iterator, from Yonghong Song.

9) Use target's prog type for BPF_PROG_TYPE_EXT prog verification, from Udip Pant.

10) Rework BPF tracing samples to use libbpf loader, from Daniel T. Lee.

11) Fix xdpsock sample to really cycle through all buffers, from Weqaar Janjua.

12) Improve type safety for tun/veth XDP frame handling, from Maciej Żenczykowski.

13) Various smaller cleanups and improvements all over the place.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-01 13:22:59 -07:00
Yutaro Hayakawa
ffa81fa46e net/tls: Implement getsockopt SOL_TLS TLS_RX
Implement the getsockopt SOL_TLS TLS_RX which is currently missing. The
primary usecase is to use it in conjunction with TCP_REPAIR to
checkpoint/restore the TLS record layer state.

TLS connection state usually exists on the user space library. So
basically we can easily extract it from there, but when the TLS
connections are delegated to the kTLS, it is not the case. We need to
have a way to extract the TLS state from the kernel for both of TX and
RX side.

The new TLS_RX getsockopt copies the crypto_info to user in the same
way as TLS_TX does.

We have described use cases in our research work in Netdev 0x14
Transport Workshop [1].

Also, there is an TLS implementation called tlse [2] which supports
TLS connection migration. They have support of kTLS and their code
shows that they are expecting the future support of this option.

[1] https://speakerdeck.com/yutarohayakawa/prism-proxies-without-the-pain
[2] https://github.com/eduardsui/tlse

Signed-off-by: Yutaro Hayakawa <yhayakawa3720@gmail.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-01 11:47:12 -07:00
Leesoo Ahn
355db39110 pktgen: fix error message with wrong function name
Error on calling kthread_create_on_node prints wrong function name,
kernel_thread.

Fixes: 94dcf29a11 ("kthread: use kthread_create_on_node()")
Signed-off-by: Leesoo Ahn <dev@ooseel.net>
Acked-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-01 11:44:53 -07:00
Tonghao Zhang
e6896163b2 net: openvswitch: remove unused keep_flows
keep_flows was introduced by [1], which used as flag to delete flows or not.
When rehashing or expanding the table instance, we will not flush the flows.
Now don't use it anymore, remove it.

[1] - acd051f176
Cc: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-01 11:42:15 -07:00
Tonghao Zhang
df68d64ee3 net: openvswitch: refactor flow free function
Decrease table->count and ufid_count unconditionally,
because we only don't use count or ufid_count to count
when flushing the flows. To simplify the codes, we
remove the "count" argument of table_instance_flow_free.

To avoid a bug when deleting flows in the future, add
WARN_ON in flush flows function.

Cc: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-01 11:42:15 -07:00
Tonghao Zhang
cf3266ad48 net: openvswitch: improve the coding style
Not change the logic, just improve the coding style.

Cc: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-09-01 11:42:15 -07:00
Abhishek Pandit-Subedi
0e9952804e Bluetooth: Clear suspend tasks on unregister
While unregistering, make sure to clear the suspend tasks before
cancelling the work. If the unregister is called during resume from
suspend, this will unnecessarily add 2s to the resume time otherwise.

Fixes: 4e8c36c3b0 (Bluetooth: Fix suspend notifier race)
Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-09-01 13:22:41 +02:00
Yaroslav Bolyukin
144b0a0e60 ipvs: remove dependency on ip6_tables
This dependency was added because ipv6_find_hdr was in iptables specific
code but is no longer required

Fixes: f8f626754e ("ipv6: Move ipv6_find_hdr() out of Netfilter code.")
Fixes: 63dca2c0b0 ("ipvs: Fix faulty IPv6 extension header handling in IPVS")
Signed-off-by: Yaroslav Bolyukin <iam@lach.pw>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-31 23:06:51 +02:00
Miaohe Lin
34e1ec319e net: ipv4: remove unused arg exact_dif in compute_score
The arg exact_dif is not used anymore, remove it. inet_exact_dif_match()
is no longer needed after the above is removed, so remove it too.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 13:08:29 -07:00
Miaohe Lin
3f7d820bad net: ipv6: remove unused arg exact_dif in compute_score
The arg exact_dif is not used anymore, remove it. inet6_exact_dif_match()
is no longer needed after the above is removed, remove it too.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 13:08:10 -07:00
YueHaibing
622a63f6f3 tipc: Remove unused macro TIPC_NACK_INTV
There is no caller in tree any more.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 12:39:07 -07:00
YueHaibing
ff007a9ba2 tipc: Remove unused macro TIPC_FWD_MSG
There is no caller in tree any more.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 12:38:48 -07:00
YueHaibing
b1fd4470cd mptcp: Remove unused macro MPTCP_SAME_STATE
There is no caller in tree any more.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 12:37:40 -07:00
Miaohe Lin
5af68891dc net: clean up codestyle
This is a pure codestyle cleanup patch. No functional change intended.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 12:33:34 -07:00
Miaohe Lin
cbc08a3312 net: Use helper macro IP_MAX_MTU in __ip_append_data()
What 0xFFFF means here is actually the max mtu of a ip packet. Use help
macro IP_MAX_MTU here.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 12:33:16 -07:00
wenxu
a7c978c6c9 openvswitch: using ip6_fragment in ipv6_stub
Using ipv6_stub->ipv6_fragment to avoid the netfilter dependency

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 12:26:39 -07:00
wenxu
1d97898b36 ipv6: add ipv6_fragment hook in ipv6_stub
Add ipv6_fragment to ipv6_stub to avoid calling netfilter when
access ip6_fragment.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 12:26:39 -07:00
Magnus Karlsson
a1132430c2 xsk: Add shared umem support between devices
Add support to share a umem between different devices. This mode
can be invoked with the XDP_SHARED_UMEM bind flag. Previously,
sharing was only supported within the same device. Note that when
sharing a umem between devices, just as in the case of sharing a
umem between queue ids, you need to create a fill ring and a
completion ring and tie them to the socket (with two setsockopts,
one for each ring) before you do the bind with the
XDP_SHARED_UMEM flag. This so that the single-producer
single-consumer semantics of the rings can be upheld.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-13-git-send-email-magnus.karlsson@intel.com
2020-08-31 21:15:04 +02:00
Magnus Karlsson
b5aea28dca xsk: Add shared umem support between queue ids
Add support to share a umem between queue ids on the same
device. This mode can be invoked with the XDP_SHARED_UMEM bind
flag. Previously, sharing was only supported within the same
queue id and device, and you shared one set of fill and
completion rings. However, note that when sharing a umem between
queue ids, you need to create a fill ring and a completion ring
and tie them to the socket before you do the bind with the
XDP_SHARED_UMEM flag. This so that the single-producer
single-consumer semantics can be upheld.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-12-git-send-email-magnus.karlsson@intel.com
2020-08-31 21:15:04 +02:00
Magnus Karlsson
921b68692a xsk: Enable sharing of dma mappings
Enable the sharing of dma mappings by moving them out from the buffer
pool. Instead we put each dma mapped umem region in a list in the umem
structure. If dma has already been mapped for this umem and device, it
is not mapped again and the existing dma mappings are reused.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-9-git-send-email-magnus.karlsson@intel.com
2020-08-31 21:15:04 +02:00
Magnus Karlsson
7f7ffa4e9c xsk: Move addrs from buffer pool to umem
Replicate the addrs pointer in the buffer pool to the umem. This mapping
will be the same for all buffer pools sharing the same umem. In the
buffer pool we leave the addrs pointer for performance reasons.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-8-git-send-email-magnus.karlsson@intel.com
2020-08-31 21:15:04 +02:00
Magnus Karlsson
a5aa8e529e xsk: Move xsk_tx_list and its lock to buffer pool
Move the xsk_tx_list and the xsk_tx_list_lock from the umem to
the buffer pool. This so that we in a later commit can share the
umem between multiple HW queues. There is one xsk_tx_list per
device and queue id, so it should be located in the buffer pool.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-7-git-send-email-magnus.karlsson@intel.com
2020-08-31 21:15:04 +02:00
Magnus Karlsson
c2d3d6a474 xsk: Move queue_id, dev and need_wakeup to buffer pool
Move queue_id, dev, and need_wakeup from the umem to the
buffer pool. This so that we in a later commit can share the umem
between multiple HW queues. There is one buffer pool per dev and
queue id, so these variables should belong to the buffer pool, not
the umem. Need_wakeup is also something that is set on a per napi
level, so there is usually one per device and queue id. So move
this to the buffer pool too.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-6-git-send-email-magnus.karlsson@intel.com
2020-08-31 21:15:04 +02:00
Magnus Karlsson
7361f9c3d7 xsk: Move fill and completion rings to buffer pool
Move the fill and completion rings from the umem to the buffer
pool. This so that we in a later commit can share the umem
between multiple HW queue ids. In this case, we need one fill and
completion ring per queue id. As the buffer pool is per queue id
and napi id this is a natural place for it and one umem
struture can be shared between these buffer pools.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-5-git-send-email-magnus.karlsson@intel.com
2020-08-31 21:15:04 +02:00
Magnus Karlsson
1c1efc2af1 xsk: Create and free buffer pool independently from umem
Create and free the buffer pool independently from the umem. Move
these operations that are performed on the buffer pool from the
umem create and destroy functions to new create and destroy
functions just for the buffer pool. This so that in later commits
we can instantiate multiple buffer pools per umem when sharing a
umem between HW queues and/or devices. We also erradicate the
back pointer from the umem to the buffer pool as this will not
work when we introduce the possibility to have multiple buffer
pools per umem.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-4-git-send-email-magnus.karlsson@intel.com
2020-08-31 21:15:04 +02:00
Magnus Karlsson
c4655761d3 xsk: i40e: ice: ixgbe: mlx5: Rename xsk zero-copy driver interfaces
Rename the AF_XDP zero-copy driver interface functions to better
reflect what they do after the replacement of umems with buffer
pools in the previous commit. Mostly it is about replacing the
umem name from the function names with xsk_buff and also have
them take the a buffer pool pointer instead of a umem. The
various ring functions have also been renamed in the process so
that they have the same naming convention as the internal
functions in xsk_queue.h. This so that it will be clearer what
they do and also for consistency.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-3-git-send-email-magnus.karlsson@intel.com
2020-08-31 21:15:04 +02:00
Magnus Karlsson
1742b3d528 xsk: i40e: ice: ixgbe: mlx5: Pass buffer pool to driver instead of umem
Replace the explicit umem reference passed to the driver in AF_XDP
zero-copy mode with the buffer pool instead. This in preparation for
extending the functionality of the zero-copy mode so that umems can be
shared between queues on the same netdev and also between netdevs. In
this commit, only an umem reference has been added to the buffer pool
struct. But later commits will add other entities to it. These are
going to be entities that are different between different queue ids
and netdevs even though the umem is shared between them.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Link: https://lore.kernel.org/bpf/1598603189-32145-2-git-send-email-magnus.karlsson@intel.com
2020-08-31 21:15:03 +02:00
Johannes Berg
c30a3c957c netlink: policy: correct validation type check
In the policy export for binary attributes I erroneously used
a != NLA_VALIDATE_NONE comparison instead of checking for the
two possible values, which meant that if a validation function
pointer ended up aliasing the min/max as negatives, we'd hit
a warning in nla_get_range_unsigned().

Fix this to correctly check for only the two types that should
be handled here, i.e. range with or without warn-too-long.

Reported-by: syzbot+353df1490da781637624@syzkaller.appspotmail.com
Fixes: 8aa26c575f ("netlink: make NLA_BINARY validation more flexible")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 12:01:15 -07:00
David S. Miller
e9d572d94e Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Do not delete clash entries on reply, let them expire instead,
   from Florian Westphal.

2) Do not report EAGAIN to nfnetlink, otherwise this enters a busy loop.
   Update nfnetlink_unicast() to translate EAGAIN to ENOBUFS.

3) Remove repeated words in code comments, from Randy Dunlap.

4) Several patches for the flowtable selftests, from Fabian Frederick.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-31 11:22:30 -07:00
Tuong Lien
bb8872a1e6 tipc: fix using smp_processor_id() in preemptible
The 'this_cpu_ptr()' is used to obtain the AEAD key' TFM on the current
CPU for encryption, however the execution can be preemptible since it's
actually user-space context, so the 'using smp_processor_id() in
preemptible' has been observed.

We fix the issue by using the 'get/put_cpu_ptr()' API which consists of
a 'preempt_disable()' instead.

Fixes: fc1b6d6de2 ("tipc: introduce TIPC encryption & authentication")
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-30 19:12:17 -07:00
Balazs Scheidler
67407a406d netfilter: nft_socket: add wildcard support
Add NFT_SOCKET_WILDCARD to match to wildcard socket listener.

Signed-off-by: Balazs Scheidler <bazsi77@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-29 13:04:44 +02:00
Florian Westphal
c46172147e netfilter: conntrack: do not auto-delete clash entries on reply
Its possible that we have more than one packet with the same ct tuple
simultaneously, e.g. when an application emits n packets on same UDP
socket from multiple threads.

NAT rules might be applied to those packets. With the right set of rules,
n packets will be mapped to m destinations, where at least two packets end
up with the same destination.

When this happens, the existing clash resolution may merge the skb that
is processed after the first has been received with the identical tuple
already in hash table.

However, its possible that this identical tuple is a NAT_CLASH tuple.
In that case the second skb will be sent, but no reply can be received
since the reply that is processed first removes the NAT_CLASH tuple.

Do not auto-delete, this gives a 1 second window for replies to be passed
back to originator.

Packets that are coming later (udp stream case) will not be affected:
they match the original ct entry, not a NAT_CLASH one.

Also prevent NAT_CLASH entries from getting offloaded.

Fixes: 6a757c07e5 ("netfilter: conntrack: allow insertion of clashing entries")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-29 13:03:06 +02:00
Pablo Neira Ayuso
ee92118355 netfilter: nfnetlink: nfnetlink_unicast() reports EAGAIN instead of ENOBUFS
Frontend callback reports EAGAIN to nfnetlink to retry a command, this
is used to signal that module autoloading is required. Unfortunately,
nlmsg_unicast() reports EAGAIN in case the receiver socket buffer gets
full, so it enters a busy-loop.

This patch updates nfnetlink_unicast() to turn EAGAIN into ENOBUFS and
to use nlmsg_unicast(). Remove the flags field in nfnetlink_unicast()
since this is always MSG_DONTWAIT in the existing code which is exactly
what nlmsg_unicast() passes to netlink_unicast() as parameter.

Fixes: 96518518cc ("netfilter: add nftables")
Reported-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 20:11:58 +02:00
Randy Dunlap
4b7ddc58e6 netfilter: delete repeated words
Drop duplicated words in net/netfilter/ and net/ipv4/netfilter/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Reviewed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 20:11:38 +02:00
YueHaibing
f5143e10a2 netfilter: xt_HMARK: Use ip_is_fragment() helper
Use ip_is_fragment() to simpify code.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 19:55:51 +02:00
Florian Westphal
ff73e7479b netfilter: conntrack: remove unneeded nf_ct_put
We can delay refcount increment until we reassign the existing entry to
the current skb.

A 0 refcount can't happen while the nf_conn object is still in the
hash table and parallel mutations are impossible because we hold the
bucket lock.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 19:51:27 +02:00
Florian Westphal
bc92470413 netfilter: conntrack: add clash resolution stat counter
There is a misconception about what "insert_failed" means.

We increment this even when a clash got resolved, so it might not indicate
a problem.

Add a dedicated counter for clash resolution and only increment
insert_failed if a clash cannot be resolved.

For the old /proc interface, export this in place of an older stat
that got removed a while back.
For ctnetlink, export this with a new attribute.

Also correct an outdated comment that implies we add a duplicate tuple --
we only add the (unique) reply direction.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 19:51:26 +02:00
Florian Westphal
4afc41dfa5 netfilter: conntrack: remove ignore stats
This counter increments when nf_conntrack_in sees a packet that already
has a conntrack attached or when the packet is marked as UNTRACKED.
Neither is an error.

The former is normal for loopback traffic.  The second happens for
certain ICMPv6 packets or when nftables/ip(6)tables rules are in place.

In case someone needs to count UNTRACKED packets, or packets
that are marked as untracked before conntrack_in this can be done with
both nftables and ip(6)tables rules.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 19:51:26 +02:00
Florian Westphal
b1328e54ac netfilter: conntrack: do not increment two error counters at same time
The /proc interface for nf_conntrack displays the "error" counter as
"icmp_error".

It makes sense to not increment "invalid" when failing to handle an icmp
packet since those are special.

For example, its possible for conntrack to see partial and/or fragmented
packets inside icmp errors.  This should be a separate event and not get
mixed with the "invalid" counter.

Likewise, remove the "error" increment for errors from get_l4proto().
After this, the error counter will only increment for errors coming from
icmp(v6) packet handling.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 19:51:25 +02:00
Jose M. Guisado Gomez
7a81575b80 netfilter: nf_tables: add userdata attributes to nft_table
Enables storing userdata for nft_table. Field udata points to user data
and udlen store its length.

Adds new attribute flag NFTA_TABLE_USERDATA

Signed-off-by: Jose M. Guisado Gomez <guigom@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 19:18:48 +02:00
Peilin Ye
c5a8a8498e ipvs: Fix uninit-value in do_ip_vs_set_ctl()
do_ip_vs_set_ctl() is referencing uninitialized stack value when `len` is
zero. Fix it.

Reported-by: syzbot+23b5f9e7caf61d9a3898@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?id=46ebfb92a8a812621a001ef04d90dfa459520fe2
Suggested-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Reviewed-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 19:18:48 +02:00
Michael Zhou
d5608a0578 netfilter: ip6t_NPT: rewrite addresses in ICMPv6 original packet
Detect and rewrite a prefix embedded in an ICMPv6 original packet that was
rewritten by a corresponding DNPT/SNPT rule so it will be recognised by
the host that sent the original packet.

Example

Rules in effect on the 1:2:3:4::/64 + 5:6:7:8::/64 side router:
* SNPT src-pfx 1:2:3:4::/64 dst-pfx 5:6:7:8::/64
* DNPT src-pfx 5:6:7:8::/64 dst-pfx 1:2:3:4::/64

No rules on the 9🅰️b:c::/64 side.

1. 1:2:3:4::1 sends UDP packet to 9🅰️b:c::1
2. Router applies SNPT changing src to 5:6:7:8::ffef::1
3. 9🅰️b:c::1 receives packet with (src 5:6:7:8::ffef::1 dst 9🅰️b:c::1)
	and replies with ICMPv6 port unreachable to 5:6:7:8::ffef::1,
	including original packet (src 5:6:7:8::ffef::1 dst 9🅰️b:c::1)
4. Router forwards ICMPv6 packet with (src 9🅰️b:c::1 dst 5:6:7:8::ffef::1)
	including original packet (src 5:6:7:8::ffef::1 dst 9🅰️b:c::1)
	and applies DNPT changing dst to 1:2:3:4::1
5. 1:2:3:4::1 receives ICMPv6 packet with (src 9🅰️b:c::1 dst 1:2:3:4::1)
	including original packet (src 5:6:7:8::ffef::1 dst 9🅰️b:c::1).
	It doesn't recognise the original packet as the src doesn't
	match anything it originally sent

With this change, at step 4, DNPT will also rewrite the original packet
src to 1:2:3:4::1, so at step 5, 1:2:3:4::1 will recognise the ICMPv6
error and provide feedback to the application properly.

Conversely, SNPT will help when ICMPv6 errors are sent from the
translated network.

1. 9🅰️b:c::1 sends UDP packet to 5:6:7:8::ffef::1
2. Router applies DNPT changing dst to 1:2:3:4::1
3. 1:2:3:4::1 receives packet with (src 9🅰️b:c::1 dst 1:2:3:4::1)
	and replies with ICMPv6 port unreachable to 9🅰️b:c::1
	including original packet (src 9🅰️b:c::1 dst 1:2:3:4::1)
4. Router forwards ICMPv6 packet with (src 1:2:3:4::1 dst 9🅰️b:c::1)
	including original packet (src 9🅰️b:c::1 dst 1:2:3:4::1)
	and applies SNPT changing src to 5:6:7:8::ffef::1
5. 9🅰️b:c::1 receives ICMPv6 packet with
	(src 5:6:7:8::ffef::1 dst 9🅰️b:c::1) including
	original packet (src 9🅰️b:c::1 dst 1:2:3:4::1).
	It doesn't recognise the original packet as the dst doesn't
	match anything it already sent

The change to SNPT means the ICMPv6 original packet dst will be
rewritten to 5:6:7:8::ffef::1 in step 4, allowing the error to be
properly recognised in step 5.

Signed-off-by: Michael Zhou <mzhou@cse.unsw.edu.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-28 19:18:48 +02:00
Alex Dewar
0f091e4331 netlabel: remove unused param from audit_log_format()
Commit d3b990b7f3 ("netlabel: fix problems with mapping removal")
added a check to return an error if ret_val != 0, before ret_val is
later used in a log message. Now it will unconditionally print "...
res=1". So just drop the check.

Addresses-Coverity: ("Dead code")
Fixes: d3b990b7f3 ("netlabel: fix problems with mapping removal")
Signed-off-by: Alex Dewar <alex.dewar90@gmail.com>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-28 09:08:51 -07:00
Cong Wang
5438dd4583 net_sched: fix error path in red_init()
When ->init() fails, ->destroy() is called to clean up.
So it is unnecessary to clean up in red_init(), and it
would cause some refcount underflow.

Fixes: aee9caa03f ("net: sched: sch_red: Add qevents "early_drop" and "mark"")
Reported-and-tested-by: syzbot+b33c1cb0a30ebdc8a5f9@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+e5ea5f8a3ecfd4427a1c@syzkaller.appspotmail.com
Cc: Petr Machata <petrm@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-28 07:16:46 -07:00
Mahesh Bandewar
316cdaa115 net: add option to not create fall-back tunnels in root-ns as well
The sysctl that was added  earlier by commit 79134e6ce2 ("net: do
not create fallback tunnels for non-default namespaces") to create
fall-back only in root-ns. This patch enhances that behavior to provide
option not to create fallback tunnels in root-ns as well. Since modules
that create fallback tunnels could be built-in and setting the sysctl
value after booting is pointless, so added a kernel cmdline options to
change this default. The default setting is preseved for backward
compatibility. The kernel command line option of fb_tunnels=initns will
set the sysctl value to 1 and will create fallback tunnels only in initns
while kernel cmdline fb_tunnels=none will set the sysctl value to 2 and
fallback tunnels are skipped in every netns.

Signed-off-by: Mahesh Bandewar <maheshb@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Maciej Zenczykowski <maze@google.com>
Cc: Jian Yang <jianyang@google.com>
Cc: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-28 06:52:44 -07:00
zhudi
174bce38ca netlink: fix a data race in netlink_rcv_wake()
The data races were reported by KCSAN:
BUG: KCSAN: data-race in netlink_recvmsg / skb_queue_tail

write (marked) to 0xffff8c0986e5a8c8 of 8 bytes by interrupt on cpu 3:
 skb_queue_tail+0xcc/0x120
 __netlink_sendskb+0x55/0x80
 netlink_broadcast_filtered+0x465/0x7e0
 nlmsg_notify+0x8f/0x120
 rtnl_notify+0x8e/0xb0
 __neigh_notify+0xf2/0x120
 neigh_update+0x927/0xde0
 arp_process+0x8a3/0xf50
 arp_rcv+0x27c/0x3b0
 __netif_receive_skb_core+0x181c/0x1840
 __netif_receive_skb+0x38/0xf0
 netif_receive_skb_internal+0x77/0x1c0
 napi_gro_receive+0x1bd/0x1f0
 e1000_clean_rx_irq+0x538/0xb20 [e1000]
 e1000_clean+0x5e4/0x1340 [e1000]
 net_rx_action+0x310/0x9d0
 __do_softirq+0xe8/0x308
 irq_exit+0x109/0x110
 do_IRQ+0x7f/0xe0
 ret_from_intr+0x0/0x1d
 0xffffffffffffffff

read to 0xffff8c0986e5a8c8 of 8 bytes by task 1463 on cpu 0:
 netlink_recvmsg+0x40b/0x820
 sock_recvmsg+0xc9/0xd0
 ___sys_recvmsg+0x1a4/0x3b0
 __sys_recvmsg+0x86/0x120
 __x64_sys_recvmsg+0x52/0x70
 do_syscall_64+0xb5/0x360
 entry_SYSCALL_64_after_hwframe+0x65/0xca
 0xffffffffffffffff

Since the write is under sk_receive_queue->lock but the read
is done as lockless. so fix it by using skb_queue_empty_lockless()
instead of skb_queue_empty() for the read in netlink_rcv_wake()

Signed-off-by: zhudi <zhudi21@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-28 06:51:11 -07:00
Martin KaFai Lau
134fede4ee bpf: Relax max_entries check for most of the inner map types
Most of the maps do not use max_entries during verification time.
Thus, those map_meta_equal() do not need to enforce max_entries
when it is inserted as an inner map during runtime.  The max_entries
check is removed from the default implementation bpf_map_meta_equal().

The prog_array_map and xsk_map are exception.  Its map_gen_lookup
uses max_entries to generate inline lookup code.  Thus, they will
implement its own map_meta_equal() to enforce max_entries.
Since there are only two cases now, the max_entries check
is not refactored and stays in its own .c file.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200828011813.1970516-1-kafai@fb.com
2020-08-28 15:41:30 +02:00
Martin KaFai Lau
f4d0525921 bpf: Add map_meta_equal map ops
Some properties of the inner map is used in the verification time.
When an inner map is inserted to an outer map at runtime,
bpf_map_meta_equal() is currently used to ensure those properties
of the inserting inner map stays the same as the verification
time.

In particular, the current bpf_map_meta_equal() checks max_entries which
turns out to be too restrictive for most of the maps which do not use
max_entries during the verification time.  It limits the use case that
wants to replace a smaller inner map with a larger inner map.  There are
some maps do use max_entries during verification though.  For example,
the map_gen_lookup in array_map_ops uses the max_entries to generate
the inline lookup code.

To accommodate differences between maps, the map_meta_equal is added
to bpf_map_ops.  Each map-type can decide what to check when its
map is used as an inner map during runtime.

Also, some map types cannot be used as an inner map and they are
currently black listed in bpf_map_meta_alloc() in map_in_map.c.
It is not unusual that the new map types may not aware that such
blacklist exists.  This patch enforces an explicit opt-in
and only allows a map to be used as an inner map if it has
implemented the map_meta_equal ops.  It is based on the
discussion in [1].

All maps that support inner map has its map_meta_equal points
to bpf_map_meta_equal in this patch.  A later patch will
relax the max_entries check for most maps.  bpf_types.h
counts 28 map types.  This patch adds 23 ".map_meta_equal"
by using coccinelle.  -5 for
	BPF_MAP_TYPE_PROG_ARRAY
	BPF_MAP_TYPE_(PERCPU)_CGROUP_STORAGE
	BPF_MAP_TYPE_STRUCT_OPS
	BPF_MAP_TYPE_ARRAY_OF_MAPS
	BPF_MAP_TYPE_HASH_OF_MAPS

The "if (inner_map->inner_map_meta)" check in bpf_map_meta_alloc()
is moved such that the same error is returned.

[1]: https://lore.kernel.org/bpf/20200522022342.899756-1-kafai@fb.com/

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200828011806.1970400-1-kafai@fb.com
2020-08-28 15:41:30 +02:00
David S. Miller
ae9a138f06 This time we have:
* some code to support SAE (WPA3) offload in AP mode
  * many documentation (wording) fixes/updates
  * netlink policy updates, including the use of NLA_RANGE
    with binary attributes
  * regulatory improvements for adjacent frequency bands
  * and a few other small additions/refactorings/cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl9I2GQACgkQB8qZga/f
 l8R3QhAAkVkBKrjRG26/LILGfeznpqIXgZW9jaUA6KhXHrd5hXEu2aXb+iBIETK6
 CUIKcEIsuO8Uvonu0q0BbAxinVdXIZdxdmY283z4zOS3ngh5ia9GPeNoh50qbEf3
 v6h2qitqRLzIuTDnFneK8UQe5WtFWcuyYKaJjr1YjYZFBkTkoUVTQ2AA8BHKA/Qa
 Ng6PgSx5fo5uOztFGJOvF0y2LMvoSgGGggn6ODKYIrwNI+95iE2W9nG1nUVSzy7e
 N6Codaqnh4/TnyQABA1REl7OPHe5ZJrApytS3H/pqqsLfMcYByGC3ebThesNDwzM
 ptLGhAbDZq7f1vmYMwkUowha8iOD0LzibDBIl8235Z3oIDrgvwGRonERZnmo+aCa
 vEQS8MFmseCrsxheicrzo5FWLc4rt9fU5nC4IgpZLZozU8ZFP+rKo7haHpO7a69k
 xED3SYoB461y+5TQgNPWSeSjvv3P0LjoS29xHAVsu0pp2Gf/cKg75bPJLbsrTYlX
 oKUMF3djN0e9+LOvtuuQYvaZfilpsd/6nBbz5r3sEeRXkJkUcOsGkgVTwX4VtsNT
 oPGjF2yOmj/r6J1blOZYxnSWiyWLYk3r8Hkym8bp3esJhJbJKgTDziJm+g0kGiJt
 FGTlQVGNKRdHRRzgMTOSGonI2nmaE5CWSJCHcSSMX8rfZFgT1ng=
 =Ys45
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2020-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
This time we have:
 * some code to support SAE (WPA3) offload in AP mode
 * many documentation (wording) fixes/updates
 * netlink policy updates, including the use of NLA_RANGE
   with binary attributes
 * regulatory improvements for adjacent frequency bands
 * and a few other small additions/refactorings/cleanups
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-28 06:17:40 -07:00
David S. Miller
51458c9705 We have:
* fixes for AQL (airtime queue limits)
  * reduce packet loss detection false positives
  * a small channel number fix for the 6 GHz band
  * a fix for 80+80/160 MHz negotiation
  * an nl80211 attribute (NL80211_ATTR_HE_6GHZ_CAPABILITY) fix
  * add a missing sanity check for the regulatory code
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl9I10UACgkQB8qZga/f
 l8TVrQ/8CNCAhEwaGVj4SxjJAlma7vnMffxR5BXboPsWm3fVzknJnLvlLyTpXJbH
 lXadT3fv4M9dxKFIqfJiak7LK4c8rv64xeYXelZWVq43S1DjVV3mZOaTehOPbxtM
 Wo+d9mUkSH7PgbgipHsR01/fdFCEwu0uCqqG6syOj1+f3Pcvaofwv8vZyhtWwkTk
 ZchvdtU2Q/Nh0SEaIKg6esNogfPBtgSrCX7CV/uNbfeyISL5iSeU6DZeZZLDdcD9
 b66PNs+DFAI2XIr/2hmHWtkVOUIhAY9EQfICaBvUZryyvFDR9ycSO5khginZP47/
 88gjJ2loYt9M+p00/FcYXkNzxwdo9tppHQ7XmZi4Ciwt7ADuvDlTdehS/ix93HkS
 5hC5UOz8GNy1XZjayqzSVn3AwWuEncRM67NQo3J6qwS1vnF4NKi7rdwSDO49OX3V
 aDAXGiALo2V9ZgbtOOlSbUl1y107mPtemkmdAOcMvXCLXFYixUeUgULTvl4S72fk
 RVxb0xp7lJ0T5hFXGGcTO/MV9X5Rg/d/Y1AKhCy3Wt2bdNk0K5JOcblnrXp8dz83
 qveiscDy4B6NEnaPOffS5z/7E5wXIN77L9TYKOnQ/v8Gu+r15imyzcgz/s8EvRmJ
 rWhB45SUhc5CBIjTFa4iR8M8u1Wc/a+I5Zht39VRTCLtDlOADZw=
 =zMP7
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2020-08-28' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
We have:
 * fixes for AQL (airtime queue limits)
 * reduce packet loss detection false positives
 * a small channel number fix for the 6 GHz band
 * a fix for 80+80/160 MHz negotiation
 * an nl80211 attribute (NL80211_ATTR_HE_6GHZ_CAPABILITY) fix
 * add a missing sanity check for the regulatory code
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-28 06:16:48 -07:00
Dinghao Liu
b43c75abfd rxrpc: Fix memory leak in rxkad_verify_response()
Fix a memory leak in rxkad_verify_response() whereby the response buffer
doesn't get freed if we fail to allocate a ticket buffer.

Fixes: ef68622da9 ("rxrpc: Handle temporary errors better in rxkad security")
Signed-off-by: Dinghao Liu <dinghao.liu@zju.edu.cn>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-27 12:59:45 -07:00
David S. Miller
8d73a73a7f RxRPC fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl9HwPIACgkQ+7dXa6fL
 C2vHdw//S5s+nPNuJUpz3aeYDC1Yl1YP+r2+XBfcCNem504GKfsvTBbLo7FYP6Oc
 TCLJouPxoNSWAtlrJ86YJu8EcFdYNI4w6mCHncIrLXiiIZhsAeUMAw1GiASL5VUr
 QhLjJJU4zUBg3/RWLRC1pUXBXAa3sYx5r1p8j3KdBAkAmAzXdFWSRFeffVeXIk46
 FZ52YoQkplJYqDL+1oQCjJLetVJGGvc68AIOjJOLh6CAjrZbx1aiY79XA+flsoE9
 B+KVjhEn0f9dVybgtF4rEqS88y1FJtSQgul6FOsys+Rx2I0G7ei8PB5TQ2wNCQhy
 37gGfg1AV6apcqXKcHJdovVnApoQzzdCJESCgbMvsczqM88dP7pLWrPz4wpxs655
 3t6qUyXI6yMOJhUWPOoFi30+4NM+MmsqrbYLZt2f/aXRhKnyaeaLd+VAIlkgz3Lj
 sZuuHQsTH0xiR2uCsINWEc7d7UV6WjeVUJ77LzYiiRzEC2pX80tGu5EKHN8W9oBk
 xRAuExXEGtyOR0p3/S3StkT490Tt4bIxUwnKYAaQZEydMCnTlWVbtIXwzmlCbCSB
 p+P/7twR7LiQlHCTJU94jH3Jfpm3jpFpgjQZoZcx6ZLvtSH4+QP11nMY9+sRxEz1
 hpB12AY7Wp2N7P+GhBPuXUCQpzW751aNZz4X9Etu6kRgwjuyaJM=
 =FWHJ
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-fixes-20200820' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc, afs: Fix probing issues

Here are some fixes for rxrpc and afs to fix issues in the RTT measuring in
rxrpc and thence the Volume Location server probing in afs:

 (1) Move the serial number of a received ACK into a local variable to
     simplify the next patch.

 (2) Fix the loss of RTT samples due to extra interposed ACKs causing
     baseline information to be discarded too early.  This is a particular
     problem for afs when it sends a single very short call to probe a
     server it hasn't talked to recently.

 (3) Fix rxrpc_kernel_get_srtt() to indicate whether it actually has seen
     any valid samples or not.

 (4) Remove a field that's set/woken, but never read/waited on.

 (5) Expose the RTT and other probe information through procfs to make
     debugging of this stuff easier.

 (6) Fix VL rotation in afs to only use summary information from VL probing
     and not the probe running state (which gets clobbered when next a
     probe is issued).

 (7) Fix VL rotation to actually return the error aggregated from the probe
     errors.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-27 12:55:46 -07:00
Linus Lüssing
097930e85f batman-adv: bla: fix type misuse for backbone_gw hash indexing
It seems that due to a copy & paste error the void pointer
in batadv_choose_backbone_gw() is cast to the wrong type.

Fixing this by using "struct batadv_bla_backbone_gw" instead of "struct
batadv_bla_claim" which better matches the caller's side.

For now it seems that we were lucky because the two structs both have
their orig/vid and addr/vid in the beginning. However I stumbled over
this issue when I was trying to add some debug variables in front of
"orig" in batadv_backbone_gw, which caused hash lookups to fail.

Fixes: 07568d0369 ("batman-adv: don't rely on positions in struct for hashing")
Signed-off-by: Linus Lüssing <ll@simonwunderlich.de>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
2020-08-27 17:41:45 +02:00
Miaohe Lin
645f08975f net: Fix some comments
Fix some comments, including wrong function name, duplicated word and so
on.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-27 07:55:59 -07:00
Hoang Huu Le
fdeba99b1e tipc: fix use-after-free in tipc_bcast_get_mode
Syzbot has reported those issues as:

==================================================================
BUG: KASAN: use-after-free in tipc_bcast_get_mode+0x3ab/0x400 net/tipc/bcast.c:759
Read of size 1 at addr ffff88805e6b3571 by task kworker/0:6/3850

CPU: 0 PID: 3850 Comm: kworker/0:6 Not tainted 5.8.0-rc7-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events tipc_net_finalize_work

Thread 1's call trace:
[...]
  kfree+0x103/0x2c0 mm/slab.c:3757 <- bcbase releasing
  tipc_bcast_stop+0x1b0/0x2f0 net/tipc/bcast.c:721
  tipc_exit_net+0x24/0x270 net/tipc/core.c:112
[...]

Thread 2's call trace:
[...]
  tipc_bcast_get_mode+0x3ab/0x400 net/tipc/bcast.c:759 <- bcbase
has already been freed by Thread 1

  tipc_node_broadcast+0x9e/0xcc0 net/tipc/node.c:1744
  tipc_nametbl_publish+0x60b/0x970 net/tipc/name_table.c:752
  tipc_net_finalize net/tipc/net.c:141 [inline]
  tipc_net_finalize+0x1fa/0x310 net/tipc/net.c:131
  tipc_net_finalize_work+0x55/0x80 net/tipc/net.c:150
[...]

==================================================================
BUG: KASAN: use-after-free in tipc_named_reinit+0xef/0x290 net/tipc/name_distr.c:344
Read of size 8 at addr ffff888052ab2000 by task kworker/0:13/30628
CPU: 0 PID: 30628 Comm: kworker/0:13 Not tainted 5.8.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events tipc_net_finalize_work
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1f0/0x31e lib/dump_stack.c:118
 print_address_description+0x66/0x5a0 mm/kasan/report.c:383
 __kasan_report mm/kasan/report.c:513 [inline]
 kasan_report+0x132/0x1d0 mm/kasan/report.c:530
 tipc_named_reinit+0xef/0x290 net/tipc/name_distr.c:344
 tipc_net_finalize+0x85/0xe0 net/tipc/net.c:138
 tipc_net_finalize_work+0x50/0x70 net/tipc/net.c:150
 process_one_work+0x789/0xfc0 kernel/workqueue.c:2269
 worker_thread+0xaa4/0x1460 kernel/workqueue.c:2415
 kthread+0x37e/0x3a0 drivers/block/aoe/aoecmd.c:1234
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293
[...]
Freed by task 14058:
 save_stack mm/kasan/common.c:48 [inline]
 set_track mm/kasan/common.c:56 [inline]
 kasan_set_free_info mm/kasan/common.c:316 [inline]
 __kasan_slab_free+0x114/0x170 mm/kasan/common.c:455
 __cache_free mm/slab.c:3426 [inline]
 kfree+0x10a/0x220 mm/slab.c:3757
 tipc_exit_net+0x29/0x50 net/tipc/core.c:113
 ops_exit_list net/core/net_namespace.c:186 [inline]
 cleanup_net+0x708/0xba0 net/core/net_namespace.c:603
 process_one_work+0x789/0xfc0 kernel/workqueue.c:2269
 worker_thread+0xaa4/0x1460 kernel/workqueue.c:2415
 kthread+0x37e/0x3a0 drivers/block/aoe/aoecmd.c:1234
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293

Fix it by calling flush_scheduled_work() to make sure the
tipc_net_finalize_work() stopped before releasing bcbase object.

Reported-by: syzbot+6ea1f7a8df64596ef4d7@syzkaller.appspotmail.com
Reported-by: syzbot+e9cc557752ab126c1b99@syzkaller.appspotmail.com
Acked-by: Jon Maloy <jmaloy@redhat.com>
Signed-off-by: Hoang Huu Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-27 06:59:42 -07:00
Chung-Hsien Hsu
2831a63102 nl80211: support SAE authentication offload in AP mode
Let drivers advertise support for AP-mode SAE authentication offload
with a new NL80211_EXT_FEATURE_SAE_OFFLOAD_AP flag.

Signed-off-by: Chung-Hsien Hsu <stanley.hsu@cypress.com>
Signed-off-by: Chi-Hsien Lin <chi-hsien.lin@cypress.com>
Link: https://lore.kernel.org/r/20200817073316.33402-4-stanley.hsu@cypress.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 15:19:44 +02:00
John Crispin
8552a434b6 mac80211: rename csa counters to countdown counters
We want to reuse the functions and structs for other counters such as BSS
color change. Rename them to more generic names.

Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20200811080107.3615705-2-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 14:12:15 +02:00
John Crispin
00c207edfb nl80211: rename csa counter attributes countdown counters
We want to reuse the attributes for other counters such as BSS color
change. Rename them to more generic names.

Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20200811080107.3615705-1-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 14:12:15 +02:00
Miles Hu
eb89a6a6b7 nl80211: add support for setting fixed HE rate/gi/ltf
This patch adds the nl80211 structs, definitions, policies and parsing
code required to pass fixed HE rate, GI and LTF settings.

Signed-off-by: Miles Hu <milehu@codeaurora.org>
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20200804081630.2013619-1-john@phrozen.org
[fix comment]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 14:12:14 +02:00
Markus Theil
12adee3c46 cfg80211: add helper fn for adjacent rule channels
Some usable channels are located in the union of adjacent
regulatory rules, for example channel 144 in Germany.

Enable them, by also checking if a channel spans two adjacent
regulatory rules/frequency ranges.

All flags involved are disabling things, therefore we can build
the maximum by or-ing them together. Furthermore, take the maximum
of DFS CAC time values and the minimum of allowed power of both
adjacent channels in order to comply with both regulatory rules at
the same time.

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20200803144353.305538-2-markus.theil@tu-ilmenau.de
[remove unrelated comment changes]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 11:27:17 +02:00
Markus Theil
7c9ff7e232 cfg80211: add helper fn for single rule channels
As a preparation to handle adjacent rule channels,
factor out handling channels located in a single
regulatory rule.

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20200803144353.305538-1-markus.theil@tu-ilmenau.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 11:25:45 +02:00
Johannes Berg
c8b8280230 nl80211: use NLA_POLICY_RANGE(NLA_BINARY, ...) for a few attributes
We have a few attributes with minimum and maximum lengths that are
not the same, use the new feature of being able to specify both in
the policy to validate them, removing code and allowing this to be
advertised to userspace in the policy export.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200819085642.8f12ffa14f33.I9d948d59870e521febcd79bb4a986b1de1dca47b@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 11:25:06 +02:00
Johannes Berg
cb9abd48d9 nl80211: clean up code/policy a bit
Use the policy to validate minimum and exact lengths in
some attributes that weren't previously covered in the
right ways, and remove associated validation code.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200805154714.3ba1e233cfa0.I5dc8109b7ab5c3f4ae925f903a30cc9b35753262@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 11:24:56 +02:00
Miaohe Lin
7b506ff6f6 net: wireless: Convert to use the preferred fallthrough macro
Convert the uses of fallthrough comments to fallthrough macro.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Link: https://lore.kernel.org/r/20200822082323.45495-1-linmiaohe@huawei.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 11:24:28 +02:00
Randy Dunlap
eee79f8094 net: wireless: wext_compat.c: delete duplicated word
Drop the repeated word "be".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: linux-wireless@vger.kernel.org
Link: https://lore.kernel.org/r/20200822231953.465-8-rdunlap@infradead.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 11:23:38 +02:00
Randy Dunlap
54f65de004 net: wireless: sme.c: delete duplicated word
Drop the repeated word "is".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: linux-wireless@vger.kernel.org
Link: https://lore.kernel.org/r/20200822231953.465-7-rdunlap@infradead.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 11:23:31 +02:00
Randy Dunlap
8cf5c86d55 net: wireless: scan.c: delete or fix duplicated words
Drop repeated word "stored".
Change "is is" to "it is".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: linux-wireless@vger.kernel.org
Link: https://lore.kernel.org/r/20200822231953.465-6-rdunlap@infradead.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 11:23:26 +02:00
Randy Dunlap
cc5a639b03 net: wireless: reg.c: delete duplicated words + fix punctuation
Drop duplicated words "was" and "does".
Fix "let's" apostrophe.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: linux-wireless@vger.kernel.org
Link: https://lore.kernel.org/r/20200822231953.465-5-rdunlap@infradead.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 11:23:21 +02:00
Randy Dunlap
b42c8edfdb net: wireless: delete duplicated word + fix grammar
Drop the repeated word "Return" + fix verb.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: linux-wireless@vger.kernel.org
Link: https://lore.kernel.org/r/20200822231953.465-4-rdunlap@infradead.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 11:23:16 +02:00
Randy Dunlap
13880a3b55 net: mac80211: mesh.h: delete duplicated word
Drop the repeated word "address".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: linux-wireless@vger.kernel.org
Link: https://lore.kernel.org/r/20200822231953.465-3-rdunlap@infradead.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 11:23:11 +02:00
Randy Dunlap
39f774e78d net: mac80211: agg-rx.c: fix duplicated words
Change "If if" to "If it".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: Kalle Valo <kvalo@codeaurora.org>
Cc: linux-wireless@vger.kernel.org
Link: https://lore.kernel.org/r/20200822231953.465-2-rdunlap@infradead.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 11:23:08 +02:00
Amar Singhal
2d9b555085 cfg80211: Adjust 6 GHz frequency to channel conversion
Adjust the 6 GHz frequency to channel conversion function,
the other way around was previously handled.

Signed-off-by: Amar Singhal <asinghal@codeaurora.org>
Link: https://lore.kernel.org/r/1592599921-10607-1-git-send-email-asinghal@codeaurora.org
[rewrite commit message, hard-code channel 2]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 10:53:21 +02:00
Felix Fietkau
47df8e059b mac80211: reduce packet loss event false positives
When running a large number of packets per second with a high data rate
and long A-MPDUs, the packet loss threshold can be reached very quickly
when the link conditions change. This frequently shows up as spurious
disconnects.
Mitigate false positives by using a similar logic for regular stations
as the one being used for TDLS, though with a more aggressive timeout.
Packet loss events are only reported if no ACK was received for a second.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20200808172542.41628-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 10:53:20 +02:00
Johannes Berg
47caf685a6 cfg80211: regulatory: reject invalid hints
Reject invalid hints early in order to not cause a kernel
WARN later if they're restored to or similar.

Reported-by: syzbot+d451401ffd00a60677ee@syzkaller.appspotmail.com
Link: https://syzkaller.appspot.com/bug?extid=d451401ffd00a60677ee
Link: https://lore.kernel.org/r/20200819084648.13956-1-johannes@sipsolutions.net
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 10:53:14 +02:00
Shay Bar
3579994476 wireless: fix wrong 160/80+80 MHz setting
Fix cfg80211_chandef_usable():
consider IEEE80211_VHT_CAP_EXT_NSS_BW when verifying 160/80+80 MHz.

Based on:
"Table 9-272 — Setting of the Supported Channel Width Set subfield and Extended NSS BW
Support subfield at a STA transmitting the VHT Capabilities Information field"
From "Draft P802.11REVmd_D3.0.pdf"

Signed-off-by: Aviad Brikman <aviad.brikman@celeno.com>
Signed-off-by: Shay Bar <shay.bar@celeno.com>
Link: https://lore.kernel.org/r/20200826143139.25976-1-shay.bar@celeno.com
[reformat the code a bit and use u32_get_bits()]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 10:46:53 +02:00
Felix Fietkau
f01cfbaf9b mac80211: improve AQL aggregation estimation for low data rates
Links with low data rates use much smaller aggregates and are much more
sensitive to latency added by bufferbloat.
Tune the assumed aggregation length based on the tx rate duration.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20200821163045.62140-3-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 10:28:41 +02:00
Felix Fietkau
43cd72c589 mac80211: factor out code to look up the average packet length duration for a rate
This will be used to enhance AQL estimated aggregation length

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20200821163045.62140-2-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 10:28:34 +02:00
Felix Fietkau
8ed37e7919 mac80211: use rate provided via status->rate on ieee80211_tx_status_ext for AQL
Since ieee80211_tx_info does not have enough room to encode HE rates, HE
drivers use status->rate to provide rate info.
Store it in struct sta_info and use it for AQL.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20200821163045.62140-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-27 10:28:21 +02:00
Sabrina Dubroca
45a36a18d0 xfrmi: drop ignore_df check before updating pmtu
xfrm interfaces currently test for !skb->ignore_df when deciding
whether to update the pmtu on the skb's dst. Because of this, no pmtu
exception is created when we do something like:

    ping -s 1438 <dest>

By dropping this check, the pmtu exception will be created and the
next ping attempt will work.

Fixes: f203b76d78 ("xfrm: Add virtual xfrm interfaces")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-08-27 08:37:50 +02:00
Jakub Kicinski
96e97bc07e net: disable netpoll on fresh napis
napi_disable() makes sure to set the NAPI_STATE_NPSVC bit to prevent
netpoll from accessing rings before init is complete. However, the
same is not done for fresh napi instances in netif_napi_add(),
even though we expect NAPI instances to be added as disabled.

This causes crashes during driver reconfiguration (enabling XDP,
changing the channel count) - if there is any printk() after
netif_napi_add() but before napi_enable().

To ensure memory ordering is correct we need to use RCU accessors.

Reported-by: Rob Sherwood <rsher@fb.com>
Fixes: 2d8bff1269 ("netpoll: Close race condition between poll_one_napi and napi_disable")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 16:16:39 -07:00
Ido Schimmel
885a3b1579 ipv4: nexthop: Correctly update nexthop group when replacing a nexthop
Each nexthop group contains an indication if it has IPv4 nexthops
('has_v4'). Its purpose is to prevent IPv6 routes from using groups with
IPv4 nexthops.

However, the indication is not updated when a nexthop is replaced. This
results in the kernel wrongly rejecting IPv6 routes from pointing to
groups that only contain IPv6 nexthops. Example:

# ip nexthop replace id 1 via 192.0.2.2 dev dummy10
# ip nexthop replace id 10 group 1
# ip nexthop replace id 1 via 2001:db8:1::2 dev dummy10
# ip route replace 2001:db8:10::/64 nhid 10
Error: IPv6 routes can not use an IPv4 nexthop.

Solve this by iterating over all the nexthop groups that the replaced
nexthop is a member of and potentially update their IPv4 indication
according to the new set of member nexthops.

Avoid wasting cycles by only performing the update in case an IPv4
nexthop is replaced by an IPv6 nexthop.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 16:00:51 -07:00
Ido Schimmel
863b25581c ipv4: nexthop: Correctly update nexthop group when removing a nexthop
Each nexthop group contains an indication if it has IPv4 nexthops
('has_v4'). Its purpose is to prevent IPv6 routes from using groups with
IPv4 nexthops.

However, the indication is not updated when a nexthop is removed. This
results in the kernel wrongly rejecting IPv6 routes from pointing to
groups that only contain IPv6 nexthops. Example:

# ip nexthop replace id 1 via 192.0.2.2 dev dummy10
# ip nexthop replace id 2 via 2001:db8:1::2 dev dummy10
# ip nexthop replace id 10 group 1/2
# ip nexthop del id 1
# ip route replace 2001:db8:10::/64 nhid 10
Error: IPv6 routes can not use an IPv4 nexthop.

Solve this by updating the indication according to the new set of
member nexthops.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 16:00:51 -07:00
Ido Schimmel
233c63785c ipv4: nexthop: Remove unnecessary rtnl_dereference()
The pointer is not RCU protected, so remove the unnecessary
rtnl_dereference(). This suppresses the following warning:

net/ipv4/nexthop.c:1101:24: error: incompatible types in comparison expression (different address spaces):
net/ipv4/nexthop.c:1101:24:    struct rb_node [noderef] __rcu *
net/ipv4/nexthop.c:1101:24:    struct rb_node *

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 16:00:51 -07:00
Ido Schimmel
33d80996b8 ipv4: nexthop: Use nla_put_be32() for NHA_GATEWAY
The code correctly uses nla_get_be32() to get the payload of the
attribute, but incorrectly uses nla_put_u32() to add the attribute to
the payload. This results in the following warning:

net/ipv4/nexthop.c:279:59: warning: incorrect type in argument 3 (different base types)
net/ipv4/nexthop.c:279:59:    expected unsigned int [usertype] value
net/ipv4/nexthop.c:279:59:    got restricted __be32 [usertype] ipv4

Suppress the warning by using nla_put_be32().

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 16:00:51 -07:00
Ido Schimmel
d7d49dc77c ipv4: nexthop: Reduce allocation size of 'struct nh_group'
The struct looks as follows:

struct nh_group {
	struct nh_group		*spare; /* spare group for removals */
	u16			num_nh;
	bool			mpath;
	bool			fdb_nh;
	bool			has_v4;
	struct nh_grp_entry	nh_entries[];
};

But its offset within 'struct nexthop' is also taken into account to
determine the allocation size.

Instead, use struct_size() to allocate only the required number of
bytes.

Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 16:00:51 -07:00
Ido Schimmel
7f6f32bb7d ipv4: Silence suspicious RCU usage warning
fib_info_notify_update() is always called with RTNL held, but not from
an RCU read-side critical section. This leads to the following warning
[1] when the FIB table list is traversed with
hlist_for_each_entry_rcu(), but without a proper lockdep expression.

Since modification of the list is protected by RTNL, silence the warning
by adding a lockdep expression which verifies RTNL is held.

[1]
 =============================
 WARNING: suspicious RCU usage
 5.9.0-rc1-custom-14233-g2f26e122d62f #129 Not tainted
 -----------------------------
 net/ipv4/fib_trie.c:2124 RCU-list traversed in non-reader section!!

 other info that might help us debug this:

 rcu_scheduler_active = 2, debug_locks = 1
 1 lock held by ip/834:
  #0: ffffffff85a3b6b0 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x49a/0xbd0

 stack backtrace:
 CPU: 0 PID: 834 Comm: ip Not tainted 5.9.0-rc1-custom-14233-g2f26e122d62f #129
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
 Call Trace:
  dump_stack+0x100/0x184
  lockdep_rcu_suspicious+0x143/0x14d
  fib_info_notify_update+0x8d1/0xa60
  __nexthop_replace_notify+0xd2/0x290
  rtm_new_nexthop+0x35e2/0x5946
  rtnetlink_rcv_msg+0x4f7/0xbd0
  netlink_rcv_skb+0x17a/0x480
  rtnetlink_rcv+0x22/0x30
  netlink_unicast+0x5ae/0x890
  netlink_sendmsg+0x98a/0xf40
  ____sys_sendmsg+0x879/0xa00
  ___sys_sendmsg+0x122/0x190
  __sys_sendmsg+0x103/0x1d0
  __x64_sys_sendmsg+0x7d/0xb0
  do_syscall_64+0x32/0x50
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7fde28c3be57
 Code: 0c 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51
c3 48 83 ec 28 89 54 24 1c 48 89 74 24 10
RSP: 002b:00007ffc09330028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fde28c3be57
RDX: 0000000000000000 RSI: 00007ffc09330090 RDI: 0000000000000003
RBP: 000000005f45f911 R08: 0000000000000001 R09: 00007ffc0933012c
R10: 0000000000000076 R11: 0000000000000246 R12: 0000000000000001
R13: 00007ffc09330290 R14: 00007ffc09330eee R15: 00005610e48ed020

Fixes: 1bff1a0c9b ("ipv4: Add function to send route updates")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 15:58:48 -07:00
Florian Westphal
1cec170d45 mptcp: free acked data before waiting for more memory
After subflow lock is dropped, more wmem might have been made available.

This fixes a deadlock in mptcp_connect.sh 'mmap' mode: wmem is exhausted.
But as the mptcp socket holds on to already-acked data (for retransmit)
no wakeup will occur.

Using 'goto restart' calls mptcp_clean_una(sk) which will free pages
that have been acked completely in the mean time.

Fixes: fb529e62d3 ("mptcp: break and restart in case mptcp sndbuf is full")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 15:48:44 -07:00
Vinicius Costa Gomes
09e31cf0c5 taprio: Fix using wrong queues in gate mask
Since commit 9c66d15646 ("taprio: Add support for hardware
offloading") there's a bit of inconsistency when offloading schedules
to the hardware:

In software mode, the gate masks are specified in terms of traffic
classes, so if say "sched-entry S 03 20000", it means that the traffic
classes 0 and 1 are open for 20us; when taprio is offloaded to
hardware, the gate masks are specified in terms of hardware queues.

The idea here is to fix hardware offloading, so schedules in hardware
and software mode have the same behavior. What's needed to do is to
map traffic classes to queues when applying the offload to the driver.

Fixes: 9c66d15646 ("taprio: Add support for hardware offloading")
Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-26 15:45:11 -07:00
Chuck Lever
5de55ce951 xprtrdma: Release in-flight MRs on disconnect
Dan Aloni reports that when a server disconnects abruptly, a few
memory regions are left DMA mapped. Over time this leak could pin
enough I/O resources to slow or even deadlock an NFS/RDMA client.

I found that if a transport disconnects before pending Send and
FastReg WRs can be posted, the to-be-registered MRs are stranded on
the req's rl_registered list and never released -- since they
weren't posted, there's no Send completion to DMA unmap them.

Reported-by: Dan Aloni <dan@kernelim.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2020-08-26 15:29:21 -04:00
Linus Torvalds
2ac69819ba Fixes:
- Eliminate an oops introduced in v5.8
 - Remove a duplicate #include added by nfsd-5.9
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAl8/4gUACgkQM2qzM29m
 f5cxKBAAp7UjD3YNlLhSviowuOfYpWNjyk1cEQ6hWFA9oVeSfZfU3/axW8uYTHPm
 QZ6ams6gjorP4CXwVkFGFHpTRg4CfVN9g5lKxrcjvELqNllWBhE9UupRgbX3+XBE
 qselRI22M64o2tfDE+tPrDB8w8PwHmqrHwRydXfgiFlHk7nt6xD7NitaJBnPlYPM
 21OBl6mrjLwtRwvX9n5wpy/+bfOTHbGV5VNez0fAfKXggNmRdt/UNROC4doLg4M0
 2khAV3vgx49FRpCPL6SZPcBYd6zfrYOcj3iSf6wpxS5nTb2MifXFqz1MvKRTj863
 gzvSmh7vuf0+EaOAXuLjCD9dURZpuG/k0vJGijOgaSt0+vNQHjIgZ1XRFHQtQCp4
 zPJ/Qyk5k7uajHzcBFuNPUFAkOovH6LRoOzpqGvXhwaxrMPWti0LyyVKidVJrt/d
 EtOKQR+HCN0zAwjadXSPK8Nw1PjMzplkF7TaxXvF2LdO/4vpEZZNoz+if59gRcFY
 65h2++7y+0MCX8l83uUZfs+jQU2aR1w5a0DjVzi86xzJtyhr6gEyTj3Z6L9HIHwW
 dnSpUmoiaCoN0eqxvEBjw0VEPqB806CuiUER0Jdd8k7mPk04fsQ/9+UsYyliSLEG
 N56LFSWLXLHsySa2WkuB/ghzT2/Q0vFoZKXW0KNSD7W4C5XMxi4=
 =czB3
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.9-1' of git://git.linux-nfs.org/projects/cel/cel-2.6

Pull nfs server fixes from Chuck Lever:

 - Eliminate an oops introduced in v5.8

 - Remove a duplicate #include added by nfsd-5.9

* tag 'nfsd-5.9-1' of git://git.linux-nfs.org/projects/cel/cel-2.6:
  SUNRPC: remove duplicate include
  nfsd: fix oops on mixed NFSv4/NFSv3 client access
2020-08-25 18:01:36 -07:00
KP Singh
30897832d8 bpf: Allow local storage to be used from LSM programs
Adds support for both bpf_{sk, inode}_storage_{get, delete} to be used
in LSM programs. These helpers are not used for tracing programs
(currently) as their usage is tied to the life-cycle of the object and
should only be used where the owning object won't be freed (when the
owning object is passed as an argument to the LSM hook). Thus, they
are safer to use in LSM hooks than tracing. Usage of local storage in
tracing programs will probably follow a per function based whitelist
approach.

Since the UAPI helper signature for bpf_sk_storage expect a bpf_sock,
it, leads to a compilation warning for LSM programs, it's also updated
to accept a void * pointer instead.

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-7-kpsingh@chromium.org
2020-08-25 15:00:04 -07:00
KP Singh
450af8d0f6 bpf: Split bpf_local_storage to bpf_sk_storage
A purely mechanical change:

	bpf_sk_storage.c = bpf_sk_storage.c + bpf_local_storage.c
	bpf_sk_storage.h = bpf_sk_storage.h + bpf_local_storage.h

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-5-kpsingh@chromium.org
2020-08-25 15:00:04 -07:00
KP Singh
f836a56e84 bpf: Generalize bpf_sk_storage
Refactor the functionality in bpf_sk_storage.c so that concept of
storage linked to kernel objects can be extended to other objects like
inode, task_struct etc.

Each new local storage will still be a separate map and provide its own
set of helpers. This allows for future object specific extensions and
still share a lot of the underlying implementation.

This includes the changes suggested by Martin in:

  https://lore.kernel.org/bpf/20200725013047.4006241-1-kafai@fb.com/

adding new map operations to support bpf_local_storage maps:

* storages for different kernel objects to optionally have different
  memory charging strategy (map_local_storage_charge,
  map_local_storage_uncharge)
* Functionality to extract the storage pointer from a pointer to the
  owning object (map_owner_storage_ptr)

Co-developed-by: Martin KaFai Lau <kafai@fb.com>

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-4-kpsingh@chromium.org
2020-08-25 15:00:04 -07:00
KP Singh
4cc9ce4e73 bpf: Generalize caching for sk_storage.
Provide the a ability to define local storage caches on a per-object
type basis. The caches and caching indices for different objects should
not be inter-mixed as suggested in:

  https://lore.kernel.org/bpf/20200630193441.kdwnkestulg5erii@kafai-mbp.dhcp.thefacebook.com/

  "Caching a sk-storage at idx=0 of a sk should not stop an
  inode-storage to be cached at the same idx of a inode."

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-3-kpsingh@chromium.org
2020-08-25 15:00:03 -07:00
KP Singh
1f00d375af bpf: Renames in preparation for bpf_local_storage
A purely mechanical change to split the renaming from the actual
generalization.

Flags/consts:

  SK_STORAGE_CREATE_FLAG_MASK	BPF_LOCAL_STORAGE_CREATE_FLAG_MASK
  BPF_SK_STORAGE_CACHE_SIZE	BPF_LOCAL_STORAGE_CACHE_SIZE
  MAX_VALUE_SIZE		BPF_LOCAL_STORAGE_MAX_VALUE_SIZE

Structs:

  bucket			bpf_local_storage_map_bucket
  bpf_sk_storage_map		bpf_local_storage_map
  bpf_sk_storage_data		bpf_local_storage_data
  bpf_sk_storage_elem		bpf_local_storage_elem
  bpf_sk_storage		bpf_local_storage

The "sk" member in bpf_local_storage is also updated to "owner"
in preparation for changing the type to void * in a subsequent patch.

Functions:

  selem_linked_to_sk			selem_linked_to_storage
  selem_alloc				bpf_selem_alloc
  __selem_unlink_sk			bpf_selem_unlink_storage_nolock
  __selem_link_sk			bpf_selem_link_storage_nolock
  selem_unlink_sk			__bpf_selem_unlink_storage
  sk_storage_update			bpf_local_storage_update
  __sk_storage_lookup			bpf_local_storage_lookup
  bpf_sk_storage_map_free		bpf_local_storage_map_free
  bpf_sk_storage_map_alloc		bpf_local_storage_map_alloc
  bpf_sk_storage_map_alloc_check	bpf_local_storage_map_alloc_check
  bpf_sk_storage_map_check_btf		bpf_local_storage_map_check_btf

Signed-off-by: KP Singh <kpsingh@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200825182919.1118197-2-kpsingh@chromium.org
2020-08-25 14:59:58 -07:00
Joe Perches
ca65a280fb sunrpc: Avoid comma separated statements
Use semicolons and braces.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-25 07:54:19 -07:00
Joe Perches
dee847793f ipv6: fib6: Avoid comma separated statements
Use semicolons and braces.

Signed-off-by: Joe Perches <joe@perches.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-25 07:54:19 -07:00
Tong Zhang
e104684108 net: caif: fix error code handling
cfpkt_peek_head return 0 and 1, caller is checking error using <0

Signed-off-by: Tong Zhang <ztong0001@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-25 07:50:25 -07:00
Miaohe Lin
343d8c6014 net: clean up codestyle for net/ipv4
This is a pure codestyle cleanup patch. Also add a blank line after
declarations as warned by checkpatch.pl.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-25 06:28:02 -07:00
Miaohe Lin
fdf1923bf9 net: Remove duplicated midx check against 0
Check midx against 0 is always equal to check midx against sk_bound_dev_if
when sk_bound_dev_if is known not equal to 0 in these case.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-25 06:23:59 -07:00
Miaohe Lin
0ce779a9f5 net: Avoid unnecessary inet_addr_type() call when addr is INADDR_ANY
We can avoid unnecessary inet_addr_type() call by check addr against
INADDR_ANY first.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-25 06:20:10 -07:00
Miaohe Lin
0316a21116 net: Set ping saddr after we successfully get the ping port
We can defer set ping saddr until we successfully get the ping port. So we
can avoid clear saddr when failed. Since ping_clear_saddr() is not used
anymore now, remove it.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-25 06:18:13 -07:00
Miaohe Lin
8b4510d76c net: gain ipv4 mtu when mtu is not locked
When mtu is locked, we should not obtain ipv4 mtu as we return immediately
in this case and leave acquired ipv4 mtu unused.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-25 06:04:39 -07:00
David S. Miller
079f921e9f This cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
 
  - Drop unused function batadv_hardif_remove_interfaces(),
    by Sven Eckelmann
 
  - delete duplicated words, by Randy Dunlap
 
  - Drop (even more) repeated words in comments, by Sven Eckelmann
 
  - Migrate to linux/prandom.h, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl9D6i0WHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoUU7D/sEZhE7ht0F7tf5hyl8AlCaPpS4
 Yd2IwrI1m4Vc8Z0gd4d3MtdCAilGsLVY32PY/+MQBRktvEZ+Ms9CD+MaVJJiCWnU
 OxhBZxNYeE/omP/nPZgwub/kHcIojGd/1XTy6Gyp2IGroOvAB3BA0acLsZsAoFrR
 /AmRLFFajnYgodn0yuWeomDjOaUoe5n4GCpWutlb6bLRA9t1B2YMWtRj5vXJO+Ht
 EIxBw3Gvz/b7eVsbo0VQMpsOju8OM9f6XZw66NnS2ekEepIBrkJzIi8fa97H7hy6
 Ef6XkwmZhLgQQbsylVZSsynkNoB6cVOPDg/k4/BrErYTMnqAoh2U6v8IFOssmTKX
 Od5FBRqcXnRB2+/sPq7+VodlFXTqSriJGdlwCyGyg7dU4FPPdbdwubdJXWMjwe9A
 7yMrty3VRLmVAwd5JuefNekyzEXXf2Pd/HHhDdgnLJ/LkJAgkswpmbjkz8kNZAwp
 69MbFiC027O4s1l8PdiREjfAcMG1SicQ08LhBA8OV2ZUQ7IcYlg0Hq/UQNTEuJni
 yPFnzA5n/KtbY5sJGA+oYHGfQiNLx3xmM/GNl5YpfYrDSg8srcd+GTvwy2Vzzzp1
 2Q7KSxPHVJDdcLPgyUlfTXp27870/aPY0Q+8r5fKjPgq9Fj4ZAuop7FtY8XSzg6R
 3N4lPNKQS38UUPfjOg==
 =1GXQ
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20200824' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - Drop unused function batadv_hardif_remove_interfaces(),
   by Sven Eckelmann

 - delete duplicated words, by Randy Dunlap

 - Drop (even more) repeated words in comments, by Sven Eckelmann

 - Migrate to linux/prandom.h, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 18:18:22 -07:00
David S. Miller
99408c422d Here are some batman-adv bugfixes:
- Avoid uninitialized memory access when handling DHCP, by Sven Eckelmann
 
  - Fix check for own OGM in OGM receive handler, by Linus Luessing
 
  - Fix netif_rx access for non-interrupt context in BLA, by Jussi Kivilinna
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAl9D6JUWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoez3EADD/QxTk5CWF9jU7RnyLEmoRjGe
 GAoJnPbZV7GvyQADOFmrqYMvUv/4Vly8Tj8DBI3Sc8gNFNb3/jrBue/AaqXFKFh0
 GFmh5MoRjzpz3xJRzOn+OxoDaC+ooeXNO3VIHNB76uTrgo1HToWrg5qtoT9KjGVT
 kDr0+os+9MuPh6qQKfnW79wVedYGwuDI1qb4ieHTWEyIU+A82TDYc5Zf3CqcPrlX
 WagIL1pNsMb7/BU9I1br9Q1ccG1cX/Icqaq/XQEiNRWsa0vjb084fKrq9jaQYalB
 AASDVh4aRbfJ1EatZAydX7AbmfDYPPg6THQsJAJphIJ1LBSFcv2+CMyT9Xkj1tGx
 5rvMOAcO37VAv8+qwwH7DXmh3nOm2N9oglIYAGyBU3nf6rDeDntVOpraRMWr8eHt
 h2B+jJ9XSvowPJ4fTM86oqgxTUExYoyWHA1WkkX0VkcGXEKxpaSklSd1BcoSB707
 3ItpxMYgFhLx12uZQ1bsnMuazOSer7rbpiZqmKx40uiVgyBB97d/udbNE0yh8LVM
 fCVwnnjNFsuKu2Qh4asvpKUhryKD/+1NOdWsrKZEJs6WAVmWJh1S3/MmZPpek0lQ
 pVEmf0JBp6yjfJdIgTb90UfoCRAEZ49OWmXelCjdSLx8k14j17geQxQvwcg3VGYN
 EsCTj4Ls2DsHpabhBQ==
 =ikYK
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20200824' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
Here are some batman-adv bugfixes:

 - Avoid uninitialized memory access when handling DHCP, by Sven Eckelmann

 - Fix check for own OGM in OGM receive handler, by Linus Luessing

 - Fix netif_rx access for non-interrupt context in BLA, by Jussi Kivilinna
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 18:16:56 -07:00
Miaohe Lin
373c15c2e9 net: Use helper macro RT_TOS() in __icmp_send()
Use helper macro RT_TOS() to get tos in __icmp_send().

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 18:12:36 -07:00
Miaohe Lin
7551144978 net: Avoid access icmp_err_convert when icmp code is ICMP_FRAG_NEEDED
There is no need to fetch errno and fatal info from icmp_err_convert when
icmp code is ICMP_FRAG_NEEDED.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 18:11:43 -07:00
Randy Dunlap
5463352776 net: dccp: delete repeated words
Drop duplicated words in /net/dccp/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Gerrit Renker <gerrit@erg.abdn.ac.uk>
Cc: dccp@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 17:31:20 -07:00
Randy Dunlap
8540591885 net: netlink: delete repeated words
Drop duplicated words in net/netlink/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 17:31:20 -07:00
Randy Dunlap
2bdcc73c88 net: ipv4: delete repeated words
Drop duplicate words in comments in net/ipv4/.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 17:31:20 -07:00
Randy Dunlap
b8d7a7c62c net: sctp: ulpqueue.c: delete duplicated word
Drop the repeated word "an".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 16:21:43 -07:00
Randy Dunlap
14f45bb7b1 net: sctp: sm_make_chunk.c: delete duplicated words + fix typo
Drop the repeated words "for", "that", and "a".
Change "his" to "this".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 16:21:43 -07:00
Randy Dunlap
93c3216a71 net: sctp: protocol.c: delete duplicated words + punctuation
Drop the repeated words "of" and "that".
Add some punctuation for readability.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 16:21:43 -07:00
Randy Dunlap
9932564f12 net: sctp: chunk.c: delete duplicated word
Drop the repeated word "the".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 16:21:43 -07:00
Randy Dunlap
440d399033 net: sctp: bind_addr.c: delete duplicated word
Drop the repeated word "of".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 16:21:43 -07:00
Randy Dunlap
861e7021ae net: sctp: auth.c: delete duplicated words
Drop the repeated word "the" and "now".

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 16:21:43 -07:00
Randy Dunlap
5e80a0ccbc net: sctp: associola.c: delete duplicated words
Drop the repeated word "the" in two places.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: linux-sctp@vger.kernel.org
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 16:21:43 -07:00
Luke Hsiao
583bbf0624 io_uring: allow tcp ancillary data for __sys_recvmsg_sock()
For TCP tx zero-copy, the kernel notifies the process of completions by
queuing completion notifications on the socket error queue. This patch
allows reading these notifications via recvmsg to support TCP tx
zero-copy.

Ancillary data was originally disallowed due to privilege escalation
via io_uring's offloading of sendmsg() onto a kernel thread with kernel
credentials (https://crbug.com/project-zero/1975). So, we must ensure
that the socket type is one where the ancillary data types that are
delivered on recvmsg are plain data (no file descriptors or values that
are translated based on the identity of the calling process).

This was tested by using io_uring to call recvmsg on the MSG_ERRQUEUE
with tx zero-copy enabled. Before this patch, we received -EINVALID from
this specific code path. After this patch, we could read tcp tx
zero-copy completion notifications from the MSG_ERRQUEUE.

Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Arjun Roy <arjunroy@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jann Horn <jannh@google.com>
Reviewed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Luke Hsiao <lukehsiao@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 16:16:06 -07:00
Herbert Xu
be769db2f9 net: Get rid of consume_skb when tracing is off
The function consume_skb is only meaningful when tracing is enabled.
This patch makes it conditional on CONFIG_TRACEPOINTS.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 16:12:26 -07:00
Paul Moore
d3b990b7f3 netlabel: fix problems with mapping removal
This patch fixes two main problems seen when removing NetLabel
mappings: memory leaks and potentially extra audit noise.

The memory leaks are caused by not properly free'ing the mapping's
address selector struct when free'ing the entire entry as well as
not properly cleaning up a temporary mapping entry when adding new
address selectors to an existing entry.  This patch fixes both these
problems such that kmemleak reports no NetLabel associated leaks
after running the SELinux test suite.

The potentially extra audit noise was caused by the auditing code in
netlbl_domhsh_remove_entry() being called regardless of the entry's
validity.  If another thread had already marked the entry as invalid,
but not removed/free'd it from the list of mappings, then it was
possible that an additional mapping removal audit record would be
generated.  This patch fixes this by returning early from the removal
function when the entry was previously marked invalid.  This change
also had the side benefit of improving the code by decreasing the
indentation level of large chunk of code by one (accounting for most
of the diffstat).

Fixes: 63c4168874 ("netlabel: Add network address selectors to the NetLabel/LSM domain mapping")
Reported-by: Stephen Smalley <stephen.smalley.work@gmail.com>
Signed-off-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 16:08:00 -07:00
Parav Pandit
5d080b5064 devlink: Protect devlink port list traversal
Cited patch in fixes tag misses to protect port list traversal
while traversing per port reporter list.

Protect it using devlink instance lock.

Fixes: f4f5416601 ("devlink: Implement devlink health reporters on per-port basis")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 16:02:47 -07:00
Parav Pandit
79604c5de2 devlink: Fix per port reporter fields initialization
Cited patch in fixes tag initializes reporters_list and reporters_lock
of a devlink port after devlink port is added to the list. Once port
is added to the list, devlink_nl_cmd_health_reporter_get_dumpit()
can access the uninitialized mutex and reporters list head.
Fix it by initializing port reporters field before adding port to the
list.

Fixes: f4f5416601 ("devlink: Implement devlink health reporters on per-port basis")
Signed-off-by: Parav Pandit <parav@nvidia.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 16:02:46 -07:00
Xin Long
3106ecb43a sctp: not disable bh in the whole sctp_get_port_local()
With disabling bh in the whole sctp_get_port_local(), when
snum == 0 and too many ports have been used, the do-while
loop will take the cpu for a long time and cause cpu stuck:

  [ ] watchdog: BUG: soft lockup - CPU#11 stuck for 22s!
  [ ] RIP: 0010:native_queued_spin_lock_slowpath+0x4de/0x940
  [ ] Call Trace:
  [ ]  _raw_spin_lock+0xc1/0xd0
  [ ]  sctp_get_port_local+0x527/0x650 [sctp]
  [ ]  sctp_do_bind+0x208/0x5e0 [sctp]
  [ ]  sctp_autobind+0x165/0x1e0 [sctp]
  [ ]  sctp_connect_new_asoc+0x355/0x480 [sctp]
  [ ]  __sctp_connect+0x360/0xb10 [sctp]

There's no need to disable bh in the whole function of
sctp_get_port_local. So fix this cpu stuck by removing
local_bh_disable() called at the beginning, and using
spin_lock_bh() instead.

The same thing was actually done for inet_csk_get_port() in
Commit ea8add2b19 ("tcp/dccp: better use of ephemeral
ports in bind()").

Thanks to Marcelo for pointing the buggy code out.

v1->v2:
  - use cond_resched() to yield cpu to other tasks if needed,
    as Eric noticed.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Ying Xu <yinxu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 15:50:17 -07:00
Martin KaFai Lau
267cf9fa43 tcp: bpf: Optionally store mac header in TCP_SAVE_SYN
This patch is adapted from Eric's patch in an earlier discussion [1].

The TCP_SAVE_SYN currently only stores the network header and
tcp header.  This patch allows it to optionally store
the mac header also if the setsockopt's optval is 2.

It requires one more bit for the "save_syn" bit field in tcp_sock.
This patch achieves this by moving the syn_smc bit next to the is_mptcp.
The syn_smc is currently used with the TCP experimental option.  Since
syn_smc is only used when CONFIG_SMC is enabled, this patch also puts
the "IS_ENABLED(CONFIG_SMC)" around it like the is_mptcp did
with "IS_ENABLED(CONFIG_MPTCP)".

The mac_hdrlen is also stored in the "struct saved_syn"
to allow a quick offset from the bpf prog if it chooses to start
getting from the network header or the tcp header.

[1]: https://lore.kernel.org/netdev/CANn89iLJNWh6bkH7DNhy_kmcAexuUCccqERqe7z2QsvPhGrYPQ@mail.gmail.com/

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190123.2886935-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau
0813a84156 bpf: tcp: Allow bpf prog to write and parse TCP header option
[ Note: The TCP changes here is mainly to implement the bpf
  pieces into the bpf_skops_*() functions introduced
  in the earlier patches. ]

The earlier effort in BPF-TCP-CC allows the TCP Congestion Control
algorithm to be written in BPF.  It opens up opportunities to allow
a faster turnaround time in testing/releasing new congestion control
ideas to production environment.

The same flexibility can be extended to writing TCP header option.
It is not uncommon that people want to test new TCP header option
to improve the TCP performance.  Another use case is for data-center
that has a more controlled environment and has more flexibility in
putting header options for internal only use.

For example, we want to test the idea in putting maximum delay
ACK in TCP header option which is similar to a draft RFC proposal [1].

This patch introduces the necessary BPF API and use them in the
TCP stack to allow BPF_PROG_TYPE_SOCK_OPS program to parse
and write TCP header options.  It currently supports most of
the TCP packet except RST.

Supported TCP header option:
───────────────────────────
This patch allows the bpf-prog to write any option kind.
Different bpf-progs can write its own option by calling the new helper
bpf_store_hdr_opt().  The helper will ensure there is no duplicated
option in the header.

By allowing bpf-prog to write any option kind, this gives a lot of
flexibility to the bpf-prog.  Different bpf-prog can write its
own option kind.  It could also allow the bpf-prog to support a
recently standardized option on an older kernel.

Sockops Callback Flags:
──────────────────────
The bpf program will only be called to parse/write tcp header option
if the following newly added callback flags are enabled
in tp->bpf_sock_ops_cb_flags:
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG

A few words on the PARSE CB flags.  When the above PARSE CB flags are
turned on, the bpf-prog will be called on packets received
at a sk that has at least reached the ESTABLISHED state.
The parsing of the SYN-SYNACK-ACK will be discussed in the
"3 Way HandShake" section.

The default is off for all of the above new CB flags, i.e. the bpf prog
will not be called to parse or write bpf hdr option.  There are
details comment on these new cb flags in the UAPI bpf.h.

sock_ops->skb_data and bpf_load_hdr_opt()
─────────────────────────────────────────
sock_ops->skb_data and sock_ops->skb_data_end covers the whole
TCP header and its options.  They are read only.

The new bpf_load_hdr_opt() helps to read a particular option "kind"
from the skb_data.

Please refer to the comment in UAPI bpf.h.  It has details
on what skb_data contains under different sock_ops->op.

3 Way HandShake
───────────────
The bpf-prog can learn if it is sending SYN or SYNACK by reading the
sock_ops->skb_tcp_flags.

* Passive side

When writing SYNACK (i.e. sock_ops->op == BPF_SOCK_OPS_WRITE_HDR_OPT_CB),
the received SYN skb will be available to the bpf prog.  The bpf prog can
use the SYN skb (which may carry the header option sent from the remote bpf
prog) to decide what bpf header option should be written to the outgoing
SYNACK skb.  The SYN packet can be obtained by getsockopt(TCP_BPF_SYN*).
More on this later.  Also, the bpf prog can learn if it is in syncookie
mode (by checking sock_ops->args[0] == BPF_WRITE_HDR_TCP_SYNACK_COOKIE).

The bpf prog can store the received SYN pkt by using the existing
bpf_setsockopt(TCP_SAVE_SYN).  The example in a later patch does it.
[ Note that the fullsock here is a listen sk, bpf_sk_storage
  is not very useful here since the listen sk will be shared
  by many concurrent connection requests.

  Extending bpf_sk_storage support to request_sock will add weight
  to the minisock and it is not necessary better than storing the
  whole ~100 bytes SYN pkt. ]

When the connection is established, the bpf prog will be called
in the existing PASSIVE_ESTABLISHED_CB callback.  At that time,
the bpf prog can get the header option from the saved syn and
then apply the needed operation to the newly established socket.
The later patch will use the max delay ack specified in the SYN
header and set the RTO of this newly established connection
as an example.

The received ACK (that concludes the 3WHS) will also be available to
the bpf prog during PASSIVE_ESTABLISHED_CB through the sock_ops->skb_data.
It could be useful in syncookie scenario.  More on this later.

There is an existing getsockopt "TCP_SAVED_SYN" to return the whole
saved syn pkt which includes the IP[46] header and the TCP header.
A few "TCP_BPF_SYN*" getsockopt has been added to allow specifying where to
start getting from, e.g. starting from TCP header, or from IP[46] header.

The new getsockopt(TCP_BPF_SYN*) will also know where it can get
the SYN's packet from:
  - (a) the just received syn (available when the bpf prog is writing SYNACK)
        and it is the only way to get SYN during syncookie mode.
  or
  - (b) the saved syn (available in PASSIVE_ESTABLISHED_CB and also other
        existing CB).

The bpf prog does not need to know where the SYN pkt is coming from.
The getsockopt(TCP_BPF_SYN*) will hide this details.

Similarly, a flags "BPF_LOAD_HDR_OPT_TCP_SYN" is also added to
bpf_load_hdr_opt() to read a particular header option from the SYN packet.

* Fastopen

Fastopen should work the same as the regular non fastopen case.
This is a test in a later patch.

* Syncookie

For syncookie, the later example patch asks the active
side's bpf prog to resend the header options in ACK.  The server
can use bpf_load_hdr_opt() to look at the options in this
received ACK during PASSIVE_ESTABLISHED_CB.

* Active side

The bpf prog will get a chance to write the bpf header option
in the SYN packet during WRITE_HDR_OPT_CB.  The received SYNACK
pkt will also be available to the bpf prog during the existing
ACTIVE_ESTABLISHED_CB callback through the sock_ops->skb_data
and bpf_load_hdr_opt().

* Turn off header CB flags after 3WHS

If the bpf prog does not need to write/parse header options
beyond the 3WHS, the bpf prog can clear the bpf_sock_ops_cb_flags
to avoid being called for header options.
Or the bpf-prog can select to leave the UNKNOWN_HDR_OPT_CB_FLAG on
so that the kernel will only call it when there is option that
the kernel cannot handle.

[1]: draft-wang-tcpm-low-latency-opt-00
     https://tools.ietf.org/html/draft-wang-tcpm-low-latency-opt-00

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200820190104.2885895-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau
c9985d09e1 bpf: sock_ops: Change some members of sock_ops_kern from u32 to u8
A later patch needs to add a few pointers and a few u8 to
sock_ops_kern.  Hence, this patch saves some spaces by moving
some of the existing members from u32 to u8 so that the later
patch can still fit everything in a cacheline.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200820190058.2885640-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau
331fca4315 bpf: tcp: Add bpf_skops_hdr_opt_len() and bpf_skops_write_hdr_opt()
The bpf prog needs to parse the SYN header to learn what options have
been sent by the peer's bpf-prog before writing its options into SYNACK.
This patch adds a "syn_skb" arg to tcp_make_synack() and send_synack().
This syn_skb will eventually be made available (as read-only) to the
bpf prog.  This will be the only SYN packet available to the bpf
prog during syncookie.  For other regular cases, the bpf prog can
also use the saved_syn.

When writing options, the bpf prog will first be called to tell the
kernel its required number of bytes.  It is done by the new
bpf_skops_hdr_opt_len().  The bpf prog will only be called when the new
BPF_SOCK_OPS_WRITE_HDR_OPT_CB_FLAG is set in tp->bpf_sock_ops_cb_flags.
When the bpf prog returns, the kernel will know how many bytes are needed
and then update the "*remaining" arg accordingly.  4 byte alignment will
be included in the "*remaining" before this function returns.  The 4 byte
aligned number of bytes will also be stored into the opts->bpf_opt_len.
"bpf_opt_len" is a newly added member to the struct tcp_out_options.

Then the new bpf_skops_write_hdr_opt() will call the bpf prog to write the
header options.  The bpf prog is only called if it has reserved spaces
before (opts->bpf_opt_len > 0).

The bpf prog is the last one getting a chance to reserve header space
and writing the header option.

These two functions are half implemented to highlight the changes in
TCP stack.  The actual codes preparing the bpf running context and
invoking the bpf prog will be added in the later patch with other
necessary bpf pieces.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190052.2885316-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau
00d211a4ea bpf: tcp: Add bpf_skops_parse_hdr()
The patch adds a function bpf_skops_parse_hdr().
It will call the bpf prog to parse the TCP header received at
a tcp_sock that has at least reached the ESTABLISHED state.

For the packets received during the 3WHS (SYN, SYNACK and ACK),
the received skb will be available to the bpf prog during the callback
in bpf_skops_established() introduced in the previous patch and
in the bpf_skops_write_hdr_opt() that will be added in the
next patch.

Calling bpf prog to parse header is controlled by two new flags in
tp->bpf_sock_ops_cb_flags:
BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG and
BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG.

When BPF_SOCK_OPS_PARSE_UNKNOWN_HDR_OPT_CB_FLAG is set,
the bpf prog will only be called when there is unknown
option in the TCP header.

When BPF_SOCK_OPS_PARSE_ALL_HDR_OPT_CB_FLAG is set,
the bpf prog will be called on all received TCP header.

This function is half implemented to highlight the changes in
TCP stack.  The actual codes preparing the bpf running context and
invoking the bpf prog will be added in the later patch with other
necessary bpf pieces.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/bpf/20200820190046.2885054-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau
72be0fe6ba bpf: tcp: Add bpf_skops_established()
In tcp_init_transfer(), it currently calls the bpf prog to give it a
chance to handle the just "ESTABLISHED" event (e.g. do setsockopt
on the newly established sk).  Right now, it is done by calling the
general purpose tcp_call_bpf().

In the later patch, it also needs to pass the just-received skb which
concludes the 3 way handshake. E.g. the SYNACK received at the active side.
The bpf prog can then learn some specific header options written by the
peer's bpf-prog and potentially do setsockopt on the newly established sk.
Thus, instead of reusing the general purpose tcp_call_bpf(), a new function
bpf_skops_established() is added to allow passing the "skb" to the bpf
prog.  The actual skb passing from bpf_skops_established() to the bpf prog
will happen together in a later patch which has the necessary bpf pieces.

A "skb" arg is also added to tcp_init_transfer() such that
it can then be passed to bpf_skops_established().

Calling the new bpf_skops_established() instead of tcp_call_bpf()
should be a noop in this patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190039.2884750-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau
7656d68455 tcp: Add saw_unknown to struct tcp_options_received
In a later patch, the bpf prog only wants to be called to handle
a header option if that particular header option cannot be handled by
the kernel.  This unknown option could be written by the peer's bpf-prog.
It could also be a new standard option that the running kernel does not
support it while a bpf-prog can handle it.

This patch adds a "saw_unknown" bit to "struct tcp_options_received"
and it uses an existing one byte hole to do that.  "saw_unknown" will
be set in tcp_parse_options() if it sees an option that the kernel
cannot handle.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190033.2884430-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau
ca584ba070 tcp: bpf: Add TCP_BPF_RTO_MIN for bpf_setsockopt
This patch adds bpf_setsockopt(TCP_BPF_RTO_MIN) to allow bpf prog
to set the min rto of a connection.  It could be used together
with the earlier patch which has added bpf_setsockopt(TCP_BPF_DELACK_MAX).

A later selftest patch will communicate the max delay ack in a
bpf tcp header option and then the receiving side can use
bpf_setsockopt(TCP_BPF_RTO_MIN) to set a shorter rto.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190027.2884170-1-kafai@fb.com
2020-08-24 14:35:00 -07:00
Martin KaFai Lau
2b8ee4f05d tcp: bpf: Add TCP_BPF_DELACK_MAX setsockopt
This change is mostly from an internal patch and adapts it from sysctl
config to the bpf_setsockopt setup.

The bpf_prog can set the max delay ack by using
bpf_setsockopt(TCP_BPF_DELACK_MAX).  This max delay ack can be communicated
to its peer through bpf header option.  The receiving peer can then use
this max delay ack and set a potentially lower rto by using
bpf_setsockopt(TCP_BPF_RTO_MIN) which will be introduced
in the next patch.

Another later selftest patch will also use it like the above to show
how to write and parse bpf tcp header option.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190021.2884000-1-kafai@fb.com
2020-08-24 14:34:59 -07:00
Martin KaFai Lau
70a217f197 tcp: Use a struct to represent a saved_syn
The TCP_SAVE_SYN has both the network header and tcp header.
The total length of the saved syn packet is currently stored in
the first 4 bytes (u32) of an array and the actual packet data is
stored after that.

A later patch will add a bpf helper that allows to get the tcp header
alone from the saved syn without the network header.  It will be more
convenient to have a direct offset to a specific header instead of
re-parsing it.  This requires to separately store the network hdrlen.
The total header length (i.e. network + tcp) is still needed for the
current usage in getsockopt.  Although this total length can be obtained
by looking into the tcphdr and then get the (th->doff << 2), this patch
chooses to directly store the tcp hdrlen in the second four bytes of
this newly created "struct saved_syn".  By using a new struct, it can
give a readable name to each individual header length.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200820190014.2883694-1-kafai@fb.com
2020-08-24 14:34:59 -07:00
David S. Miller
a26aea2010 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Don't flag SCTP heartbeat as invalid for re-used connections,
   from Florian Westphal.

2) Bogus overlap report due to rbtree tree rotations, from Stefano Brivio.

3) Detect partial overlap with start end point match, also from Stefano.

4) Skip netlink dump of NFTA_SET_USERDATA is unset.

5) Incorrect nft_list_attributes enumeration definition.

6) Missing zeroing before memcpy to destination register, also
   from Florian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-24 06:37:05 -07:00
Gustavo A. R. Silva
df561f6688 treewide: Use fallthrough pseudo-keyword
Replace the existing /* fall through */ comments and its variants with
the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary
fall-through markings when it is the case.

[1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
2020-08-23 17:36:59 -05:00
David S. Miller
7611cbb900 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2020-08-23 11:48:27 -07:00
Tom Parkin
eee049c0ef l2tp: remove tunnel and session debug flags field
The l2tp subsystem now uses standard kernel logging APIs for
informational and warning messages, and tracepoints for debug
information.

Now that the tunnel and session debug flags are unused, remove the field
from the core structures.

Various system calls (in the case of l2tp_ppp) and netlink messages
handle the getting and setting of debug flags.  To avoid userspace
breakage don't modify the API of these calls; simply ignore set
requests, and send dummy data for get requests.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-22 12:44:37 -07:00
Tom Parkin
ac6ebaf06e l2tp: remove custom logging macros
All l2tp's informational and warning logging is now carried out using
standard kernel APIs.

Debugging information is now handled using tracepoints.

Now that no code is using the custom logging macros, remove them from
l2tp_core.h.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-22 12:44:37 -07:00
Tom Parkin
6b7bdcd7ca l2tp: add tracepoints to l2tp_core.c
Add lifetime event tracing for tunnel and session instances, tracking
tunnel and session registration, deletion, and eventual freeing.

Port the data path sequence number debug logging to use trace points
rather than custom debug macros.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-22 12:44:37 -07:00
Tom Parkin
2a03dd8e11 l2tp: add tracepoint definitions in trace.h
l2tp can provide a better debug experience using tracepoints rather than
printk-style logging.

Add tracepoint definitions in trace.h for use in the l2tp subsystem
code.

Add preprocessor definitions for the length of session and tunnel names
in l2tp_core.h so we can reuse these in trace.h.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-22 12:44:37 -07:00
Tom Parkin
3f117d6f4b l2tp: add tracepoint infrastructure to core
The l2tp subsystem doesn't currently make use of tracepoints.

As a starting point for adding tracepoints, add skeleton infrastructure
for defining tracepoints for the subsystem, and for having them build
appropriately whether compiled into the kernel or built as a module.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-22 12:44:37 -07:00
Tom Parkin
5ee759cda5 l2tp: use standard API for warning log messages
The l2tp_* log wrappers only emit messages of a given category if the
tunnel or session structure has the appropriate flag set in its debug
field.  Flags default to being unset.

For warning messages, this doesn't make a lot of sense since an
administrator is likely to want to know about datapath warnings without
needing to tweak the debug flags setting for a given tunnel or session
instance.

Modify l2tp_warn callsites to use pr_warn_ratelimited instead for
unconditional output of warning messages.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-22 12:44:37 -07:00
Tom Parkin
ab141e3733 l2tp: remove noisy logging, use appropriate log levels
l2tp_ppp in particular had a lot of log messages for tracing
[get|set]sockopt calls.  These aren't especially useful, so remove
these messages.

Several log messages flagging error conditions were logged using
l2tp_info: they're better off as l2tp_warn.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-22 12:44:37 -07:00
Tom Parkin
12923365eb l2tp: don't log data frames
l2tp had logging to trace data frame receipt and transmission, including
code to dump packet contents.  This was originally intended to aid
debugging of core l2tp packet handling, but is of limited use now that
code is stable.

Signed-off-by: Tom Parkin <tparkin@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-22 12:44:37 -07:00
Nikolay Aleksandrov
eeaac3634e net: nexthop: don't allow empty NHA_GROUP
Currently the nexthop code will use an empty NHA_GROUP attribute, but it
requires at least 1 entry in order to function properly. Otherwise we
end up derefencing null or random pointers all over the place due to not
having any nh_grp_entry members allocated, nexthop code relies on having at
least the first member present. Empty NHA_GROUP doesn't make any sense so
just disallow it.
Also add a WARN_ON for any future users of nexthop_create_group().

 BUG: kernel NULL pointer dereference, address: 0000000000000080
 #PF: supervisor read access in kernel mode
 #PF: error_code(0x0000) - not-present page
 PGD 0 P4D 0
 Oops: 0000 [#1] SMP
 CPU: 0 PID: 558 Comm: ip Not tainted 5.9.0-rc1+ #93
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
 RIP: 0010:fib_check_nexthop+0x4a/0xaa
 Code: 0f 84 83 00 00 00 48 c7 02 80 03 f7 81 c3 40 80 fe fe 75 12 b8 ea ff ff ff 48 85 d2 74 6b 48 c7 02 40 03 f7 81 c3 48 8b 40 10 <48> 8b 80 80 00 00 00 eb 36 80 78 1a 00 74 12 b8 ea ff ff ff 48 85
 RSP: 0018:ffff88807983ba00 EFLAGS: 00010213
 RAX: 0000000000000000 RBX: ffff88807983bc00 RCX: 0000000000000000
 RDX: ffff88807983bc00 RSI: 0000000000000000 RDI: ffff88807bdd0a80
 RBP: ffff88807983baf8 R08: 0000000000000dc0 R09: 000000000000040a
 R10: 0000000000000000 R11: ffff88807bdd0ae8 R12: 0000000000000000
 R13: 0000000000000000 R14: ffff88807bea3100 R15: 0000000000000001
 FS:  00007f10db393700(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000080 CR3: 000000007bd0f004 CR4: 00000000003706f0
 Call Trace:
  fib_create_info+0x64d/0xaf7
  fib_table_insert+0xf6/0x581
  ? __vma_adjust+0x3b6/0x4d4
  inet_rtm_newroute+0x56/0x70
  rtnetlink_rcv_msg+0x1e3/0x20d
  ? rtnl_calcit.isra.0+0xb8/0xb8
  netlink_rcv_skb+0x5b/0xac
  netlink_unicast+0xfa/0x17b
  netlink_sendmsg+0x334/0x353
  sock_sendmsg_nosec+0xf/0x3f
  ____sys_sendmsg+0x1a0/0x1fc
  ? copy_msghdr_from_user+0x4c/0x61
  ___sys_sendmsg+0x63/0x84
  ? handle_mm_fault+0xa39/0x11b5
  ? sockfd_lookup_light+0x72/0x9a
  __sys_sendmsg+0x50/0x6e
  do_syscall_64+0x54/0xbe
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f10dacc0bb7
 Code: d8 64 89 02 48 c7 c0 ff ff ff ff eb cd 66 0f 1f 44 00 00 8b 05 9a 4b 2b 00 85 c0 75 2e 48 63 ff 48 63 d2 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 01 c3 48 8b 15 b1 f2 2a 00 f7 d8 64 89 02 48
 RSP: 002b:00007ffcbe628bf8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007ffcbe628f80 RCX: 00007f10dacc0bb7
 RDX: 0000000000000000 RSI: 00007ffcbe628c60 RDI: 0000000000000003
 RBP: 000000005f41099c R08: 0000000000000001 R09: 0000000000000008
 R10: 00000000000005e9 R11: 0000000000000246 R12: 0000000000000000
 R13: 0000000000000000 R14: 00007ffcbe628d70 R15: 0000563a86c6e440
 Modules linked in:
 CR2: 0000000000000080

CC: David Ahern <dsahern@gmail.com>
Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Reported-by: syzbot+a61aa19b0c14c8770bd9@syzkaller.appspotmail.com
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-22 12:39:55 -07:00
Miaohe Lin
1aecbf1861 net: dccp: Convert to use the preferred fallthrough macro
Convert the uses of fallthrough comments to fallthrough macro.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-22 12:38:34 -07:00
Lorenz Bauer
0126240f44 bpf: sockmap: Allow update from BPF
Allow calling bpf_map_update_elem on sockmap and sockhash from a BPF
context. The synchronization required for this is a bit fiddly: we
need to prevent the socket from changing its state while we add it
to the sockmap, since we rely on getting a callback via
sk_prot->unhash. However, we can't just lock_sock like in
sock_map_sk_acquire because that might sleep. So instead we disable
softirq processing and use bh_lock_sock to prevent further
modification.

Yet, this is still not enough. BPF can be called in contexts where
the current CPU might have locked a socket. If the BPF can get
a hold of such a socket, inserting it into a sockmap would lead to
a deadlock. One straight forward example are sock_ops programs that
have ctx->sk, but the same problem exists for kprobes, etc.
We deal with this by allowing sockmap updates only from known safe
contexts. Improper usage is rejected by the verifier.

I've audited the enabled contexts to make sure they can't run in
a locked context. It's possible that CGROUP_SKB and others are
safe as well, but the auditing here is much more difficult. In
any case, we can extend the safe contexts when the need arises.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200821102948.21918-6-lmb@cloudflare.com
2020-08-21 15:16:12 -07:00
Lorenz Bauer
13b79d3ffb bpf: sockmap: Call sock_map_update_elem directly
Don't go via map->ops to call sock_map_update_elem, since we know
what function to call in bpf_map_update_value. Since we currently
don't allow calling map_update_elem from BPF context, we can remove
ops->map_update_elem and rename the function to sock_map_update_elem_sys.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Yonghong Song <yhs@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200821102948.21918-4-lmb@cloudflare.com
2020-08-21 15:16:11 -07:00
Lorenz Bauer
38e12f908a bpf: sockmap: Merge sockmap and sockhash update functions
Merge the two very similar functions sock_map_update_elem and
sock_hash_update_elem into one.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200821102948.21918-3-lmb@cloudflare.com
2020-08-21 15:16:11 -07:00
Lorenz Bauer
7b219da43f net: sk_msg: Simplify sk_psock initialization
Initializing psock->sk_proto and other saved callbacks is only
done in sk_psock_update_proto, after sk_psock_init has returned.
The logic for this is difficult to follow, and needlessly complex.

Instead, initialize psock->sk_proto whenever we allocate a new
psock. Additionally, assert the following invariants:

* The SK has no ULP: ULP does it's own finagling of sk->sk_prot
* sk_user_data is unused: we need it to store sk_psock

Protect our access to sk_user_data with sk_callback_lock, which
is what other users like reuseport arrays, etc. do.

The result is that an sk_psock is always fully initialized, and
that psock->sk_proto is always the "original" struct proto.
The latter allows us to use psock->sk_proto when initializing
IPv6 TCP / UDP callbacks for sockmap.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200821102948.21918-2-lmb@cloudflare.com
2020-08-21 15:16:11 -07:00
Yonghong Song
b76f222690 bpf: Implement link_query callbacks in map element iterators
For bpf_map_elem and bpf_sk_local_storage bpf iterators,
additional map_id should be shown for fdinfo and
userspace query. For example, the following is for
a bpf_map_elem iterator.
  $ cat /proc/1753/fdinfo/9
  pos:    0
  flags:  02000000
  mnt_id: 14
  link_type:      iter
  link_id:        34
  prog_tag:       104be6d3fe45e6aa
  prog_id:        173
  target_name:    bpf_map_elem
  map_id: 127

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200821184419.574240-1-yhs@fb.com
2020-08-21 14:01:39 -07:00
David S. Miller
4af7b32f84 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2020-08-21

The following pull-request contains BPF updates for your *net* tree.

We've added 11 non-merge commits during the last 5 day(s) which contain
a total of 12 files changed, 78 insertions(+), 24 deletions(-).

The main changes are:

1) three fixes in BPF task iterator logic, from Yonghong.

2) fix for compressed dwarf sections in vmlinux, from Jiri.

3) fix xdp attach regression, from Andrii.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-21 12:54:50 -07:00
Florian Westphal
1e105e6afa netfilter: nf_tables: fix destination register zeroing
Following bug was reported via irc:
nft list ruleset
   set knock_candidates_ipv4 {
      type ipv4_addr . inet_service
      size 65535
      elements = { 127.0.0.1 . 123,
                   127.0.0.1 . 123 }
      }
 ..
   udp dport 123 add @knock_candidates_ipv4 { ip saddr . 123 }
   udp dport 123 add @knock_candidates_ipv4 { ip saddr . udp dport }

It should not have been possible to add a duplicate set entry.

After some debugging it turned out that the problem is the immediate
value (123) in the second-to-last rule.

Concatenations use 32bit registers, i.e. the elements are 8 bytes each,
not 6 and it turns out the kernel inserted

inet firewall @knock_candidates_ipv4
        element 0100007f ffff7b00  : 0 [end]
        element 0100007f 00007b00  : 0 [end]

Note the non-zero upper bits of the first element.  It turns out that
nft_immediate doesn't zero the destination register, but this is needed
when the length isn't a multiple of 4.

Furthermore, the zeroing in nft_payload is broken.  We can't use
[len / 4] = 0 -- if len is a multiple of 4, index is off by one.

Skip zeroing in this case and use a conditional instead of (len -1) / 4.

Fixes: 49499c3e6e ("netfilter: nf_tables: switch registers to 32 bit addressing")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-21 19:00:33 +02:00
Pablo Neira Ayuso
6f03bf43ee netfilter: nf_tables: add NFTA_SET_USERDATA if not null
Kernel sends an empty NFTA_SET_USERDATA attribute with no value if
userspace adds a set with no NFTA_SET_USERDATA attribute.

Fixes: e6d8ecac9e ("netfilter: nf_tables: Add new attributes into nft_set to store user data.")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-21 19:00:14 +02:00
Stefano Brivio
0726763043 netfilter: nft_set_rbtree: Detect partial overlap with start endpoint match
Getting creative with nft and omitting the interval_overlap()
check from the set_overlap() function, without omitting
set_overlap() altogether, led to the observation of a partial
overlap that wasn't detected, and would actually result in
replacement of the end element of an existing interval.

This is due to the fact that we'll return -EEXIST on a matching,
pre-existing start element, instead of -ENOTEMPTY, and the error
is cleared by API if NLM_F_EXCL is not given. At this point, we
can insert a matching start, and duplicate the end element as long
as we don't end up into other intervals.

For instance, inserting interval 0 - 2 with an existing 0 - 3
interval would result in a single 0 - 2 interval, and a dangling
'3' end element. This is because nft will proceed after inserting
the '0' start element as no error is reported, and no further
conflicting intervals are detected on insertion of the end element.

This needs a different approach as it's a local condition that can
be detected by looking for duplicate ends coming from left and
right, separately. Track those and directly report -ENOTEMPTY on
duplicated end elements for a matching start.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-21 17:37:36 +02:00
Stefano Brivio
226a88de47 netfilter: nft_set_rbtree: Handle outcomes of tree rotations in overlap detection
Checks for partial overlaps on insertion assume that end elements
are always descendant nodes of their corresponding start, because
they are inserted later. However, this is not the case if a
previous delete operation caused a tree rotation as part of
rebalancing.

Taking the issue reported by Andreas Fischer as an example, if we
omit delete operations, the existing procedure works because,
equivalently, we are inserting a start item with value 40 in the
this region of the red-black tree with single-sized intervals:

                                  overlap flag
                   10 (start)
                  /  \            false
                      20 (start)
                     /  \         false
                         30 (start)
                        /  \      false
                            60 (start)
                           /  \   false
                         50 (end)
                        /  \      false
                      20 (end)
                     /  \         false
                         40 (start)

if we now delete interval 30 - 30, the tree can be rearranged in
a way similar to this (note the rotation involving 50 - 50):

                                  overlap flag
                   10 (start)
                  /  \            false
                      20 (start)
                     /  \         false
                         25 (start)
                        /  \      false
                            70 (start)
                           /  \   false
                         50 (end)
                        /  \      true (from rule a1.)
                      50 (start)
                     /  \         true
                   40 (start)

and we traverse interval 50 - 50 from the opposite direction
compared to what was expected.

To deal with those cases, add a start-before-start rule, b4.,
that covers traversal of existing intervals from the right.

We now need to restrict start-after-end rule b3. to cases
where there are no occurring nodes between existing start and
end elements, because addition of rule b4. isn't sufficient to
ensure that the pre-existing end element we encounter while
descending the tree corresponds to a start element of an
interval that we already traversed entirely.

Different types of overlap detection on trees with rotations
resulting from re-balancing will be covered by nft test case
sets/0044interval_overlap_1.

Reported-by: Andreas Fischer <netfilter@d9c.eu>
Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1449
Cc: <stable@vger.kernel.org> # 5.6.x
Fixes: 7c84d41416 ("netfilter: nft_set_rbtree: Detect partial overlaps on insertion")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-21 17:36:52 +02:00
Xin Long
f6db909641 tipc: call rcu_read_lock() in tipc_aead_encrypt_done()
b->media->send_msg() requires rcu_read_lock(), as we can see
elsewhere in tipc,  tipc_bearer_xmit, tipc_bearer_xmit_skb
and tipc_bearer_bc_xmit().

Syzbot has reported this issue as:

  net/tipc/bearer.c:466 suspicious rcu_dereference_check() usage!
  Workqueue: cryptd cryptd_queue_worker
  Call Trace:
   tipc_l2_send_msg+0x354/0x420 net/tipc/bearer.c:466
   tipc_aead_encrypt_done+0x204/0x3a0 net/tipc/crypto.c:761
   cryptd_aead_crypt+0xe8/0x1d0 crypto/cryptd.c:739
   cryptd_queue_worker+0x118/0x1b0 crypto/cryptd.c:181
   process_one_work+0x94c/0x1670 kernel/workqueue.c:2269
   worker_thread+0x64c/0x1120 kernel/workqueue.c:2415
   kthread+0x3b5/0x4a0 kernel/kthread.c:291
   ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:293

So fix it by calling rcu_read_lock() in tipc_aead_encrypt_done()
for b->media->send_msg().

Fixes: fc1b6d6de2 ("tipc: introduce TIPC encryption & authentication")
Reported-by: syzbot+47bbc6b678d317cccbe0@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-20 16:42:08 -07:00
Alaa Hleihel
eda814b97d net/sched: act_ct: Fix skb double-free in tcf_ct_handle_fragments() error flow
tcf_ct_handle_fragments() shouldn't free the skb when ip_defrag() call
fails. Otherwise, we will cause a double-free bug.
In such cases, just return the error to the caller.

Fixes: b57dc7c13e ("net/sched: Introduce action ct")
Signed-off-by: Alaa Hleihel <alaa@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-20 16:39:31 -07:00
David Laight
ab921f3cdb net: sctp: Fix negotiation of the number of data streams.
The number of output and input streams was never being reduced, eg when
processing received INIT or INIT_ACK chunks.
The effect is that DATA chunks can be sent with invalid stream ids
and then discarded by the remote system.

Fixes: 2075e50caf ("sctp: convert to genradix")
Signed-off-by: David Laight <david.laight@aculab.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-20 16:37:37 -07:00
Mark Tomlinson
272502fcb7 gre6: Fix reception with IP6_TNL_F_RCV_DSCP_COPY
When receiving an IPv4 packet inside an IPv6 GRE packet, and the
IP6_TNL_F_RCV_DSCP_COPY flag is set on the tunnel, the IPv4 header would
get corrupted. This is due to the common ip6_tnl_rcv() function assuming
that the inner header is always IPv6. This patch checks the tunnel
protocol for IPv4 inner packets, but still defaults to IPv6.

Fixes: 308edfdf15 ("gre6: Cleanup GREv6 receive path, call common GRE functions")
Signed-off-by: Mark Tomlinson <mark.tomlinson@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-20 16:28:04 -07:00
Vishal Kulkarni
d0a84e1f38 ethtool: allow flow-type ether without IP protocol field
Set IP protocol mask only when IP protocol field is set.
This will allow flow-type ether with vlan rule which don't have
protocol field to apply.

ethtool -N ens5f4 flow-type ether proto 0x8100 vlan 0x600\
m 0x1FFF action 3 loc 16

Signed-off-by: Vishal Kulkarni <vishal@chelsio.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-20 16:26:03 -07:00
Eric Dumazet
394fcd8a81 net: zerocopy: combine pages in zerocopy_sg_from_iter()
Currently, tcp sendmsg(MSG_ZEROCOPY) is building skbs with order-0 fragments.
Compared to standard sendmsg(), these skbs usually contain up to 16 fragments
on arches with 4KB page sizes, instead of two.

This adds considerable costs on various ndo_start_xmit() handlers,
especially when IOMMU is in the picture.

As high performance applications are often using huge pages,
we can try to combine adjacent pages belonging to same
compound page.

Tested on AMD Rome platform, with IOMMU, nominal single TCP flow speed
is roughly doubled (~55Gbit -> ~100Gbit), when user application
is using hugepages.

For reference, nominal single TCP flow speed on this platform
without MSG_ZEROCOPY is ~65Gbit.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-20 16:12:50 -07:00
Andrii Nakryiko
c8a36f1945 bpf: xdp: Fix XDP mode when no mode flags specified
7f0a838254 ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device")
inadvertently changed which XDP mode is assumed when no mode flags are
specified explicitly. Previously, driver mode was preferred, if driver
supported it. If not, generic SKB mode was chosen. That commit changed default
to SKB mode always. This patch fixes the issue and restores the original
logic.

Fixes: 7f0a838254 ("bpf, xdp: Maintain info on attached XDP BPF programs in net_device")
Reported-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/bpf/20200820052841.1559757-1-andriin@fb.com
2020-08-20 14:27:12 -07:00
Al Viro
cc44c17baf csum_partial_copy_nocheck(): drop the last argument
It's always 0.  Note that we theoretically could use ~0U as well -
result will be the same modulo 0xffff, _if_ the damn thing did the
right thing for any value of initial sum; later we'll make use of
that when convenient.

However, unlike csum_and_copy_..._user(), there are instances that
did not work for arbitrary initial sums; c6x is one such.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-20 15:45:14 -04:00
Al Viro
3ea7ca80d9 icmp_push_reply(): reorder adding the checksum up
do csum_partial_copy_nocheck() on the first fragment, then
add the rest to it.  Equivalent transformation.

That was the only caller of csum_partial_copy_nocheck() that
might pass it non-zero as the last argument.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-20 15:45:13 -04:00
Al Viro
8d5930dfb7 skb_copy_and_csum_bits(): don't bother with the last argument
it's always 0

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-08-20 15:45:13 -04:00
Peilin Ye
ce51f63e63 net/smc: Prevent kernel-infoleak in __smc_diag_dump()
__smc_diag_dump() is potentially copying uninitialized kernel stack memory
into socket buffers, since the compiler may leave a 4-byte hole near the
beginning of `struct smcd_diag_dmbinfo`. Fix it by initializing `dinfo`
with memset().

Fixes: 4b1b7d3b30 ("net/smc: add SMC-D diag support")
Suggested-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Peilin Ye <yepeilin.cs@gmail.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-20 12:07:31 -07:00
David Howells
1d4adfaf65 rxrpc: Make rxrpc_kernel_get_srtt() indicate validity
Fix rxrpc_kernel_get_srtt() to indicate the validity of the returned
smoothed RTT.  If we haven't had any valid samples yet, the SRTT isn't
useful.

Fixes: c410bf0193 ("rxrpc: Fix the excessive initial retransmission timeout")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-08-20 18:21:28 +01:00
David Howells
4700c4d80b rxrpc: Fix loss of RTT samples due to interposed ACK
The Rx protocol has a mechanism to help generate RTT samples that works by
a client transmitting a REQUESTED-type ACK when it receives a DATA packet
that has the REQUEST_ACK flag set.

The peer, however, may interpose other ACKs before transmitting the
REQUESTED-ACK, as can be seen in the following trace excerpt:

 rxrpc_tx_data: c=00000044 DATA d0b5ece8:00000001 00000001 q=00000001 fl=07
 rxrpc_rx_ack: c=00000044 00000001 PNG r=00000000 f=00000002 p=00000000 n=0
 rxrpc_rx_ack: c=00000044 00000002 REQ r=00000001 f=00000002 p=00000001 n=0
 ...

DATA packet 1 (q=xx) has REQUEST_ACK set (bit 1 of fl=xx).  The incoming
ping (labelled PNG) hard-acks the request DATA packet (f=xx exceeds the
sequence number of the DATA packet), causing it to be discarded from the Tx
ring.  The ACK that was requested (labelled REQ, r=xx references the serial
of the DATA packet) comes after the ping, but the sk_buff holding the
timestamp has gone and the RTT sample is lost.

This is particularly noticeable on RPC calls used to probe the service
offered by the peer.  A lot of peers end up with an unknown RTT because we
only ever sent a single RPC.  This confuses the server rotation algorithm.

Fix this by caching the information about the outgoing packet in RTT
calculations in the rxrpc_call struct rather than looking in the Tx ring.

A four-deep buffer is maintained and both REQUEST_ACK-flagged DATA and
PING-ACK transmissions are recorded in there.  When the appropriate
response ACK is received, the buffer is checked for a match and, if found,
an RTT sample is recorded.

If a received ACK refers to a packet with a later serial number than an
entry in the cache, that entry is presumed lost and the entry is made
available to record a new transmission.

ACKs types other than REQUESTED-type and PING-type cause any matching
sample to be cancelled as they don't necessarily represent a useful
measurement.

If there's no space in the buffer on ping/data transmission, the sample
base is discarded.

Fixes: 50235c4b5a ("rxrpc: Obtain RTT data by requesting ACKs on DATA packets")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-08-20 17:59:27 +01:00
David Howells
68528d937d rxrpc: Keep the ACK serial in a var in rxrpc_input_ack()
Keep the ACK serial number in a variable in rxrpc_input_ack() as it's used
frequently.

Signed-off-by: David Howells <dhowells@redhat.com>
2020-08-20 16:52:23 +01:00
Johannes Berg
fce2ff728f nl80211: fix NL80211_ATTR_HE_6GHZ_CAPABILITY usage
In nl80211_set_station(), we check NL80211_ATTR_HE_6GHZ_CAPABILITY
and then use NL80211_ATTR_HE_CAPABILITY, which is clearly wrong.
Fix this to use NL80211_ATTR_HE_6GHZ_CAPABILITY as well.

Cc: stable@vger.kernel.org
Fixes: 43e64bf301 ("cfg80211: handle 6 GHz capability of new station")
Link: https://lore.kernel.org/r/20200805153516.310cef625955.I0abc04dc8abb2c7c005c88ef8fa2d0e3c9fb95c4@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-08-20 17:23:03 +02:00
Alexei Starovoitov
d71fa5c976 bpf: Add kernel module with user mode driver that populates bpffs.
Add kernel module with user mode driver that populates bpffs with
BPF iterators.

$ mount bpffs /my/bpffs/ -t bpf
$ ls -la /my/bpffs/
total 4
drwxrwxrwt  2 root root    0 Jul  2 00:27 .
drwxr-xr-x 19 root root 4096 Jul  2 00:09 ..
-rw-------  1 root root    0 Jul  2 00:27 maps.debug
-rw-------  1 root root    0 Jul  2 00:27 progs.debug

The user mode driver will load BPF Type Formats, create BPF maps, populate BPF
maps, load two BPF programs, attach them to BPF iterators, and finally send two
bpf_link IDs back to the kernel.
The kernel will pin two bpf_links into newly mounted bpffs instance under
names "progs.debug" and "maps.debug". These two files become human readable.

$ cat /my/bpffs/progs.debug
  id name            attached
  11 dump_bpf_map    bpf_iter_bpf_map
  12 dump_bpf_prog   bpf_iter_bpf_prog
  27 test_pkt_access
  32 test_main       test_pkt_access test_pkt_access
  33 test_subprog1   test_pkt_access_subprog1 test_pkt_access
  34 test_subprog2   test_pkt_access_subprog2 test_pkt_access
  35 test_subprog3   test_pkt_access_subprog3 test_pkt_access
  36 new_get_skb_len get_skb_len test_pkt_access
  37 new_get_skb_ifindex get_skb_ifindex test_pkt_access
  38 new_get_constant get_constant test_pkt_access

The BPF program dump_bpf_prog() in iterators.bpf.c is printing this data about
all BPF programs currently loaded in the system. This information is unstable
and will change from kernel to kernel as ".debug" suffix conveys.

Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200819042759.51280-4-alexei.starovoitov@gmail.com
2020-08-20 16:02:36 +02:00
Florian Westphal
cc5453a5b7 netfilter: conntrack: allow sctp hearbeat after connection re-use
If an sctp connection gets re-used, heartbeats are flagged as invalid
because their vtag doesn't match.

Handle this in a similar way as TCP conntrack when it suspects that the
endpoints and conntrack are out-of-sync.

When a HEARTBEAT request fails its vtag validation, flag this in the
conntrack state and accept the packet.

When a HEARTBEAT_ACK is received with an invalid vtag in the reverse
direction after we allowed such a HEARTBEAT through, assume we are
out-of-sync and re-set the vtag info.

v2: remove left-over snippet from an older incarnation that moved
    new_state/old_state assignments, thats not needed so keep that
    as-is.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-20 14:13:49 +02:00
Kurt Kanzenbach
bdfbb63c31 ptp: Add generic ptp v2 header parsing function
Reason: A lot of the ptp drivers - which implement hardware time stamping - need
specific fields such as the sequence id from the ptp v2 header. Currently all
drivers implement that themselves.

Introduce a generic function to retrieve a pointer to the start of the ptp v2
header.

Suggested-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Richard Cochran <richardcochran@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-19 16:07:49 -07:00
Johannes Berg
d1fb555929 netlink: fix state reallocation in policy export
Evidently, when I did this previously, we didn't have more than
10 policies and didn't run into the reallocation path, because
it's missing a memset() for the unused policies. Fix that.

Fixes: d07dcf9aad ("netlink: add infrastructure to expose policies to userspace")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-19 15:39:36 -07:00
Miaohe Lin
f4ecc74853 net: Stop warning about SO_BSDCOMPAT usage
We've been warning about SO_BSDCOMPAT usage for many years. We may remove
this code completely now.

Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-19 12:53:48 -07:00
Wang Hai
ad112aa8b1 SUNRPC: remove duplicate include
Remove linux/sunrpc/auth_gss.h which is included more than once

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Wang Hai <wanghai38@huawei.com>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2020-08-19 13:19:42 -04:00
Colin Ian King
ad6641189c net: ipv4: remove duplicate "the the" phrase in Kconfig text
The Kconfig help text contains the phrase "the the" in the help
text. Fix this and reformat the block of help text.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-18 16:02:16 -07:00
Maxim Mikityanskiy
f01204ec8b ethtool: Don't omit the netlink reply if no features were changed
The legacy ethtool userspace tool shows an error when no features could
be changed. It's useful to have a netlink reply to be able to show this
error when __netdev_update_features wasn't called, for example:

1. ethtool -k eth0
   large-receive-offload: off
2. ethtool -K eth0 rx-fcs on
3. ethtool -K eth0 lro on
   Could not change any device features
   rx-lro: off [requested on]
4. ethtool -K eth0 lro on
   # The output should be the same, but without this patch the kernel
   # doesn't send the reply, and ethtool is unable to detect the error.

This commit makes ethtool-netlink always return a reply when requested,
and it still avoids unnecessary calls to __netdev_update_features if the
wanted features haven't changed.

Fixes: 0980bfcd69 ("ethtool: set netdev features with FEATURES_SET request")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-18 16:00:24 -07:00
Maxim Mikityanskiy
2847bfed88 ethtool: Account for hw_features in netlink interface
ethtool-netlink ignores dev->hw_features and may confuse the drivers by
asking them to enable features not in the hw_features bitmask. For
example:

1. ethtool -k eth0
   tls-hw-tx-offload: off [fixed]
2. ethtool -K eth0 tls-hw-tx-offload on
   tls-hw-tx-offload: on
3. ethtool -k eth0
   tls-hw-tx-offload: on [fixed]

Fitler out dev->hw_features from req_wanted to fix it and to resemble
the legacy ethtool behavior.

Fixes: 0980bfcd69 ("ethtool: set netdev features with FEATURES_SET request")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-18 16:00:24 -07:00
Maxim Mikityanskiy
840110a4ea ethtool: Fix preserving of wanted feature bits in netlink interface
Currently, ethtool-netlink calculates new wanted bits as:
(req_wanted & req_mask) | (old_active & ~req_mask)

It completely discards the old wanted bits, so they are forgotten with
the next ethtool command. Sample steps to reproduce:

1. ethtool -k eth0
   tx-tcp-segmentation: on # TSO is on from the beginning
2. ethtool -K eth0 tx off
   tx-tcp-segmentation: off [not requested]
3. ethtool -k eth0
   tx-tcp-segmentation: off [requested on]
4. ethtool -K eth0 rx off # Some change unrelated to TSO
5. ethtool -k eth0
   tx-tcp-segmentation: off # "Wanted on" is forgotten

This commit fixes it by changing the formula to:
(req_wanted & req_mask) | (old_wanted & ~req_mask),
where old_active was replaced by old_wanted to account for the wanted
bits.

The shortcut condition for the case where nothing was changed now
compares wanted bitmasks, instead of wanted to active.

Fixes: 0980bfcd69 ("ethtool: set netdev features with FEATURES_SET request")
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-18 16:00:24 -07:00
Xin Long
4ef1a7cb08 ipv6: some fixes for ipv6_dev_find()
This patch is to do 3 things for ipv6_dev_find():

  As David A. noticed,

  - rt6_lookup() is not really needed. Different from __ip_dev_find(),
    ipv6_dev_find() doesn't have a compatibility problem, so remove it.

  As Hideaki suggested,

  - "valid" (non-tentative) check for the address is also needed.
    ipv6_chk_addr() calls ipv6_chk_addr_and_flags(), which will
    traverse the address hash list, but it's heavy to be called
    inside ipv6_dev_find(). This patch is to reuse the code of
    ipv6_chk_addr_and_flags() for ipv6_dev_find().

  - dev parameter is passed into ipv6_dev_find(), as link-local
    addresses from user space has sin6_scope_id set and the dev
    lookup needs it.

Fixes: 81f6cb3122 ("ipv6: add ipv6_dev_find()")
Suggested-by: YOSHIFUJI Hideaki <hideaki.yoshifuji@miraclelinux.com>
Reported-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-18 15:58:53 -07:00
Miaohe Lin
eabe861881 net: handle the return value of pskb_carve_frag_list() correctly
pskb_carve_frag_list() may return -ENOMEM in pskb_carve_inside_nonlinear().
we should handle this correctly or we would get wrong sk_buff.

Fixes: 6fa01ccd88 ("skbuff: Add pskb_extract() helper function")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-18 15:56:56 -07:00
Miaohe Lin
e3ec1e8ca0 net: eliminate meaningless memcpy to data in pskb_carve_inside_nonlinear()
The frags of skb_shared_info of the data is assigned in following loop. It
is meaningless to do a memcpy of frags here.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-18 15:55:24 -07:00
Miaohe Lin
7f8901b74b net: tipc: Convert to use the preferred fallthrough macro
Convert the uses of fallthrough comments to fallthrough macro.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-18 12:46:48 -07:00
Johannes Berg
8aa26c575f netlink: make NLA_BINARY validation more flexible
Add range validation for NLA_BINARY, allowing validation of any
combination of combination minimum or maximum lengths, using the
existing NLA_POLICY_RANGE()/NLA_POLICY_FULL_RANGE() macros, just
like for integers where the value is checked.

Also make NLA_POLICY_EXACT_LEN(), NLA_POLICY_EXACT_LEN_WARN()
and NLA_POLICY_MIN_LEN() special cases of this, removing the old
types NLA_EXACT_LEN and NLA_MIN_LEN.

This allows us to save some code where both minimum and maximum
lengths are requires, currently the policy only allows maximum
(NLA_BINARY), minimum (NLA_MIN_LEN) or exact (NLA_EXACT_LEN), so
a range of lengths cannot be accepted and must be checked by the
code that consumes the attributes later.

Also, this allows advertising the correct ranges in the policy
export to userspace. Here, NLA_MIN_LEN and NLA_EXACT_LEN already
were special cases of NLA_BINARY with min and min/max length
respectively.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-18 12:28:45 -07:00
Johannes Berg
bc04358550 netlink: consistently use NLA_POLICY_MIN_LEN()
Change places that open-code NLA_POLICY_MIN_LEN() to
use the macro instead, giving us flexibility in how we
handle the details of the macro.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-18 12:28:45 -07:00
Johannes Berg
8140860c81 netlink: consistently use NLA_POLICY_EXACT_LEN()
Change places that open-code NLA_POLICY_EXACT_LEN() to
use the macro instead, giving us flexibility in how we
handle the details of the macro.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-18 12:28:45 -07:00
Jussi Kivilinna
279e89b228 batman-adv: bla: use netif_rx_ni when not in interrupt context
batadv_bla_send_claim() gets called from worker thread context through
batadv_bla_periodic_work(), thus netif_rx_ni needs to be used in that
case. This fixes "NOHZ: local_softirq_pending 08" log messages seen
when batman-adv is enabled.

Fixes: 23721387c4 ("batman-adv: add basic bridge loop avoidance code")
Signed-off-by: Jussi Kivilinna <jussi.kivilinna@haltian.com>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-08-18 19:40:03 +02:00
Linus Lüssing
d8bf0c0164 batman-adv: Fix own OGM check in aggregated OGMs
The own OGM check is currently misplaced and can lead to the following
issues:

For one thing we might receive an aggregated OGM from a neighbor node
which has our own OGM in the first place. We would then not only skip
our own OGM but erroneously also any other, following OGM in the
aggregate.

For another, we might receive an OGM aggregate which has our own OGM in
a place other then the first one. Then we would wrongly not skip this
OGM, leading to populating the orginator and gateway table with ourself.

Fixes: 9323158ef9 ("batman-adv: OGMv2 - implement originators logic")
Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-08-18 19:40:03 +02:00
Sven Eckelmann
303216e76d batman-adv: Avoid uninitialized chaddr when handling DHCP
The gateway client code can try to optimize the delivery of DHCP packets to
avoid broadcasting them through the whole mesh. But also transmissions to
the client can be optimized by looking up the destination via the chaddr of
the DHCP packet.

But the chaddr is currently only done when chaddr is fully inside the
non-paged area of the skbuff. Otherwise it will not be initialized and the
unoptimized path should have been taken.

But the implementation didn't handle this correctly. It didn't retrieve the
correct chaddr but still tried to perform the TT lookup with this
uninitialized memory.

Reported-by: syzbot+ab16e463b903f5a37036@syzkaller.appspotmail.com
Fixes: 6c413b1c22 ("batman-adv: send every DHCP packet as bat-unicast")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-08-18 19:40:03 +02:00
Sven Eckelmann
0093870aa8 batman-adv: Migrate to linux/prandom.h
The commit c0842fbc1b ("random32: move the pseudo-random 32-bit
definitions to prandom.h") introduced a new header for the pseudo random
functions from (previously) linux/random.h. One future goal of the
prandom.h change is to make code to switch just the new header file and to
avoid the implicit include. This would allow the removal of the implicit
include from random.h

Signed-off-by: Sven Eckelmann <sven@narfation.org>
2020-08-18 19:39:54 +02:00
Sven Eckelmann
21ba5ab2aa batman-adv: Drop repeated words in comments
checkpatch found various instances of "Possible repeated word" in various
comments.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-08-18 19:39:54 +02:00
Randy Dunlap
6f5b92a79c batman-adv: types.h: delete duplicated words
Delete the doubled word "time" in a comment.
Delete the doubled word "address" in a comment.

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-08-18 19:39:53 +02:00
Sven Eckelmann
c3b92dd490 batman-adv: Drop unused function batadv_hardif_remove_interfaces()
The function batadv_hardif_remove_interfaces was meant to remove all
interfaces which are currently in the list of known (compatible) hardifs
during module unload. But the function unregister_netdevice_notifier is
called in batadv_exit before batadv_hardif_remove_interfaces. This will
trigger NETDEV_UNREGISTER events for all available interfaces and in this
process remove all interfaces from batadv_hardif_list. And
batadv_hardif_remove_interfaces only operated on this (empty) list.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-08-18 19:39:53 +02:00
Simon Wunderlich
426988ee84 batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2020-08-18 19:39:53 +02:00
Linus Torvalds
4cf7562190 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
 "Another batch of fixes:

  1) Remove nft_compat counter flush optimization, it generates warnings
     from the refcount infrastructure. From Florian Westphal.

  2) Fix BPF to search for build id more robustly, from Jiri Olsa.

  3) Handle bogus getopt lengths in ebtables, from Florian Westphal.

  4) Infoleak and other fixes to j1939 CAN driver, from Eric Dumazet and
     Oleksij Rempel.

  5) Reset iter properly on mptcp sendmsg() error, from Florian
     Westphal.

  6) Show a saner speed in bonding broadcast mode, from Jarod Wilson.

  7) Various kerneldoc fixes in bonding and elsewhere, from Lee Jones.

  8) Fix double unregister in bonding during namespace tear down, from
     Cong Wang.

  9) Disable RP filter during icmp_redirect selftest, from David Ahern"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (75 commits)
  otx2_common: Use devm_kcalloc() in otx2_config_npa()
  net: qrtr: fix usage of idr in port assignment to socket
  selftests: disable rp_filter for icmp_redirect.sh
  Revert "net: xdp: pull ethernet header off packet after computing skb->protocol"
  phylink: <linux/phylink.h>: fix function prototype kernel-doc warning
  mptcp: sendmsg: reset iter on error redux
  net: devlink: Remove overzealous WARN_ON with snapshots
  tipc: not enable tipc when ipv6 works as a module
  tipc: fix uninit skb->data in tipc_nl_compat_dumpit()
  net: Fix potential wrong skb->protocol in skb_vlan_untag()
  net: xdp: pull ethernet header off packet after computing skb->protocol
  ipvlan: fix device features
  bonding: fix a potential double-unregister
  can: j1939: add rxtimer for multipacket broadcast session
  can: j1939: abort multipacket broadcast session when timeout occurs
  can: j1939: cancel rxtimer on multipacket broadcast session complete
  can: j1939: fix support for multipacket broadcast message
  net: fddi: skfp: cfm: Remove seemingly unused variable 'ID_sccs'
  net: fddi: skfp: cfm: Remove set but unused variable 'oldstate'
  net: fddi: skfp: smt: Remove seemingly unused variable 'ID_sccs'
  ...
2020-08-17 17:09:50 -07:00
Necip Fazil Yildiran
8dfddfb796 net: qrtr: fix usage of idr in port assignment to socket
Passing large uint32 sockaddr_qrtr.port numbers for port allocation
triggers a warning within idr_alloc() since the port number is cast
to int, and thus interpreted as a negative number. This leads to
the rejection of such valid port numbers in qrtr_port_assign() as
idr_alloc() fails.

To avoid the problem, switch to idr_alloc_u32() instead.

Fixes: bdabad3e36 ("net: Add Qualcomm IPC router")
Reported-by: syzbot+f31428628ef672716ea8@syzkaller.appspotmail.com
Signed-off-by: Necip Fazil Yildiran <necip@google.com>
Reviewed-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-17 15:00:41 -07:00
David S. Miller
7f9bf6e824 Revert "net: xdp: pull ethernet header off packet after computing skb->protocol"
This reverts commit f8414a8d88.

eth_type_trans() does the necessary pull on the skb.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-17 11:48:05 -07:00
Sabrina Dubroca
4eb2e13415 espintcp: restore IP CB before handing the packet to xfrm
Xiumei reported a bug with espintcp over IPv6 in transport mode,
because xfrm6_transport_finish expects to find IP6CB data (struct
inet6_skb_cb). Currently, espintcp zeroes the CB, but the relevant
part is actually preserved by previous layers (first set up by tcp,
then strparser only zeroes a small part of tcp_skb_tb), so we can just
relocate it to the start of skb->cb.

Fixes: e27cca96cd ("xfrm: add espintcp (RFC 8229)")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-08-17 15:58:04 +02:00
Florian Westphal
b3b2854dcf mptcp: sendmsg: reset iter on error redux
This fix wasn't correct: When this function is invoked from the
retransmission worker, the iterator contains garbage and resetting
it causes a crash.

As the work queue should not be performance critical also zero the
msghdr struct.

Fixes: 3575938313 "(mptcp: sendmsg: reset iter on error)"
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-16 21:11:37 -07:00
Andrew Lunn
bd71ea6067 net: devlink: Remove overzealous WARN_ON with snapshots
It is possible to trigger this WARN_ON from user space by triggering a
devlink snapshot with an ID which already exists. We don't need both
-EEXISTS being reported and spamming the kernel log.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Tested-by: Chris Healy <cphealy@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-16 21:07:06 -07:00
Xin Long
c530189905 tipc: not enable tipc when ipv6 works as a module
When using ipv6_dev_find() in one module, it requires ipv6 not to
work as a module. Otherwise, this error occurs in build:

  undefined reference to `ipv6_dev_find'.

So fix it by adding "depends on IPV6 || IPV6=n" to tipc/Kconfig,
as it does in sctp/Kconfig.

Fixes: 5a6f6f5791 ("tipc: set ub->ifindex for local ipv6 address")
Reported-by: kernel test robot <lkp@intel.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-16 21:04:55 -07:00
Cong Wang
47733f9daf tipc: fix uninit skb->data in tipc_nl_compat_dumpit()
__tipc_nl_compat_dumpit() has two callers, and it expects them to
pass a valid nlmsghdr via arg->data. This header is artificial and
crafted just for __tipc_nl_compat_dumpit().

tipc_nl_compat_publ_dump() does so by putting a genlmsghdr as well
as some nested attribute, TIPC_NLA_SOCK. But the other caller
tipc_nl_compat_dumpit() does not, this leaves arg->data uninitialized
on this call path.

Fix this by just adding a similar nlmsghdr without any payload in
tipc_nl_compat_dumpit().

This bug exists since day 1, but the recent commit 6ea67769ff
("net: tipc: prepare attrs in __tipc_nl_compat_dumpit()") makes it
easier to appear.

Reported-and-tested-by: syzbot+0e7181deafa7e0b79923@syzkaller.appspotmail.com
Fixes: d0796d1ef6 ("tipc: convert legacy nl bearer dump to nl compat")
Cc: Jon Maloy <jmaloy@redhat.com>
Cc: Ying Xue <ying.xue@windriver.com>
Cc: Richard Alpe <richard.alpe@ericsson.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-16 21:03:19 -07:00
David S. Miller
8c26544f5a Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Endianness issue in IPv4 option support in nft_exthdr,
   from Stephen Suryaputra.

2) Removes the waitcount optimization in nft_compat,
   from Florian Westphal.

3) Remove ipv6 -> nf_defrag_ipv6 module dependency, from
   Florian Westphal.

4) Memleak in chain binding support, also from Florian.

5) Simplify nft_flowtable.sh selftest, from Fabian Frederick.

6) Optional MTU arguments for selftest nft_flowtable.sh,
   also from Fabian.

7) Remove noise error report when killing process in
   selftest nft_flowtable.sh, from Fabian Frederick.

8) Reject bogus getsockopt option length in ebtables,
   from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-16 16:05:36 -07:00
David S. Miller
71a50419c7 linux-can-fixes-for-5.9-20200815
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAl83p3MTHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRCpyVqK+u3vqcsSB/9YMSKfcLUNv8eB/J3kPb4n6wlrBO8d
 BLHGddA5+qVbki5Tx3rQ7qA+fj60v2bVWQYiwC4okBIE3h+uIi/vq+S+oYjA7ncA
 E6bd0804AgKYbsGWaYMQOrOfVlLyJOk886XOsd2szkpopdPCdgHz45RVgx+wXkk/
 4fUgeiGuBSNI7Kgei6w+OrMqB+dXyUZR5D0MePobVnN4vMjEZd4UsD6TAB3y4C3L
 Ar9lAj/uPiiUEq/ow7J4kYtfw3DibMFM6JRAJ5+B7dXmBZ4TeAPxlG9XxmAKKHMX
 R/QxkyZnkE3AIba6YHpIeYLQZyFa41363W5Z70rsA+fQ5Mig5jWXoNnY
 =hWSB
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-5.9-20200815' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2020-08-15

this is a pull request of 4 patches for net/master.

All patches are by Zhang Changzhong and fix broadcast related problems in the
j1939 CAN networking stack.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-16 16:04:14 -07:00
Miaohe Lin
55eff0eb74 net: Fix potential wrong skb->protocol in skb_vlan_untag()
We may access the two bytes after vlan_hdr in vlan_set_encap_proto(). So
we should pull VLAN_HLEN + sizeof(unsigned short) in skb_vlan_untag() or
we may access the wrong data.

Fixes: 0d5501c1c8 ("net: Always untag vlan-tagged traffic on input.")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-16 15:33:49 -07:00
Jason A. Donenfeld
f8414a8d88 net: xdp: pull ethernet header off packet after computing skb->protocol
When an XDP program changes the ethernet header protocol field,
eth_type_trans is used to recalculate skb->protocol. In order for
eth_type_trans to work correctly, the ethernet header must actually be
part of the skb data segment, so the code first pushes that onto the
head of the skb. However, it subsequently forgets to pull it back off,
making the behavior of the passed-on packet inconsistent between the
protocol modifying case and the static protocol case. This patch fixes
the issue by simply pulling the ethernet header back off of the skb
head.

Fixes: 2972495699 ("net: fix generic XDP to handle if eth header was mangled")
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-16 15:28:18 -07:00
Linus Torvalds
410520d07f 9p pull request for inclusion in 5.9
- some code cleanup
 - a couple of static analysis fixes
 - setattr: try to pick a fid associated with the file rather than the
 dentry, which might sometimes matter
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAl83dvkACgkQq06b7GqY
 5nDk4g/+KNpbIFHhDyhCzmZ1WbaUYlZseFm8VBR9dzwOkXCuVl+WQd9ducYlEifT
 cZ29xmEuom01LhFOs6VzC+IODQq3h1WvEbVRfCSlLEWvfHyPODSDY8mrn+1TdOZL
 UOOP1wUjaDqmAcBQ4PSDNmj2cONNWPqkgnuyswjsBpabOY/U8FxG9J/QiPfC12zs
 QoF5800OiuLYS5SefI9kKJ47DkTLpm/aq7/Z1liAxJBzHrguORjg16cssXhWeysD
 jEMD3pIrsqMDX/vNkAO0DaR8FWvxTFZR1Bza1ylJiUy3K3rLXKfo+42jiCnuo5e9
 uhA012RfiCZB3kIsLe5hCPXUWz0MhizaJAPkceHsgUtkuiU2qXyY05HEvS2HHl0s
 inqR8gRkZ2NhROx1R/0q+Wu2posCBderDwg7Mkzt/7j5Ue3cTIKvbeBxH9iLLaK+
 rHp0NbfP6wYN5qKjH1rru1IOmaUCV2HHV6x/alnwaVBfZqg5HYKkzlS7M+5hAbtb
 niLFXXCtPBGqS+u9w7sruj6utuiJjXBfB1ejaawY+frS41+v3NkHjjUWCH67u8I9
 u3pmPpvm5o7gRoRmvmxEdSb+jAYZwLTcbkUhRMHLQOHCBZXtcul41p2BUpoBm0zw
 42mlMMMBAjaxtzLlryvACg5lx6x7ykE4pExKAc0dzrzWfgzwEZQ=
 =hwGq
 -----END PGP SIGNATURE-----

Merge tag '9p-for-5.9-rc1' of git://github.com/martinetd/linux

Pull 9p updates from Dominique Martinet:

 - some code cleanup

 - a couple of static analysis fixes

 - setattr: try to pick a fid associated with the file rather than the
   dentry, which might sometimes matter

* tag '9p-for-5.9-rc1' of git://github.com/martinetd/linux:
  9p: Remove unneeded cast from memory allocation
  9p: remove unused code in 9p
  net/9p: Fix sparse endian warning in trans_fd.c
  9p: Fix memory leak in v9fs_mount
  9p: retrieve fid from file when file instance exist.
2020-08-15 08:34:36 -07:00
Linus Torvalds
37711e5e23 NFS client updates for Linux 5.9
Highlights include:
 
 Stable fixes:
 - pNFS: Don't return layout segments that are being used for I/O
 - pNFS: Don't move layout segments off the active list when being used for I/O
 
 Features:
 - NFS: Add support for user xattrs through the NFSv4.2 protocol
 - NFS: Allow applications to speed up readdir+statx() using AT_STATX_DONT_SYNC
 - NFSv4.0 allow nconnect for v4.0
 
 Bugfixes and cleanups:
 - nfs: ensure correct writeback errors are returned on close()
 - nfs: nfs_file_write() should check for writeback errors
 - nfs: Fix getxattr kernel panic and memory overflow
 - NFS: Fix the pNFS/flexfiles mirrored read failover code
 - SUNRPC: dont update timeout value on connection reset
 - freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS
 - sunrpc: destroy rpc_inode_cachep after unregister_filesystem
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEESQctxSBg8JpV8KqEZwvnipYKAPIFAl83CYEACgkQZwvnipYK
 APLx4g/9EQNTG5VkUToElRHs4b3vFxbmh9Odnk/JwPHxY5GQ8/AyGqwWHBgMZc7/
 2AV/C83pk7pJsDNsKbVAaFCT1cwjmItHM63vKJYGBYbE8LhkZ/1sYkdtPBYwHoVl
 7CWfpVY/4NjYw5GJfrVA5Y0m7lrQInRtIMzfaENw2tpw+/cKUpadxgEJltzFNvpa
 Ploinr1ZRBl1tvfeHNRP5ZMPk2AfgGWtQKQ/b2UWUk5tXALoQm2Eu+/oku39uqhy
 hW5tCbU2BzR91gg5JwF9n7VowkCHXfe7lMzDBTVfwZOELOmoyys/1wKv550FWcWi
 yymljWiPGZOnXGT1vfKptPESQjdtElMfanvEZ0BzS+yNR0HZGnIupaxGlYlG9ZGU
 2sXHQPp3mk2Q+L1IgbTSCnSju0YlZo32JQpYCZiROjIXnPWPQ50YNhr8GL18M1FW
 hTeShg2avWH+59GB6moEBmsuvui7Dy1jkimblToLEoGJ4kbvEl72FYSqTCkAXXbB
 rVUzhmJFgfk/EOS4d+QKJoBqNzw3aw79wyT7PLkoCYBqPZBHQexlmI+ktMbgUEdw
 c/fM7l5/Vcb9weIHKzul2Jbk5q6bFME/xPnIkr3v/oKIFlLFwQ04BX6R/42AMJHw
 V5Q9Wp81Vy6RXQMn8P0ZMeY0WQC/rhpijOEVMUC+Ni+spz44AdM=
 =k4pE
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-5.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client updates from Trond Myklebust:
 "Stable fixes:
   - pNFS: Don't return layout segments that are being used for I/O
   - pNFS: Don't move layout segments off the active list when being used for I/O

  Features:
   - NFS: Add support for user xattrs through the NFSv4.2 protocol
   - NFS: Allow applications to speed up readdir+statx() using AT_STATX_DONT_SYNC
   - NFSv4.0 allow nconnect for v4.0

  Bugfixes and cleanups:
   - nfs: ensure correct writeback errors are returned on close()
   - nfs: nfs_file_write() should check for writeback errors
   - nfs: Fix getxattr kernel panic and memory overflow
   - NFS: Fix the pNFS/flexfiles mirrored read failover code
   - SUNRPC: dont update timeout value on connection reset
   - freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS
   - sunrpc: destroy rpc_inode_cachep after unregister_filesystem"

* tag 'nfs-for-5.9-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: (32 commits)
  NFS: Fix flexfiles read failover
  fs: nfs: delete repeated words in comments
  rpc_pipefs: convert comma to semicolon
  nfs: Fix getxattr kernel panic and memory overflow
  NFS: Don't return layout segments that are in use
  NFS: Don't move layouts to plh_return_segs list while in use
  NFS: Add layout segment info to pnfs read/write/commit tracepoints
  NFS: Add tracepoints for layouterror and layoutstats.
  NFS: Report the stateid + status in trace_nfs4_layoutreturn_on_close()
  SUNRPC dont update timeout value on connection reset
  nfs: nfs_file_write() should check for writeback errors
  nfs: ensure correct writeback errors are returned on close()
  NFSv4.2: xattr cache: get rid of cache discard work queue
  NFS: remove redundant initialization of variable result
  NFSv4.0 allow nconnect for v4.0
  freezer: Add unsafe versions of freezable_schedule_timeout_interruptible for NFS
  sunrpc: destroy rpc_inode_cachep after unregister_filesystem
  NFSv4.2: add client side xattr caching.
  NFSv4.2: hook in the user extended attribute handlers
  NFSv4.2: add the extended attribute proc functions.
  ...
2020-08-15 08:26:55 -07:00
Zhang Changzhong
0ae18a8268 can: j1939: add rxtimer for multipacket broadcast session
According to SAE J1939/21 (Chapter 5.12.3 and APPENDIX C), for transmit side
the required time interval between packets of a multipacket broadcast message
is 50 to 200 ms, the responder shall use a timeout of 250ms (provides margin
allowing for the maximumm spacing of 200ms). For receive side a timeout will
occur when a time of greater than 750 ms elapsed between two message packets
when more packets were expected.

So this patch fix and add rxtimer for multipacket broadcast session.

Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Link: https://lore.kernel.org/r/1596599425-5534-5-git-send-email-zhangchangzhong@huawei.com
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-08-15 11:12:58 +02:00
Zhang Changzhong
2b8b2e3155 can: j1939: abort multipacket broadcast session when timeout occurs
If timeout occurs, j1939_tp_rxtimer() first calls hrtimer_start() to restart
rxtimer, and then calls __j1939_session_cancel() to set session->state =
J1939_SESSION_WAITING_ABORT. At next timeout expiration, because of the
J1939_SESSION_WAITING_ABORT session state j1939_tp_rxtimer() will call
j1939_session_deactivate_activate_next() to deactivate current session, and
rxtimer won't be set.

But for multipacket broadcast session, __j1939_session_cancel() don't set
session->state = J1939_SESSION_WAITING_ABORT, thus current session won't be
deactivate and hrtimer_start() is called to start new rxtimer again and again.

So fix it by moving session->state = J1939_SESSION_WAITING_ABORT out of if
(!j1939_cb_is_broadcast(&session->skcb)) statement.

Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Link: https://lore.kernel.org/r/1596599425-5534-4-git-send-email-zhangchangzhong@huawei.com
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-08-15 11:12:58 +02:00
Zhang Changzhong
e8b1765308 can: j1939: cancel rxtimer on multipacket broadcast session complete
If j1939_xtp_rx_dat_one() receive last frame of multipacket broadcast message,
j1939_session_timers_cancel() should be called to cancel rxtimer.

Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Link: https://lore.kernel.org/r/1596599425-5534-3-git-send-email-zhangchangzhong@huawei.com
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-08-15 11:12:57 +02:00
Zhang Changzhong
f4fd77fd87 can: j1939: fix support for multipacket broadcast message
Currently j1939_tp_im_involved_anydir() in j1939_tp_recv() check the previously
set flags J1939_ECU_LOCAL_DST and J1939_ECU_LOCAL_SRC of incoming skb, thus
multipacket broadcast message was aborted by receive side because it may come
from remote ECUs and have no exact dst address. Similarly, j1939_tp_cmd_recv()
and j1939_xtp_rx_dat() didn't process broadcast message.

So fix it by checking and process broadcast message in j1939_tp_recv(),
j1939_tp_cmd_recv() and j1939_xtp_rx_dat().

Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Zhang Changzhong <zhangchangzhong@huawei.com>
Link: https://lore.kernel.org/r/1596599425-5534-2-git-send-email-zhangchangzhong@huawei.com
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-08-15 11:12:57 +02:00
David S. Miller
10a3b7c1c3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-08-15

The following pull-request contains BPF updates for your *net* tree.

We've added 23 non-merge commits during the last 4 day(s) which contain
a total of 32 files changed, 421 insertions(+), 141 deletions(-).

The main changes are:

1) Fix sock_ops ctx access splat due to register override, from John Fastabend.

2) Batch of various fixes to libbpf, bpftool, and selftests when testing build
   in 32-bit mode, from Andrii Nakryiko.

3) Fix vmlinux.h generation on ARM by mapping GCC built-in types (__Poly*_t)
   to equivalent ones clang can work with, from Jean-Philippe Brucker.

4) Fix build_id lookup in bpf_get_stackid() helper by walking all NOTE ELF
   sections instead of just first, from Jiri Olsa.

5) Avoid use of __builtin_offsetof() in libbpf for CO-RE, from Yonghong Song.

6) Fix segfault in test_mmap due to inconsistent length params, from Jianlin Lv.

7) Don't override errno in libbpf when logging errors, from Toke Høiland-Jørgensen.

8) Fix v4_to_v6 sockaddr conversion in sk_lookup test, from Stanislav Fomichev.

9) Add link to bpf-helpers(7) man page to BPF doc, from Joe Stringer.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-14 17:12:23 -07:00
Florian Westphal
3575938313 mptcp: sendmsg: reset iter on error
Once we've copied data from the iterator we need to revert in case we
end up not sending any data.

This bug doesn't trigger with normal 'poll' based tests, because
we only feed a small chunk of data to kernel after poll indicated
POLLOUT.  With blocking IO and large writes this triggers. Receiver
ends up with less data than it should get.

Fixes: 72511aab95 ("mptcp: avoid blocking in tcp_sendpages")
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-14 14:11:37 -07:00
David S. Miller
e591d298cc linux-can-fixes-for-5.9-20200814
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEK3kIWJt9yTYMP3ehqclaivrt76kFAl82a18THG1rbEBwZW5n
 dXRyb25peC5kZQAKCRCpyVqK+u3vqVbYB/9wF0ekQk66ES0PlIKkTDqDlt8hH8gN
 a86yOFow3jjOWOlol1wDyCqh4i+JD0PgRPTCmXJoFYfXUsNwjRQt2ZqVgOlS0dR3
 wmxxNw0i7SQ3+LDEf3Dofgk39Y5+VSQGfT3fvM8FfHWi9Mkxmswi2EwCBMx7Cqyp
 lghxsJyZoeSnwfx7Wcel75NnYID1WscKUyXR6Qd240V1h7s3oAt6ZdDnYLPJnd7M
 zeRSY+HgDa98HBUMmQi627douQcTtXyE5Zpx/PkkRGEOYIS/SEF8XmA2cQ0Vu8R/
 R/u/htBqoV7OlL5XdyH1l2+HDOu0M1EKZiNWYMV34yobUqkrGg0oiJut
 =NuYv
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-fixes-for-5.9-20200814' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can

Marc Kleine-Budde says:

====================
pull-request: can 2020-08-14

this is a pull request of 6 patches for net/master. All patches fix problems in
the j1939 CAN networking stack.

The first patch is by Eric Dumazet fixes a kernel-infoleak in
j1939_sk_sock2sockaddr_can().

The remaining 5 patches are by Oleksij Rempel and fix recption of j1939
messages not orginated by the stack, a use-after-free in j1939_tp_txtimer(),
ensure that the CAN driver has a ml_priv allocated. These problem were found by
google's syzbot. Further ETP sessions with block size of less than 255 are
fixed and a sanity check was added to j1939_xtp_rx_dat_one() to detect packet
corruption.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-14 13:57:56 -07:00
Oleksij Rempel
e052d05402 can: j1939: transport: j1939_xtp_rx_dat_one(): compare own packets to detect corruptions
Since the stack relays on receiving own packets, it was overwriting own
transmit buffer from received packets.

At least theoretically, the received echo buffer can be corrupt or
changed and the session partner can request to resend previous data. In
this case we will re-send bad data.

With this patch we will stop to overwrite own TX buffer and use it for
sanity checking.

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20200807105200.26441-6-o.rempel@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-08-14 12:38:47 +02:00
Oleksij Rempel
840835c928 can: j1939: transport: add j1939_session_skb_find_by_offset() function
Sometimes it makes no sense to search the skb by pkt.dpo, since we need
next the skb within the transaction block. This may happen if we have an
ETP session with CTS set to less than 255 packets.

After this patch, we will be able to work with ETP sessions where the
block size (ETP.CM_CTS byte 2) is less than 255 packets.

Reported-by: Henrique Figueira <henrislip@gmail.com>
Reported-by: https://github.com/linux-can/can-utils/issues/228
Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20200807105200.26441-5-o.rempel@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-08-14 12:38:47 +02:00
Oleksij Rempel
af804b7826 can: j1939: socket: j1939_sk_bind(): make sure ml_priv is allocated
This patch adds check to ensure that the struct net_device::ml_priv is
allocated, as it is used later by the j1939 stack.

The allocation is done by all mainline CAN network drivers, but when using
bond or team devices this is not the case.

Bail out if no ml_priv is allocated.

Reported-by: syzbot+f03d384f3455d28833eb@syzkaller.appspotmail.com
Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Cc: linux-stable <stable@vger.kernel.org> # >= v5.4
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20200807105200.26441-4-o.rempel@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-08-14 12:38:47 +02:00
Oleksij Rempel
cd3b3636c9 can: j1939: transport: j1939_session_tx_dat(): fix use-after-free read in j1939_tp_txtimer()
The current stack implementation do not support ECTS requests of not
aligned TP sized blocks.

If ECTS will request a block with size and offset spanning two TP
blocks, this will cause memcpy() to read beyond the queued skb (which
does only contain one TP sized block).

Sometimes KASAN will detect this read if the memory region beyond the
skb was previously allocated and freed. In other situations it will stay
undetected. The ETP transfer in any case will be corrupted.

This patch adds a sanity check to avoid this kind of read and abort the
session with error J1939_XTP_ABORT_ECTS_TOO_BIG.

Reported-by: syzbot+5322482fe520b02aea30@syzkaller.appspotmail.com
Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Cc: linux-stable <stable@vger.kernel.org> # >= v5.4
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20200807105200.26441-3-o.rempel@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-08-14 12:38:47 +02:00
Oleksij Rempel
b43e3a82bc can: j1939: transport: j1939_simple_recv(): ignore local J1939 messages send not by J1939 stack
In current J1939 stack implementation, we process all locally send
messages as own messages. Even if it was send by CAN_RAW socket.

To reproduce it use following commands:
testj1939 -P -r can0:0x80 &
cansend can0 18238040#0123

This step will trigger false positive not critical warning:
j1939_simple_recv: Received already invalidated message

With this patch we add additional check to make sure, related skb is own
echo message.

Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20200807105200.26441-2-o.rempel@pengutronix.de
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-08-14 12:38:47 +02:00
Eric Dumazet
38ba8b9241 can: j1939: fix kernel-infoleak in j1939_sk_sock2sockaddr_can()
syzbot found that at least 2 bytes of kernel information
were leaked during getsockname() on AF_CAN CAN_J1939 socket.

Since struct sockaddr_can has in fact two holes, simply
clear the whole area before filling it with useful data.

BUG: KMSAN: kernel-infoleak in kmsan_copy_to_user+0x81/0x90 mm/kmsan/kmsan_hooks.c:253
CPU: 0 PID: 8466 Comm: syz-executor511 Not tainted 5.8.0-rc5-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x21c/0x280 lib/dump_stack.c:118
 kmsan_report+0xf7/0x1e0 mm/kmsan/kmsan_report.c:121
 kmsan_internal_check_memory+0x238/0x3d0 mm/kmsan/kmsan.c:423
 kmsan_copy_to_user+0x81/0x90 mm/kmsan/kmsan_hooks.c:253
 instrument_copy_to_user include/linux/instrumented.h:91 [inline]
 _copy_to_user+0x18e/0x260 lib/usercopy.c:39
 copy_to_user include/linux/uaccess.h:186 [inline]
 move_addr_to_user+0x3de/0x670 net/socket.c:237
 __sys_getsockname+0x407/0x5e0 net/socket.c:1909
 __do_sys_getsockname net/socket.c:1920 [inline]
 __se_sys_getsockname+0x91/0xb0 net/socket.c:1917
 __x64_sys_getsockname+0x4a/0x70 net/socket.c:1917
 do_syscall_64+0xad/0x160 arch/x86/entry/common.c:386
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x440219
Code: Bad RIP value.
RSP: 002b:00007ffe5ee150c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000033
RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 0000000000440219
RDX: 0000000020000240 RSI: 0000000020000100 RDI: 0000000000000003
RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8
R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401a20
R13: 0000000000401ab0 R14: 0000000000000000 R15: 0000000000000000

Local variable ----address@__sys_getsockname created at:
 __sys_getsockname+0x91/0x5e0 net/socket.c:1894
 __sys_getsockname+0x91/0x5e0 net/socket.c:1894

Bytes 2-3 of 24 are uninitialized
Memory access of size 24 starts at ffff8880ba2c7de8
Data copied to user address 0000000020000100

Fixes: 9d71dd0c70 ("can: add support of SAE J1939 protocol")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Robin van der Gracht <robin@protonic.nl>
Cc: Oleksij Rempel <o.rempel@pengutronix.de>
Cc: Pengutronix Kernel Team <kernel@pengutronix.de>
Cc: linux-can@vger.kernel.org
Acked-by: Oleksij Rempel <o.rempel@pengutronix.de>
Link: https://lore.kernel.org/r/20200813161834.4021638-1-edumazet@google.com
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2020-08-14 12:31:10 +02:00
Florian Westphal
5c04da55c7 netfilter: ebtables: reject bogus getopt len value
syzkaller reports splat:
------------[ cut here ]------------
Buffer overflow detected (80 < 137)!
Call Trace:
 do_ebt_get_ctl+0x2b4/0x790 net/bridge/netfilter/ebtables.c:2317
 nf_getsockopt+0x72/0xd0 net/netfilter/nf_sockopt.c:116
 ip_getsockopt net/ipv4/ip_sockglue.c:1778 [inline]

caused by a copy-to-user with a too-large "*len" value.
This adds a argument check on *len just like in the non-compat version
of the handler.

Before the "Fixes" commit, the reproducer fails with -EINVAL as
expected:
1. core calls the "compat" getsockopt version
2. compat getsockopt version detects the *len value is possibly
   in 64-bit layout (*len != compat_len)
3. compat getsockopt version delegates everything to native getsockopt
   version
4. native getsockopt rejects invalid *len

-> compat handler only sees len == sizeof(compat_struct) for GET_ENTRIES.

After the refactor, event sequence is:
1. getsockopt calls "compat" version (len != native_len)
2. compat version attempts to copy *len bytes, where *len is random
   value from userspace

Fixes: fc66de8e16 ("netfilter/ebtables: clean up compat {get, set}sockopt handling")
Reported-by: syzbot+5accb5c62faa1d346480@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-14 11:59:08 +02:00
Linus Torvalds
a1d21081a6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
 "Some merge window fallout, some longer term fixes:

   1) Handle headroom properly in lapbether and x25_asy drivers, from
      Xie He.

   2) Fetch MAC address from correct r8152 device node, from Thierry
      Reding.

   3) In the sw kTLS path we should allow MSG_CMSG_COMPAT in sendmsg,
      from Rouven Czerwinski.

   4) Correct fdputs in socket layer, from Miaohe Lin.

   5) Revert troublesome sockptr_t optimization, from Christoph Hellwig.

   6) Fix TCP TFO key reading on big endian, from Jason Baron.

   7) Missing CAP_NET_RAW check in nfc, from Qingyu Li.

   8) Fix inet fastreuse optimization with tproxy sockets, from Tim
      Froidcoeur.

   9) Fix 64-bit divide in new SFC driver, from Edward Cree.

  10) Add a tracepoint for prandom_u32 so that we can more easily
      perform usage analysis. From Eric Dumazet.

  11) Fix rwlock imbalance in AF_PACKET, from John Ogness"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (49 commits)
  net: openvswitch: introduce common code for flushing flows
  af_packet: TPACKET_V3: fix fill status rwlock imbalance
  random32: add a tracepoint for prandom_u32()
  Revert "ipv4: tunnel: fix compilation on ARCH=um"
  net: accept an empty mask in /sys/class/net/*/queues/rx-*/rps_cpus
  net: ethernet: stmmac: Disable hardware multicast filter
  net: stmmac: dwmac1000: provide multicast filter fallback
  ipv4: tunnel: fix compilation on ARCH=um
  vsock: fix potential null pointer dereference in vsock_poll()
  sfc: fix ef100 design-param checking
  net: initialize fastreuse on inet_inherit_port
  net: refactor bind_bucket fastreuse into helper
  net: phy: marvell10g: fix null pointer dereference
  net: Fix potential memory leak in proto_register()
  net: qcom/emac: add missed clk_disable_unprepare in error path of emac_clks_phase1_init
  ionic_lif: Use devm_kcalloc() in ionic_qcq_alloc()
  net/nfc/rawsock.c: add CAP_NET_RAW check.
  hinic: fix strncpy output truncated compile warnings
  drivers/net/wan/x25_asy: Added needed_headroom and a skb->len check
  net/tls: Fix kmap usage
  ...
2020-08-13 20:03:11 -07:00
Tonghao Zhang
1f3a090b90 net: openvswitch: introduce common code for flushing flows
To avoid some issues, for example RCU usage warning and double free,
we should flush the flows under ovs_lock. This patch refactors
table_instance_destroy and introduces table_instance_flow_flush
which can be invoked by __dp_destroy or ovs_flow_tbl_flush.

Fixes: 50b0e61b32 ("net: openvswitch: fix possible memleak on destroy flow-table")
Reported-by: Johan Knöös <jknoos@google.com>
Reported-at: https://mail.openvswitch.org/pipermail/ovs-discuss/2020-August/050489.html
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-13 15:53:30 -07:00
John Ogness
88fd1cb80d af_packet: TPACKET_V3: fix fill status rwlock imbalance
After @blk_fill_in_prog_lock is acquired there is an early out vnet
situation that can occur. In that case, the rwlock needs to be
released.

Also, since @blk_fill_in_prog_lock is only acquired when @tp_version
is exactly TPACKET_V3, only release it on that exact condition as
well.

And finally, add sparse annotation so that it is clearer that
prb_fill_curr_block() and prb_clear_blk_fill_status() are acquiring
and releasing @blk_fill_in_prog_lock, respectively. sparse is still
unable to understand the balance, but the warnings are now on a
higher level that make more sense.

Fixes: 632ca50f2c ("af_packet: TPACKET_V3: replace busy-wait loop")
Signed-off-by: John Ogness <john.ogness@linutronix.de>
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-13 15:37:30 -07:00
John Fastabend
84f44df664 bpf: sock_ops sk access may stomp registers when dst_reg = src_reg
Similar to patch ("bpf: sock_ops ctx access may stomp registers") if the
src_reg = dst_reg when reading the sk field of a sock_ops struct we
generate xlated code,

  53: (61) r9 = *(u32 *)(r9 +28)
  54: (15) if r9 == 0x0 goto pc+3
  56: (79) r9 = *(u64 *)(r9 +0)

This stomps on the r9 reg to do the sk_fullsock check and then when
reading the skops->sk field instead of the sk pointer we get the
sk_fullsock. To fix use similar pattern noted in the previous fix
and use the temp field to save/restore a register used to do
sk_fullsock check.

After the fix the generated xlated code reads,

  52: (7b) *(u64 *)(r9 +32) = r8
  53: (61) r8 = *(u32 *)(r9 +28)
  54: (15) if r9 == 0x0 goto pc+3
  55: (79) r8 = *(u64 *)(r9 +32)
  56: (79) r9 = *(u64 *)(r9 +0)
  57: (05) goto pc+1
  58: (79) r8 = *(u64 *)(r9 +32)

Here r9 register was in-use so r8 is chosen as the temporary register.
In line 52 r8 is saved in temp variable and at line 54 restored in case
fullsock != 0. Finally we handle fullsock == 0 case by restoring at
line 58.

This adds a new macro SOCK_OPS_GET_SK it is almost possible to merge
this with SOCK_OPS_GET_FIELD, but I found the extra branch logic a
bit more confusing than just adding a new macro despite a bit of
duplicating code.

Fixes: 1314ef5611 ("bpf: export bpf_sock for BPF_PROG_TYPE_SOCK_OPS prog type")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/159718349653.4728.6559437186853473612.stgit@john-Precision-5820-Tower
2020-08-13 22:40:40 +02:00
John Fastabend
fd09af0107 bpf: sock_ops ctx access may stomp registers in corner case
I had a sockmap program that after doing some refactoring started spewing
this splat at me:

[18610.807284] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001
[...]
[18610.807359] Call Trace:
[18610.807370]  ? 0xffffffffc114d0d5
[18610.807382]  __cgroup_bpf_run_filter_sock_ops+0x7d/0xb0
[18610.807391]  tcp_connect+0x895/0xd50
[18610.807400]  tcp_v4_connect+0x465/0x4e0
[18610.807407]  __inet_stream_connect+0xd6/0x3a0
[18610.807412]  ? __inet_stream_connect+0x5/0x3a0
[18610.807417]  inet_stream_connect+0x3b/0x60
[18610.807425]  __sys_connect+0xed/0x120

After some debugging I was able to build this simple reproducer,

 __section("sockops/reproducer_bad")
 int bpf_reproducer_bad(struct bpf_sock_ops *skops)
 {
        volatile __maybe_unused __u32 i = skops->snd_ssthresh;
        return 0;
 }

And along the way noticed that below program ran without splat,

__section("sockops/reproducer_good")
int bpf_reproducer_good(struct bpf_sock_ops *skops)
{
        volatile __maybe_unused __u32 i = skops->snd_ssthresh;
        volatile __maybe_unused __u32 family;

        compiler_barrier();

        family = skops->family;
        return 0;
}

So I decided to check out the code we generate for the above two
programs and noticed each generates the BPF code you would expect,

0000000000000000 <bpf_reproducer_bad>:
;       volatile __maybe_unused __u32 i = skops->snd_ssthresh;
       0:       r1 = *(u32 *)(r1 + 96)
       1:       *(u32 *)(r10 - 4) = r1
;       return 0;
       2:       r0 = 0
       3:       exit

0000000000000000 <bpf_reproducer_good>:
;       volatile __maybe_unused __u32 i = skops->snd_ssthresh;
       0:       r2 = *(u32 *)(r1 + 96)
       1:       *(u32 *)(r10 - 4) = r2
;       family = skops->family;
       2:       r1 = *(u32 *)(r1 + 20)
       3:       *(u32 *)(r10 - 8) = r1
;       return 0;
       4:       r0 = 0
       5:       exit

So we get reasonable assembly, but still something was causing the null
pointer dereference. So, we load the programs and dump the xlated version
observing that line 0 above 'r* = *(u32 *)(r1 +96)' is going to be
translated by the skops access helpers.

int bpf_reproducer_bad(struct bpf_sock_ops * skops):
; volatile __maybe_unused __u32 i = skops->snd_ssthresh;
   0: (61) r1 = *(u32 *)(r1 +28)
   1: (15) if r1 == 0x0 goto pc+2
   2: (79) r1 = *(u64 *)(r1 +0)
   3: (61) r1 = *(u32 *)(r1 +2340)
; volatile __maybe_unused __u32 i = skops->snd_ssthresh;
   4: (63) *(u32 *)(r10 -4) = r1
; return 0;
   5: (b7) r0 = 0
   6: (95) exit

int bpf_reproducer_good(struct bpf_sock_ops * skops):
; volatile __maybe_unused __u32 i = skops->snd_ssthresh;
   0: (61) r2 = *(u32 *)(r1 +28)
   1: (15) if r2 == 0x0 goto pc+2
   2: (79) r2 = *(u64 *)(r1 +0)
   3: (61) r2 = *(u32 *)(r2 +2340)
; volatile __maybe_unused __u32 i = skops->snd_ssthresh;
   4: (63) *(u32 *)(r10 -4) = r2
; family = skops->family;
   5: (79) r1 = *(u64 *)(r1 +0)
   6: (69) r1 = *(u16 *)(r1 +16)
; family = skops->family;
   7: (63) *(u32 *)(r10 -8) = r1
; return 0;
   8: (b7) r0 = 0
   9: (95) exit

Then we look at lines 0 and 2 above. In the good case we do the zero
check in r2 and then load 'r1 + 0' at line 2. Do a quick cross-check
into the bpf_sock_ops check and we can confirm that is the 'struct
sock *sk' pointer field. But, in the bad case,

   0: (61) r1 = *(u32 *)(r1 +28)
   1: (15) if r1 == 0x0 goto pc+2
   2: (79) r1 = *(u64 *)(r1 +0)

Oh no, we read 'r1 +28' into r1, this is skops->fullsock and then in
line 2 we read the 'r1 +0' as a pointer. Now jumping back to our spat,

[18610.807284] BUG: unable to handle kernel NULL pointer dereference at 0000000000000001

The 0x01 makes sense because that is exactly the fullsock value. And
its not a valid dereference so we splat.

To fix we need to guard the case when a program is doing a sock_ops field
access with src_reg == dst_reg. This is already handled in the load case
where the ctx_access handler uses a tmp register being careful to
store the old value and restore it. To fix the get case test if
src_reg == dst_reg and in this case do the is_fullsock test in the
temporary register. Remembering to restore the temporary register before
writing to either dst_reg or src_reg to avoid smashing the pointer into
the struct holding the tmp variable.

Adding this inline code to test_tcpbpf_kern will now be generated
correctly from,

  9: r2 = *(u32 *)(r2 + 96)

to xlated code,

  12: (7b) *(u64 *)(r2 +32) = r9
  13: (61) r9 = *(u32 *)(r2 +28)
  14: (15) if r9 == 0x0 goto pc+4
  15: (79) r9 = *(u64 *)(r2 +32)
  16: (79) r2 = *(u64 *)(r2 +0)
  17: (61) r2 = *(u32 *)(r2 +2348)
  18: (05) goto pc+1
  19: (79) r9 = *(u64 *)(r2 +32)

And in the normal case we keep the original code, because really this
is an edge case. From this,

  9: r2 = *(u32 *)(r6 + 96)

to xlated code,

  22: (61) r2 = *(u32 *)(r6 +28)
  23: (15) if r2 == 0x0 goto pc+2
  24: (79) r2 = *(u64 *)(r6 +0)
  25: (61) r2 = *(u32 *)(r2 +2348)

So three additional instructions if dst == src register, but I scanned
my current code base and did not see this pattern anywhere so should
not be a big deal. Further, it seems no one else has hit this or at
least reported it so it must a fairly rare pattern.

Fixes: 9b1f3d6e5a ("bpf: Refactor sock_ops_convert_ctx_access")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/159718347772.4728.2781381670567919577.stgit@john-Precision-5820-Tower
2020-08-13 22:40:36 +02:00
Florian Westphal
59136aa3b2 netfilter: nf_tables: free chain context when BINDING flag is missing
syzbot found a memory leak in nf_tables_addchain() because the chain
object is not free'd correctly on error.

Fixes: d0e2c7de92 ("netfilter: nf_tables: add NFT_CHAIN_BINDING")
Reported-by: syzbot+c99868fde67014f7e9f5@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-13 04:17:46 +02:00
Florian Westphal
2404b73c3f netfilter: avoid ipv6 -> nf_defrag_ipv6 module dependency
nf_ct_frag6_gather is part of nf_defrag_ipv6.ko, not ipv6 core.

The current use of the netfilter ipv6 stub indirections  causes a module
dependency between ipv6 and nf_defrag_ipv6.

This prevents nf_defrag_ipv6 module from being removed because ipv6 can't
be unloaded.

Remove the indirection and always use a direct call.  This creates a
depency from nf_conntrack_bridge to nf_defrag_ipv6 instead:

modinfo nf_conntrack
depends:        nf_conntrack,nf_defrag_ipv6,bridge

.. and nf_conntrack already depends on nf_defrag_ipv6 anyway.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-13 04:16:15 +02:00
Andrii Nakryiko
068d9d1eba bpf: Fix XDP FD-based attach/detach logic around XDP_FLAGS_UPDATE_IF_NOEXIST
Enforce XDP_FLAGS_UPDATE_IF_NOEXIST only if new BPF program to be attached is
non-NULL (i.e., we are not detaching a BPF program).

Fixes: d4baa9368a ("bpf, xdp: Extract common XDP program attachment logic")
Reported-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Andrii Nakryiko <andriin@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Tested-by: Stanislav Fomichev <sdf@google.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20200812022923.1217922-1-andriin@fb.com
2020-08-12 18:00:49 -07:00
David S. Miller
9643609423 Revert "ipv4: tunnel: fix compilation on ARCH=um"
This reverts commit 06a7a37be5.

The bug was already fixed, this added a dup include.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-12 13:26:37 -07:00
Eric Dumazet
2e0d8fef5f net: accept an empty mask in /sys/class/net/*/queues/rx-*/rps_cpus
We must accept an empty mask in store_rps_map(), or we are not able
to disable RPS on a queue.

Fixes: 07bbecb341 ("net: Restrict receive packets queuing to housekeeping CPUs")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Maciej Żenczykowski <maze@google.com>
Cc: Alex Belits <abelits@marvell.com>
Cc: Nitesh Narayan Lal <nitesh@redhat.com>
Cc: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Maciej Żenczykowski <maze@google.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Nitesh Narayan Lal <nitesh@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-12 13:18:16 -07:00
Johannes Berg
06a7a37be5 ipv4: tunnel: fix compilation on ARCH=um
With certain configurations, a 64-bit ARCH=um errors
out here with an unknown csum_ipv6_magic() function.
Include the right header file to always have it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-12 12:57:51 -07:00
Stefano Garzarella
1980c05844 vsock: fix potential null pointer dereference in vsock_poll()
syzbot reported this issue where in the vsock_poll() we find the
socket state at TCP_ESTABLISHED, but 'transport' is null:
  general protection fault, probably for non-canonical address 0xdffffc0000000012: 0000 [#1] PREEMPT SMP KASAN
  KASAN: null-ptr-deref in range [0x0000000000000090-0x0000000000000097]
  CPU: 0 PID: 8227 Comm: syz-executor.2 Not tainted 5.8.0-rc7-syzkaller #0
  Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
  RIP: 0010:vsock_poll+0x75a/0x8e0 net/vmw_vsock/af_vsock.c:1038
  Call Trace:
   sock_poll+0x159/0x460 net/socket.c:1266
   vfs_poll include/linux/poll.h:90 [inline]
   do_pollfd fs/select.c:869 [inline]
   do_poll fs/select.c:917 [inline]
   do_sys_poll+0x607/0xd40 fs/select.c:1011
   __do_sys_poll fs/select.c:1069 [inline]
   __se_sys_poll fs/select.c:1057 [inline]
   __x64_sys_poll+0x18c/0x440 fs/select.c:1057
   do_syscall_64+0x60/0xe0 arch/x86/entry/common.c:384
   entry_SYSCALL_64_after_hwframe+0x44/0xa9

This issue can happen if the TCP_ESTABLISHED state is set after we read
the vsk->transport in the vsock_poll().

We could put barriers to synchronize, but this can only happen during
connection setup, so we can simply check that 'transport' is valid.

Fixes: c0cfa2d8a7 ("vsock: add multi-transports support")
Reported-and-tested-by: syzbot+a61bac2fcc1a7c6623fe@syzkaller.appspotmail.com
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-12 12:56:06 -07:00
Linus Torvalds
7c2a69f610 Xiubo has completed his work on filesystem client metrics, they are
sent to all available MDSes once per second now.  Other than that, we
 have a lot of fixes and cleanups all around the filesystem, including
 a tweak to cut down on MDS request resends in multi-MDS setups from
 Yanhu and fixups for SELinux symlink labeling and MClientSession
 message decoding from Jeff.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAl80I3MTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzi7rlB/4/tegnfsB+O6MQPt+DkSxjQ6+90J/O
 aDj3jnwHoUBUitTP92wHdy/TanlZMX0UQqq4caWt1Ot0Rveb6XTfTIzEDtG2ejYQ
 dcH6LB3dS9Qvjrwh7oRawx8Xux/RdIIA99OvAX7K+4IeLKP5PjmwfdM1Mm0agfoS
 V8qXm8cGyqF2lWown092Znr8s6OSsDEaUbjB2tlz5msWeK9rSFD5bDYOBdOTkBCU
 5v5BIUwWWoU9YFMWDJgVNS5l2Scq5yTDEleZcCCJMovKNrhBi9DcwUDKrDQJ3+5v
 UmfI5M/L99Ng7yAcPqbeOZ1HibuOrdmqAoR6lCiD+PPEPrf99rVLRinQ
 =2NtI
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-5.9-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "Xiubo has completed his work on filesystem client metrics, they are
  sent to all available MDSes once per second now.

  Other than that, we have a lot of fixes and cleanups all around the
  filesystem, including a tweak to cut down on MDS request resends in
  multi-MDS setups from Yanhu and fixups for SELinux symlink labeling
  and MClientSession message decoding from Jeff"

* tag 'ceph-for-5.9-rc1' of git://github.com/ceph/ceph-client: (22 commits)
  ceph: handle zero-length feature mask in session messages
  ceph: use frag's MDS in either mode
  ceph: move sb->wb_pagevec_pool to be a global mempool
  ceph: set sec_context xattr on symlink creation
  ceph: remove redundant initialization of variable mds
  ceph: fix use-after-free for fsc->mdsc
  ceph: remove unused variables in ceph_mdsmap_decode()
  ceph: delete repeated words in fs/ceph/
  ceph: send client provided metric flags in client metadata
  ceph: periodically send perf metrics to MDSes
  ceph: check the sesion state and return false in case it is closed
  libceph: replace HTTP links with HTTPS ones
  ceph: remove unnecessary cast in kfree()
  libceph: just have osd_req_op_init() return a pointer
  ceph: do not access the kiocb after aio requests
  ceph: clean up and optimize ceph_check_delayed_caps()
  ceph: fix potential mdsc use-after-free crash
  ceph: switch to WARN_ON_ONCE in encode_supported_features()
  ceph: add global total_caps to count the mdsc's total caps number
  ceph: add check_session_state() helper and make it global
  ...
2020-08-12 12:51:31 -07:00
Tim Froidcoeur
d76f3351ce net: initialize fastreuse on inet_inherit_port
In the case of TPROXY, bind_conflict optimizations for SO_REUSEADDR or
SO_REUSEPORT are broken, possibly resulting in O(n) instead of O(1) bind
behaviour or in the incorrect reuse of a bind.

the kernel keeps track for each bind_bucket if all sockets in the
bind_bucket support SO_REUSEADDR or SO_REUSEPORT in two fastreuse flags.
These flags allow skipping the costly bind_conflict check when possible
(meaning when all sockets have the proper SO_REUSE option).

For every socket added to a bind_bucket, these flags need to be updated.
As soon as a socket that does not support reuse is added, the flag is
set to false and will never go back to true, unless the bind_bucket is
deleted.

Note that there is no mechanism to re-evaluate these flags when a socket
is removed (this might make sense when removing a socket that would not
allow reuse; this leaves room for a future patch).

For this optimization to work, it is mandatory that these flags are
properly initialized and updated.

When a child socket is created from a listen socket in
__inet_inherit_port, the TPROXY case could create a new bind bucket
without properly initializing these flags, thus preventing the
optimization to work. Alternatively, a socket not allowing reuse could
be added to an existing bind bucket without updating the flags, causing
bind_conflict to never be called as it should.

Call inet_csk_update_fastreuse when __inet_inherit_port decides to create
a new bind_bucket or use a different bind_bucket than the one of the
listen socket.

Fixes: 093d282321 ("tproxy: fix hash locking issue when using port redirection in __inet_inherit_port()")
Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Tim Froidcoeur <tim.froidcoeur@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-11 15:49:08 -07:00
Tim Froidcoeur
62ffc589ab net: refactor bind_bucket fastreuse into helper
Refactor the fastreuse update code in inet_csk_get_port into a small
helper function that can be called from other places.

Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Tim Froidcoeur <tim.froidcoeur@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-11 15:49:08 -07:00
Miaohe Lin
0f5907af39 net: Fix potential memory leak in proto_register()
If we failed to assign proto idx, we free the twsk_slab_name but forget to
free the twsk_slab. Add a helper function tw_prot_cleanup() to free these
together and also use this helper function in proto_unregister().

Fixes: b45ce32135 ("sock: fix potential memory leak in proto_register()")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-11 15:36:14 -07:00
Qingyu Li
26896f0146 net/nfc/rawsock.c: add CAP_NET_RAW check.
When creating a raw AF_NFC socket, CAP_NET_RAW needs to be checked first.

Signed-off-by: Qingyu Li <ieatmuttonchuan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-11 10:34:30 -07:00
Ira Weiny
b06c19d9f8 net/tls: Fix kmap usage
When MSG_OOB is specified to tls_device_sendpage() the mapped page is
never unmapped.

Hold off mapping the page until after the flags are checked and the page
is actually needed.

Fixes: e8f6979981 ("net/tls: Add generic NIC offload infrastructure")
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-11 10:20:34 -07:00
Linus Torvalds
97d052ea3f A set of locking fixes and updates:
- Untangle the header spaghetti which causes build failures in various
     situations caused by the lockdep additions to seqcount to validate that
     the write side critical sections are non-preemptible.
 
   - The seqcount associated lock debug addons which were blocked by the
     above fallout.
 
     seqcount writers contrary to seqlock writers must be externally
     serialized, which usually happens via locking - except for strict per
     CPU seqcounts. As the lock is not part of the seqcount, lockdep cannot
     validate that the lock is held.
 
     This new debug mechanism adds the concept of associated locks.
     sequence count has now lock type variants and corresponding
     initializers which take a pointer to the associated lock used for
     writer serialization. If lockdep is enabled the pointer is stored and
     write_seqcount_begin() has a lockdep assertion to validate that the
     lock is held.
 
     Aside of the type and the initializer no other code changes are
     required at the seqcount usage sites. The rest of the seqcount API is
     unchanged and determines the type at compile time with the help of
     _Generic which is possible now that the minimal GCC version has been
     moved up.
 
     Adding this lockdep coverage unearthed a handful of seqcount bugs which
     have been addressed already independent of this.
 
     While generaly useful this comes with a Trojan Horse twist: On RT
     kernels the write side critical section can become preemtible if the
     writers are serialized by an associated lock, which leads to the well
     known reader preempts writer livelock. RT prevents this by storing the
     associated lock pointer independent of lockdep in the seqcount and
     changing the reader side to block on the lock when a reader detects
     that a writer is in the write side critical section.
 
  - Conversion of seqcount usage sites to associated types and initializers.
 -----BEGIN PGP SIGNATURE-----
 
 iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAl8xmPYTHHRnbHhAbGlu
 dXRyb25peC5kZQAKCRCmGPVMDXSYoTuQEACyzQCjU8PgehPp9oMqWzaX2fcVyuZO
 QU2yw6gmz2oTz3ZHUNwdW8UnzGh2OWosK3kDruoD9FtSS51lER1/ISfSPCGfyqxC
 KTjOcB1Kvxwq/3LcCx7Zi3ZxWApat74qs3EhYhKtEiQ2Y9xv9rLq8VV1UWAwyxq0
 eHpjlIJ6b6rbt+ARslaB7drnccOsdK+W/roNj4kfyt+gezjBfojGRdMGQNMFcpnv
 shuTC+vYurAVIiVA/0IuizgHfwZiXOtVpjVoEWaxg6bBH6HNuYMYzdSa/YrlDkZs
 n/aBI/Xkvx+Eacu8b1Zwmbzs5EnikUK/2dMqbzXKUZK61eV4hX5c2xrnr1yGWKTs
 F/juh69Squ7X6VZyKVgJ9RIccVueqwR2EprXWgH3+RMice5kjnXH4zURp0GHALxa
 DFPfB6fawcH3Ps87kcRFvjgm6FBo0hJ1AxmsW1dY4ACFB9azFa2euW+AARDzHOy2
 VRsUdhL9CGwtPjXcZ/9Rhej6fZLGBXKr8uq5QiMuvttp4b6+j9FEfBgD4S6h8csl
 AT2c2I9LcbWqyUM9P4S7zY/YgOZw88vHRuDH7tEBdIeoiHfrbSBU7EQ9jlAKq/59
 f+Htu2Io281c005g7DEeuCYvpzSYnJnAitj5Lmp/kzk2Wn3utY1uIAVszqwf95Ul
 81ppn2KlvzUK8g==
 =7Gj+
 -----END PGP SIGNATURE-----

Merge tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

Pull locking updates from Thomas Gleixner:
 "A set of locking fixes and updates:

   - Untangle the header spaghetti which causes build failures in
     various situations caused by the lockdep additions to seqcount to
     validate that the write side critical sections are non-preemptible.

   - The seqcount associated lock debug addons which were blocked by the
     above fallout.

     seqcount writers contrary to seqlock writers must be externally
     serialized, which usually happens via locking - except for strict
     per CPU seqcounts. As the lock is not part of the seqcount, lockdep
     cannot validate that the lock is held.

     This new debug mechanism adds the concept of associated locks.
     sequence count has now lock type variants and corresponding
     initializers which take a pointer to the associated lock used for
     writer serialization. If lockdep is enabled the pointer is stored
     and write_seqcount_begin() has a lockdep assertion to validate that
     the lock is held.

     Aside of the type and the initializer no other code changes are
     required at the seqcount usage sites. The rest of the seqcount API
     is unchanged and determines the type at compile time with the help
     of _Generic which is possible now that the minimal GCC version has
     been moved up.

     Adding this lockdep coverage unearthed a handful of seqcount bugs
     which have been addressed already independent of this.

     While generally useful this comes with a Trojan Horse twist: On RT
     kernels the write side critical section can become preemtible if
     the writers are serialized by an associated lock, which leads to
     the well known reader preempts writer livelock. RT prevents this by
     storing the associated lock pointer independent of lockdep in the
     seqcount and changing the reader side to block on the lock when a
     reader detects that a writer is in the write side critical section.

   - Conversion of seqcount usage sites to associated types and
     initializers"

* tag 'locking-urgent-2020-08-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
  locking/seqlock, headers: Untangle the spaghetti monster
  locking, arch/ia64: Reduce <asm/smp.h> header dependencies by moving XTP bits into the new <asm/xtp.h> header
  x86/headers: Remove APIC headers from <asm/smp.h>
  seqcount: More consistent seqprop names
  seqcount: Compress SEQCNT_LOCKNAME_ZERO()
  seqlock: Fold seqcount_LOCKNAME_init() definition
  seqlock: Fold seqcount_LOCKNAME_t definition
  seqlock: s/__SEQ_LOCKDEP/__SEQ_LOCK/g
  hrtimer: Use sequence counter with associated raw spinlock
  kvm/eventfd: Use sequence counter with associated spinlock
  userfaultfd: Use sequence counter with associated spinlock
  NFSv4: Use sequence counter with associated spinlock
  iocost: Use sequence counter with associated spinlock
  raid5: Use sequence counter with associated spinlock
  vfs: Use sequence counter with associated spinlock
  timekeeping: Use sequence counter with associated raw spinlock
  xfrm: policy: Use sequence counters with associated lock
  netfilter: nft_set_rbtree: Use sequence counter with associated rwlock
  netfilter: conntrack: Use sequence counter with associated spinlock
  sched: tasks: Use sequence counter with associated spinlock
  ...
2020-08-10 19:07:44 -07:00
Jason Baron
f19008e676 tcp: correct read of TFO keys on big endian systems
When TFO keys are read back on big endian systems either via the global
sysctl interface or via getsockopt() using TCP_FASTOPEN_KEY, the values
don't match what was written.

For example, on s390x:

# echo "1-2-3-4" > /proc/sys/net/ipv4/tcp_fastopen_key
# cat /proc/sys/net/ipv4/tcp_fastopen_key
02000000-01000000-04000000-03000000

Instead of:

# cat /proc/sys/net/ipv4/tcp_fastopen_key
00000001-00000002-00000003-00000004

Fix this by converting to the correct endianness on read. This was
reported by Colin Ian King when running the 'tcp_fastopen_backup_key' net
selftest on s390x, which depends on the read value matching what was
written. I've confirmed that the test now passes on big and little endian
systems.

Signed-off-by: Jason Baron <jbaron@akamai.com>
Fixes: 438ac88009 ("net: fastopen: robustness and endianness fixes for SipHash")
Cc: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Cc: Eric Dumazet <edumazet@google.com>
Reported-and-tested-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-10 12:12:35 -07:00
Christoph Hellwig
519a8a6cf9 net: Revert "net: optimize the sockptr_t for unified kernel/user address spaces"
This reverts commits 6d04fe15f7 and
a31edb2059.

It turns out the idea to share a single pointer for both kernel and user
space address causes various kinds of problems.  So use the slightly less
optimal version that uses an extra bit, but which is guaranteed to be safe
everywhere.

Fixes: 6d04fe15f7 ("net: optimize the sockptr_t for unified kernel/user address spaces")
Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: John Stultz <john.stultz@linaro.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-10 12:06:44 -07:00
Florian Westphal
2f941622fd netfilter: nft_compat: remove flush counter optimization
WARNING: CPU: 1 PID: 16059 at lib/refcount.c:31 refcount_warn_saturate+0xdf/0xf
[..]
 __nft_mt_tg_destroy+0x42/0x50 [nft_compat]
 nft_target_destroy+0x63/0x80 [nft_compat]
 nf_tables_expr_destroy+0x1b/0x30 [nf_tables]
 nf_tables_rule_destroy+0x3a/0x70 [nf_tables]
 nf_tables_exit_net+0x186/0x3d0 [nf_tables]

Happens when a compat expr is destoyed from abort path.
There is no functional impact; after this work queue is flushed
unconditionally if its pending.

This removes the waitcount optimization.  Test of repeated
iptables-restore of a ~60k kubernetes ruleset doesn't indicate
a slowdown.  In case the counter is needed after all for some workloads
we can revert this and increment the refcount for the
!= NFT_PREPARE_TRANS case to avoid the increment/decrement imbalance.

While at it, also flush for match case, this was an oversight
in the original patch.

Fixes: ffe8923f10 ("netfilter: nft_compat: make sure xtables destructors have run")
Reported-by: kernel test robot <rong.a.chen@intel.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-10 13:03:36 +02:00
Stephen Suryaputra
b428336676 netfilter: nf_tables: nft_exthdr: the presence return value should be little-endian
On big-endian machine, the returned register data when the exthdr is
present is not being compared correctly because little-endian is
assumed. The function nft_cmp_fast_mask(), called by nft_cmp_fast_eval()
and nft_cmp_fast_init(), calls cpu_to_le32().

The following dump also shows that little endian is assumed:

$ nft --debug=netlink add rule ip recordroute forward ip option rr exists counter
ip
  [ exthdr load ipv4 1b @ 7 + 0 present => reg 1 ]
  [ cmp eq reg 1 0x01000000 ]
  [ counter pkts 0 bytes 0 ]

Lastly, debug print in nft_cmp_fast_init() and nft_cmp_fast_eval() when
RR option exists in the packet shows that the comparison fails because
the assumption:

nft_cmp_fast_init:189 priv->sreg=4 desc.len=8 mask=0xff000000 data.data[0]=0x10003e0
nft_cmp_fast_eval:57 regs->data[priv->sreg=4]=0x1 mask=0xff000000 priv->data=0x1000000

v2: use nft_reg_store8() instead (Florian Westphal). Also to avoid the
    warnings reported by kernel test robot.

Fixes: dbb5281a1f ("netfilter: nf_tables: add support for matching IPv4 options")
Fixes: c078ca3b0c ("netfilter: nft_exthdr: Add support for existence check")
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-08-10 13:02:43 +02:00
Linus Torvalds
7a6b60441f Highlights:
- Support for user extended attributes on NFS (RFC 8276)
 - Further reduce unnecessary NFSv4 delegation recalls
 
 Notable fixes:
 
 - Fix recent krb5p regression
 - Address a few resource leaks and a rare NULL dereference
 
 Other:
 
 - De-duplicate RPC/RDMA error handling and other utility functions
 - Replace storage and display of kernel memory addresses by tracepoints
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAl8oBt0ACgkQM2qzM29m
 f5dTFQ/9H72E6gr1onsia0/Py0CO8F9qzLgmUBl1vVYAh2/vPqUL1ypxrC5OYrAy
 TOqESTsJvmGluCFc/77XUTD7NvJY3znIWim49okwDiyee4Y14ZfRhhCxyyA6Z94E
 FjJQb5TbF1Mti4X3dN8Gn7O1Y/BfTjDAAXnXGlTA1xoLcxM5idWIj+G8x0bPmeDb
 2fTbgsoETu6MpS2/L6mraXVh3d5ESOJH+73YvpBl0AhYPzlNASJZMLtHtd+A/JbO
 IPkMP/7UA5DuJtWGeuQ4I4D5bQNpNWMfN6zhwtih4IV5bkRC7vyAOLG1R7w9+Ufq
 58cxPiorMcsg1cHnXG0Z6WVtbMEdWTP/FzmJdE5RC7DEJhmmSUG/R0OmgDcsDZET
 GovPARho01yp80GwTjCIctDHRRFRL4pdPfr8PjVHetSnx9+zoRUT+D70Zeg/KSy2
 99gmCxqSY9BZeHoiVPEX/HbhXrkuDjUSshwl98OAzOFmv6kbwtLntgFbWlBdE6dB
 mqOxBb73zEoZ5P9GA2l2ShU3GbzMzDebHBb9EyomXHZrLejoXeUNA28VJ+8vPP5S
 IVHnEwOkdJrNe/7cH4jd/B0NR6f8Da/F9kmkLiG2GNPMqQ8bnVhxTUtZkcAE+fd4
 f34qLxsoht70wSSfISjBs7hP5KxEM1lOAf0w0RpycPUKJNV1FB0=
 =OEpF
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-5.9' of git://git.linux-nfs.org/projects/cel/cel-2.6

Pull NFS server updates from Chuck Lever:
 "Highlights:
   - Support for user extended attributes on NFS (RFC 8276)
   - Further reduce unnecessary NFSv4 delegation recalls

  Notable fixes:
   - Fix recent krb5p regression
   - Address a few resource leaks and a rare NULL dereference

  Other:
   - De-duplicate RPC/RDMA error handling and other utility functions
   - Replace storage and display of kernel memory addresses by tracepoints"

* tag 'nfsd-5.9' of git://git.linux-nfs.org/projects/cel/cel-2.6: (38 commits)
  svcrdma: CM event handler clean up
  svcrdma: Remove transport reference counting
  svcrdma: Fix another Receive buffer leak
  SUNRPC: Refresh the show_rqstp_flags() macro
  nfsd: netns.h: delete a duplicated word
  SUNRPC: Fix ("SUNRPC: Add "@len" parameter to gss_unwrap()")
  nfsd: avoid a NULL dereference in __cld_pipe_upcall()
  nfsd4: a client's own opens needn't prevent delegations
  nfsd: Use seq_putc() in two functions
  svcrdma: Display chunk completion ID when posting a rw_ctxt
  svcrdma: Record send_ctxt completion ID in trace_svcrdma_post_send()
  svcrdma: Introduce Send completion IDs
  svcrdma: Record Receive completion ID in svc_rdma_decode_rqst
  svcrdma: Introduce Receive completion IDs
  svcrdma: Introduce infrastructure to support completion IDs
  svcrdma: Add common XDR encoders for RDMA and Read segments
  svcrdma: Add common XDR decoders for RDMA and Read segments
  SUNRPC: Add helpers for decoding list discriminators symbolically
  svcrdma: Remove declarations for functions long removed
  svcrdma: Clean up trace_svcrdma_send_failed() tracepoint
  ...
2020-08-09 13:58:04 -07:00
Miaohe Lin
7c7ab580db net: Convert to use the fallthrough macro
Convert the uses of fallthrough comments to fallthrough macro.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-08 14:29:09 -07:00
Miaohe Lin
11f920d2aa net: Use helper function ip_is_fragment()
Use helper function ip_is_fragment() to check ip fragment.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-08 14:24:53 -07:00
Miaohe Lin
47260ba937 net: Remove meaningless jump label out_fs
The out_fs jump label has nothing to do but goto out.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-08 14:23:21 -07:00
Miaohe Lin
ce787a5a07 net: Set fput_needed iff FDPUT_FPUT is set
We should fput() file iff FDPUT_FPUT is set. So we should set fput_needed
accordingly.

Fixes: 00e188ef6a ("sockfd_lookup_light(): switch to fdget^W^Waway from fget_light")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-08 14:21:42 -07:00
Miaohe Lin
6b07edebe6 net: Use helper function fdput()
Use helper function fdput() to fput() the file iff FDPUT_FPUT is set.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-08 14:19:16 -07:00
YueHaibing
61ee4137b5 ip_vti: Fix unused variable warning
If CONFIG_INET_XFRM_TUNNEL is set but CONFIG_IPV6 is n,

net/ipv4/ip_vti.c:493:27: warning: 'vti_ipip6_handler' defined but not used [-Wunused-variable]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-08-08 08:17:22 +02:00
Rouven Czerwinski
1c3b63f155 net/tls: allow MSG_CMSG_COMPAT in sendmsg
Trying to use ktls on a system with 32-bit userspace and 64-bit kernel
results in a EOPNOTSUPP message during sendmsg:

  setsockopt(3, SOL_TLS, TLS_TX, …, 40) = 0
  sendmsg(3, …, msg_flags=0}, 0) = -1 EOPNOTSUPP (Operation not supported)

The tls_sw implementation does strict flag checking and does not allow
the MSG_CMSG_COMPAT flag, which is set if the message comes in through
the compat syscall.

This patch adds MSG_CMSG_COMPAT to the flag check to allow the usage of
the TLS SW implementation on systems using the compat syscall path.

Note that the same check is present in the sendmsg path for the TLS
device implementation, however the flag hasn't been added there for lack
of testing hardware.

Signed-off-by: Rouven Czerwinski <r.czerwinski@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-07 17:40:45 -07:00
David S. Miller
64cae2fb48 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-08-08

The following pull-request contains BPF updates for your *net* tree.

We've added 11 non-merge commits during the last 2 day(s) which contain
a total of 24 files changed, 216 insertions(+), 135 deletions(-).

The main changes are:

1) Fix UAPI for BPF map iterator before it gets frozen to allow for more
   extensions/customization in future, from Yonghong Song.

2) Fix selftests build to undo verbose build output, from Andrii Nakryiko.

3) Fix inlining compilation error on bpf_do_trace_printk() due to variable
   argument lists, from Stanislav Fomichev.

4) Fix an uninitialized pointer warning at btf__parse_raw() in libbpf,
   from Daniel T. Lee.

5) Fix several compilation warnings in selftests with regards to ignoring
   return value, from Jianlin Lv.

6) Fix interruptions by switching off timeout for BPF tests, from Jiri Benc.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-07 17:33:08 -07:00
Paolo Abeni
7ee2492635 mptcp: fix warn at shutdown time for unaccepted msk sockets
With commit b93df08ccd ("mptcp: explicitly track the fully
established status"), the status of unaccepted mptcp closed in
mptcp_sock_destruct() changes from TCP_SYN_RECV to TCP_ESTABLISHED.

As a result mptcp_sock_destruct() does not perform the proper
cleanup and inet_sock_destruct() will later emit a warn.

Address the issue updating the condition tested in mptcp_sock_destruct().
Also update the related comment.

Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/66
Reported-and-tested-by: Christoph Paasch <cpaasch@apple.com>
Fixes: b93df08ccd ("mptcp: explicitly track the fully established status")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-07 17:26:16 -07:00
Linus Torvalds
1fa2c0a0c8 Fix SCM_RIGHTS compat mode
-----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8trzgWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJqrDEACRlxrvLRsMX5MCdZ3SQnnHkGbp
 7C/OP331cRe0HnAp1B3P8VhP/Y81fDiCnrP9cb7yoeMASCvCUhVQlMZpLVpppOUE
 NK0XQdjMlHJGh9qTcgWGd49KpFOFFYIeiv4NaddlmScb8Q+ZkMKiQWURNR65iM8c
 QLcpfryZEZd7eEZdH8Xtn6ItrYEYTldXSC8oc+e3ev0HCg0D7E4nBaqp1JzHT9I2
 Zku98lyap2sxSBZlTKVppowZ7RPqViwRkvGMofWE7brbJ2+H89YMXhj0ZMFQ8+8z
 htbG6KCWUP7kdUwcyuBgUX8/+cGMsb9R8U34ktxO7GWq1stHAph4e59kJ0cRKjNp
 v7vFL5nA6kvgYjZ9Y7dQoeKQWFAcN7qK2/DR/11ECx21pHbR/6LgyPDTYYgN88Bp
 wcbNgPr1AI7bxu/FKe+6a+qgIjV5Ukb4SxNKU7pvkKKSDCzhMxGGCRY9Pmip2gmv
 EHKOox7QgONGhvuTrE2CYr5hAO5alKbciTXnofAmCpi1ddTrpv5wVDDSSJXK9xNe
 uZjMvYGGnJWw8KLZiqbC6P1k92Lm8hfFzPuLBuuUILrG5MZsAiLGIOleXTaaehPS
 ZRPE1h2A95lFulOPXdX0fdQkWeVSf/IkBNxA3Uz/xM7QaF75n83VypFxVfEFzmos
 7R14u32ySUeuZX0/wg==
 =5TFR
 -----END PGP SIGNATURE-----

Merge tag 'seccomp-v5.9-rc1-fix1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp fix from Kees Cook:
 "This fixes my typo in the SCM_RIGHTS refactoring that broke compat
  handling.

  Thanks to Thadeu Lima de Souza Cascardo for tracking it down, and to
  Christian Zigotzky and Alex Xu for their reports"

* tag 'seccomp-v5.9-rc1-fix1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  net/scm: Fix typo in SCM_RIGHTS compat refactoring
2020-08-07 13:16:22 -07:00
Kees Cook
16b89f6953 net/scm: Fix typo in SCM_RIGHTS compat refactoring
When refactoring the SCM_RIGHTS code, I accidentally mis-merged my
native/compat diffs, which entirely broke using SCM_RIGHTS in compat
mode. Use the correct helper.

Reported-by: Christian Zigotzky <chzigotzky@xenosoft.de>
Link: https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-August/216156.html
Reported-by: "Alex Xu (Hello71)" <alex_y_xu@yahoo.ca>
Link: https://lore.kernel.org/lkml/1596812929.lz7fuo8r2w.none@localhost/
Suggested-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Fixes: c0029de509 ("net/scm: Regularize compat handling of scm_detach_fds()")
Tested-by: Alex Xu (Hello71) <alex_y_xu@yahoo.ca>
Acked-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Signed-off-by: Kees Cook <keescook@chromium.org>
2020-08-07 12:43:25 -07:00
Linus Torvalds
81e11336d9 Merge branch 'akpm' (patches from Andrew)
Merge misc updates from Andrew Morton:

 - a few MM hotfixes

 - kthread, tools, scripts, ntfs and ocfs2

 - some of MM

Subsystems affected by this patch series: kthread, tools, scripts, ntfs,
ocfs2 and mm (hofixes, pagealloc, slab-generic, slab, slub, kcsan,
debug, pagecache, gup, swap, shmem, memcg, pagemap, mremap, mincore,
sparsemem, vmalloc, kasan, pagealloc, hugetlb and vmscan).

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (162 commits)
  mm: vmscan: consistent update to pgrefill
  mm/vmscan.c: fix typo
  khugepaged: khugepaged_test_exit() check mmget_still_valid()
  khugepaged: retract_page_tables() remember to test exit
  khugepaged: collapse_pte_mapped_thp() protect the pmd lock
  khugepaged: collapse_pte_mapped_thp() flush the right range
  mm/hugetlb: fix calculation of adjust_range_if_pmd_sharing_possible
  mm: thp: replace HTTP links with HTTPS ones
  mm/page_alloc: fix memalloc_nocma_{save/restore} APIs
  mm/page_alloc.c: skip setting nodemask when we are in interrupt
  mm/page_alloc: fallbacks at most has 3 elements
  mm/page_alloc: silence a KASAN false positive
  mm/page_alloc.c: remove unnecessary end_bitidx for [set|get]_pfnblock_flags_mask()
  mm/page_alloc.c: simplify pageblock bitmap access
  mm/page_alloc.c: extract the common part in pfn_to_bitidx()
  mm/page_alloc.c: replace the definition of NR_MIGRATETYPE_BITS with PB_migratetype_bits
  mm/shuffle: remove dynamic reconfiguration
  mm/memory_hotplug: document why shuffle_zone() is relevant
  mm/page_alloc: remove nr_free_pagecache_pages()
  mm: remove vm_total_pages
  ...
2020-08-07 11:39:33 -07:00
Waiman Long
453431a549 mm, treewide: rename kzfree() to kfree_sensitive()
As said by Linus:

  A symmetric naming is only helpful if it implies symmetries in use.
  Otherwise it's actively misleading.

  In "kzalloc()", the z is meaningful and an important part of what the
  caller wants.

  In "kzfree()", the z is actively detrimental, because maybe in the
  future we really _might_ want to use that "memfill(0xdeadbeef)" or
  something. The "zero" part of the interface isn't even _relevant_.

The main reason that kzfree() exists is to clear sensitive information
that should not be leaked to other future users of the same memory
objects.

Rename kzfree() to kfree_sensitive() to follow the example of the recently
added kvfree_sensitive() and make the intention of the API more explicit.
In addition, memzero_explicit() is used to clear the memory to make sure
that it won't get optimized away by the compiler.

The renaming is done by using the command sequence:

  git grep -w --name-only kzfree |\
  xargs sed -i 's/kzfree/kfree_sensitive/'

followed by some editing of the kfree_sensitive() kerneldoc and adding
a kzfree backward compatibility macro in slab.h.

[akpm@linux-foundation.org: fs/crypto/inline_crypt.c needs linux/slab.h]
[akpm@linux-foundation.org: fix fs/crypto/inline_crypt.c some more]

Suggested-by: Joe Perches <joe@perches.com>
Signed-off-by: Waiman Long <longman@redhat.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Acked-by: David Howells <dhowells@redhat.com>
Acked-by: Michal Hocko <mhocko@suse.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Jarkko Sakkinen <jarkko.sakkinen@linux.intel.com>
Cc: James Morris <jmorris@namei.org>
Cc: "Serge E. Hallyn" <serge@hallyn.com>
Cc: Joe Perches <joe@perches.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Dan Carpenter <dan.carpenter@oracle.com>
Cc: "Jason A . Donenfeld" <Jason@zx2c4.com>
Link: http://lkml.kernel.org/r/20200616154311.12314-3-longman@redhat.com
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2020-08-07 11:33:22 -07:00
Linus Torvalds
86cfccb669 dlm for 5.9
This set includes a some improvements to the dlm
 networking layer: improving the ability to trace
 dlm messages for debugging, and improved handling
 of bad messages or disrupted connections.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJfLCPxAAoJEDgbc8f8gGmqz04P/2hvv/4rXo9AOgnnstvZV1Qy
 Yo01Cy807vB1c3jhIJryM2gG61GNH22RAHc2NcfjJwy04HH/1IEr6P48Po3qYEnS
 8fZ8B9msxpsujVOrRoeBuLN8elI1HftyNVWaVjH7xtD+fLCDLu9i10kv3aeS+DiB
 T6f7yQQv7hgXS3xGvlMr2//aLwGD2ZdcRbkOEGo+k7yUjQbIDH/wdZWcPLh6y4yT
 p20i2ulYKjEZFmXDMa17diONISeGO6iaDhee24XPDwNDp8qI1iPGJsmxltMmn8Qf
 d2HPF1IDh4eM8lCwmqBtjYTnJd6rAW0v3+Ek1+wzQKVeXLFiz/MEyuOldtpsqmMO
 8Og0vr6zfTCjFo8uvyj+cF7Fcj0yIPWg1yb7EauqqxreK8V9GBA1V2ZXYVd8xwea
 thrAUaq8f+PYQ9uy1FsN3xaO3BFN1VpcvHu4/3gU3OudnZZt2Ae670RYHKC0bq8D
 2tSsqaiDnlvniHgh4xvtNIvRANkDS1ZSbkUPZhMHL7DnRJn66oDIfCr7NMbZwvCa
 AS0q6suUFyXFbAEJcY6XWxe3aQ3WuxIClT84MgzX/dAK2Qcl8ryWGGSVc0dp4Vl1
 cd8MtmpnIWsnxqNRl4jn6cfolDheaxL8nouLtJ+3/dC9VkyDyfmrtnM+8aTZKHoa
 3/xrBuVkEJAwkAAr8Pb8
 =qgti
 -----END PGP SIGNATURE-----

Merge tag 'dlm-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm

Pull dlm updates from David Teigland:
 "This set includes a some improvements to the dlm networking layer:
  improving the ability to trace dlm messages for debugging, and
  improved handling of bad messages or disrupted connections"

* tag 'dlm-5.9' of git://git.kernel.org/pub/scm/linux/kernel/git/teigland/linux-dlm:
  fs: dlm: implement tcp graceful shutdown
  fs: dlm: change handling of reconnects
  fs: dlm: don't close socket on invalid message
  fs: dlm: set skb mark per peer socket
  fs: dlm: set skb mark for listen socket
  net: sock: add sock_set_mark
  dlm: Fix kobject memleak
2020-08-06 19:44:25 -07:00
Linus Torvalds
96e3f3c16b - Add support to enable/disable the thermal zones resulting on core code and
drivers cleanup (Andrzej Pietrasiewicz)
 
 - Add generic netlink support for userspace notifications: events, temperature
   and discovery commands (Daniel Lezcano)
 
 - Fix redundant initialization for a ret variable (Colin Ian King)
 
 - Remove the clock cooling code as it is used nowhere (Amit Kucheria)
 
 - Add the rcar_gen3_thermal's r8a774e1 support (Marian-Cristian Rotariu)
 
 - Replace all references to thermal.txt in the documentation to the
   corresponding yaml files (Amit Kucheria)
 
 - Add maintainer entry for the IPA (Lukasz Luba)
 
 - Add support for MSM8939 for the tsens (Shawn Guo)
 
 - Update power allocator and devfreq cooling to SPDX licensing (Lukasz Luba)
 
 - Add Cannon Lake Low Power PCH support (Sumeet Pawnikar)
 
 - Add tsensor support for V2 mediatek thermal system (Henry Yen)
 
 - Fix thermal zone lookup by ID for the core code (Thierry Reding)
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGn3N4YVz0WNVyHskqDIjiipP6E8FAl8q7tsACgkQqDIjiipP
 6E+5Rwf7BFEn5YXPvng8cmnAlgvEBc9DdT6mGSo0NpFm9MdUxXlaqvw3WWSGyqWQ
 +z0Ka7lmn5XyiMsVN11++Snp+79X17HzZf9SXO3glyIpAn+5prTDRhzzj0/jPrtS
 sEeI++DrILsKKMGVljzftLmwNJN9DkUDNcnmWmZdCDbYVEKtP9Pjf2wBjAnXj7sX
 JA3CkHRMwYLEQbfaKz37M11cYM+LqbDOlb6U11YWgAGGJ7d7zNYRf2/YSYPM4AN6
 iE6j0E+3jIlXesULsap1AzeJaBq+wFxj1FL2TUZ8KscvRrm3AucqzNAT2M/Bc5Az
 XLKKzc6Gp9JfqB5KXhX2EDu7VRnDBg==
 =cSMN
 -----END PGP SIGNATURE-----

Merge tag 'thermal-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux

Pull thermal updates from Daniel Lezcano:

 - Add support to enable/disable the thermal zones resulting on core
   code and drivers cleanup (Andrzej Pietrasiewicz)

 - Add generic netlink support for userspace notifications: events,
   temperature and discovery commands (Daniel Lezcano)

 - Fix redundant initialization for a ret variable (Colin Ian King)

 - Remove the clock cooling code as it is used nowhere (Amit Kucheria)

 - Add the rcar_gen3_thermal's r8a774e1 support (Marian-Cristian
   Rotariu)

 - Replace all references to thermal.txt in the documentation to the
   corresponding yaml files (Amit Kucheria)

 - Add maintainer entry for the IPA (Lukasz Luba)

 - Add support for MSM8939 for the tsens (Shawn Guo)

 - Update power allocator and devfreq cooling to SPDX licensing (Lukasz
   Luba)

 - Add Cannon Lake Low Power PCH support (Sumeet Pawnikar)

 - Add tsensor support for V2 mediatek thermal system (Henry Yen)

 - Fix thermal zone lookup by ID for the core code (Thierry Reding)

* tag 'thermal-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/thermal/linux: (40 commits)
  thermal: intel: intel_pch_thermal: Add Cannon Lake Low Power PCH support
  thermal: mediatek: Add tsensor support for V2 thermal system
  thermal: mediatek: Prepare to add support for other platforms
  thermal: Update power allocator and devfreq cooling to SPDX licensing
  MAINTAINERS: update entry to thermal governors file name prefixing
  thermal: core: Add thermal zone enable/disable notification
  thermal: qcom: tsens-v0_1: Add support for MSM8939
  dt-bindings: tsens: qcom: Document MSM8939 compatible
  thermal: core: Fix thermal zone lookup by ID
  thermal: int340x: processor_thermal: fix: update Jasper Lake PCI id
  thermal: imx8mm: Support module autoloading
  thermal: ti-soc-thermal: Fix reversed condition in ti_thermal_expose_sensor()
  MAINTAINERS: Add maintenance information for IPA
  thermal: rcar_gen3_thermal: Do not shadow thcode variable
  dt-bindings: thermal: Get rid of thermal.txt and replace references
  thermal: core: Move initialization after core initcall
  thermal: netlink: Improve the initcall ordering
  net: genetlink: Move initialization to core_initcall
  thermal: rcar_gen3_thermal: Add r8a774e1 support
  thermal/drivers/clock_cooling: Remove clock_cooling code
  ...
2020-08-06 18:10:55 -07:00
Yonghong Song
5e7b30205c bpf: Change uapi for bpf iterator map elements
Commit a5cbe05a66 ("bpf: Implement bpf iterator for
map elements") added bpf iterator support for
map elements. The map element bpf iterator requires
info to identify a particular map. In the above
commit, the attr->link_create.target_fd is used
to carry map_fd and an enum bpf_iter_link_info
is added to uapi to specify the target_fd actually
representing a map_fd:
    enum bpf_iter_link_info {
	BPF_ITER_LINK_UNSPEC = 0,
	BPF_ITER_LINK_MAP_FD = 1,

	MAX_BPF_ITER_LINK_INFO,
    };

This is an extensible approach as we can grow
enumerator for pid, cgroup_id, etc. and we can
unionize target_fd for pid, cgroup_id, etc.
But in the future, there are chances that
more complex customization may happen, e.g.,
for tasks, it could be filtered based on
both cgroup_id and user_id.

This patch changed the uapi to have fields
	__aligned_u64	iter_info;
	__u32		iter_info_len;
for additional iter_info for link_create.
The iter_info is defined as
	union bpf_iter_link_info {
		struct {
			__u32   map_fd;
		} map;
	};

So future extension for additional customization
will be easier. The bpf_iter_link_info will be
passed to target callback to validate and generic
bpf_iter framework does not need to deal it any
more.

Note that map_fd = 0 will be considered invalid
and -EBADF will be returned to user space.

Fixes: a5cbe05a66 ("bpf: Implement bpf iterator for map elements")
Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200805055056.1457463-1-yhs@fb.com
2020-08-06 16:39:14 -07:00
Alexander Aring
84d1c61740 net: sock: add sock_set_mark
This patch adds a new socket helper function to set the mark value for a
kernel socket.

Signed-off-by: Alexander Aring <aahringo@redhat.com>
Signed-off-by: David Teigland <teigland@redhat.com>
2020-08-06 10:30:50 -05:00
Linus Torvalds
47ec5303d7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan.

 2) Support UDP segmentation in code TSO code, from Eric Dumazet.

 3) Allow flashing different flash images in cxgb4 driver, from Vishal
    Kulkarni.

 4) Add drop frames counter and flow status to tc flower offloading,
    from Po Liu.

 5) Support n-tuple filters in cxgb4, from Vishal Kulkarni.

 6) Various new indirect call avoidance, from Eric Dumazet and Brian
    Vazquez.

 7) Fix BPF verifier failures on 32-bit pointer arithmetic, from
    Yonghong Song.

 8) Support querying and setting hardware address of a port function via
    devlink, use this in mlx5, from Parav Pandit.

 9) Support hw ipsec offload on bonding slaves, from Jarod Wilson.

10) Switch qca8k driver over to phylink, from Jonathan McDowell.

11) In bpftool, show list of processes holding BPF FD references to
    maps, programs, links, and btf objects. From Andrii Nakryiko.

12) Several conversions over to generic power management, from Vaibhav
    Gupta.

13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry
    Yakunin.

14) Various https url conversions, from Alexander A. Klimov.

15) Timestamping and PHC support for mscc PHY driver, from Antoine
    Tenart.

16) Support bpf iterating over tcp and udp sockets, from Yonghong Song.

17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov.

18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan.

19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several
    drivers. From Luc Van Oostenryck.

20) XDP support for xen-netfront, from Denis Kirjanov.

21) Support receive buffer autotuning in MPTCP, from Florian Westphal.

22) Support EF100 chip in sfc driver, from Edward Cree.

23) Add XDP support to mvpp2 driver, from Matteo Croce.

24) Support MPTCP in sock_diag, from Paolo Abeni.

25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic
    infrastructure, from Jakub Kicinski.

26) Several pci_ --> dma_ API conversions, from Christophe JAILLET.

27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel.

28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki.

29) Refactor a lot of networking socket option handling code in order to
    avoid set_fs() calls, from Christoph Hellwig.

30) Add rfc4884 support to icmp code, from Willem de Bruijn.

31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei.

32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin.

33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin.

34) Support TCP syncookies in MPTCP, from Flowian Westphal.

35) Fix several tricky cases of PMTU handling wrt. briding, from Stefano
    Brivio.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits)
  net: thunderx: initialize VF's mailbox mutex before first usage
  usb: hso: remove bogus check for EINPROGRESS
  usb: hso: no complaint about kmalloc failure
  hso: fix bailout in error case of probe
  ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
  selftests/net: relax cpu affinity requirement in msg_zerocopy test
  mptcp: be careful on subflow creation
  selftests: rtnetlink: make kci_test_encap() return sub-test result
  selftests: rtnetlink: correct the final return value for the test
  net: dsa: sja1105: use detected device id instead of DT one on mismatch
  tipc: set ub->ifindex for local ipv6 address
  ipv6: add ipv6_dev_find()
  net: openvswitch: silence suspicious RCU usage warning
  Revert "vxlan: fix tos value before xmit"
  ptp: only allow phase values lower than 1 period
  farsync: switch from 'pci_' to 'dma_' API
  wan: wanxl: switch from 'pci_' to 'dma_' API
  hv_netvsc: do not use VF device if link is down
  dpaa2-eth: Fix passing zero to 'PTR_ERR' warning
  net: macb: Properly handle phylink on at91sam9x
  ...
2020-08-05 20:13:21 -07:00
Stefano Brivio
8ed54f167a ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM
On architectures defining _HAVE_ARCH_IPV6_CSUM, we get
csum_ipv6_magic() defined by means of arch checksum.h headers. On
other architectures, we actually need to include net/ip6_checksum.h
to be able to use it.

Without this include, building with defconfig breaks at least for
s390.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Fixes: 4cb47a8644 ("tunnels: PMTU discovery support for directly bridged IP packets")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-05 12:30:36 -07:00
Paolo Abeni
adf7341064 mptcp: be careful on subflow creation
Nicolas reported the following oops:

[ 1521.392541] BUG: kernel NULL pointer dereference, address: 00000000000000c0
[ 1521.394189] #PF: supervisor read access in kernel mode
[ 1521.395376] #PF: error_code(0x0000) - not-present page
[ 1521.396607] PGD 0 P4D 0
[ 1521.397156] Oops: 0000 [#1] SMP PTI
[ 1521.398020] CPU: 0 PID: 22986 Comm: kworker/0:2 Not tainted 5.8.0-rc4+ #109
[ 1521.399618] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
[ 1521.401728] Workqueue: events mptcp_worker
[ 1521.402651] RIP: 0010:mptcp_subflow_create_socket+0xf1/0x1c0
[ 1521.403954] Code: 24 08 89 44 24 04 48 8b 7a 18 e8 2a 48 d4 ff 8b 44 24 04 85 c0 75 7a 48 8b 8b 78 02 00 00 48 8b 54 24 08 48 8d bb 80 00 00 00 <48> 8b 89 c0 00 00 00 48 89 8a c0 00 00 00 48 8b 8b 78 02 00 00 8b
[ 1521.408201] RSP: 0000:ffffabc4002d3c60 EFLAGS: 00010246
[ 1521.409433] RAX: 0000000000000000 RBX: ffffa0b9ad8c9a00 RCX: 0000000000000000
[ 1521.411096] RDX: ffffa0b9ae78a300 RSI: 00000000fffffe01 RDI: ffffa0b9ad8c9a80
[ 1521.412734] RBP: ffffa0b9adff2e80 R08: ffffa0b9af02d640 R09: ffffa0b9ad923a00
[ 1521.414333] R10: ffffabc4007139f8 R11: fefefefefefefeff R12: ffffabc4002d3cb0
[ 1521.415918] R13: ffffa0b9ad91fa58 R14: ffffa0b9ad8c9f9c R15: 0000000000000000
[ 1521.417592] FS:  0000000000000000(0000) GS:ffffa0b9af000000(0000) knlGS:0000000000000000
[ 1521.419490] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1521.420839] CR2: 00000000000000c0 CR3: 000000002951e006 CR4: 0000000000160ef0
[ 1521.422511] Call Trace:
[ 1521.423103]  __mptcp_subflow_connect+0x94/0x1f0
[ 1521.425376]  mptcp_pm_create_subflow_or_signal_addr+0x200/0x2a0
[ 1521.426736]  mptcp_worker+0x31b/0x390
[ 1521.431324]  process_one_work+0x1fc/0x3f0
[ 1521.432268]  worker_thread+0x2d/0x3b0
[ 1521.434197]  kthread+0x117/0x130
[ 1521.435783]  ret_from_fork+0x22/0x30

on some unconventional configuration.

The MPTCP protocol is trying to create a subflow for an
unaccepted server socket. That is allowed by the RFC, even
if subflow creation will likely fail.
Unaccepted sockets have still a NULL sk_socket field,
avoid the issue by failing earlier.

Reported-and-tested-by: Nicolas Rybowski <nicolas.rybowski@tessares.net>
Fixes: 7d14b0d2b9 ("mptcp: set correct vfs info for subflows")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-05 12:24:20 -07:00
Xin Long
5a6f6f5791 tipc: set ub->ifindex for local ipv6 address
Without ub->ifindex set for ipv6 address in tipc_udp_enable(),
ipv6_sock_mc_join() may make the wrong dev join the multicast
address in enable_mcast(). This causes that tipc links would
never be created.

So fix it by getting the right netdev and setting ub->ifindex,
as it does for ipv4 address.

Reported-by: Shuang Li <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-05 12:19:52 -07:00
Xin Long
81f6cb3122 ipv6: add ipv6_dev_find()
This is to add an ip_dev_find like function for ipv6, used to find
the dev by saddr.

It will be used by TIPC protocol. So also export it.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-05 12:19:52 -07:00
Tonghao Zhang
5845589ed6 net: openvswitch: silence suspicious RCU usage warning
ovs_flow_tbl_destroy always is called from RCU callback
or error path. It is no need to check if rcu_read_lock
or lockdep_ovsl_is_held was held.

ovs_dp_cmd_fill_info always is called with ovs_mutex,
So use the rcu_dereference_ovsl instead of rcu_dereference
in ovs_flow_tbl_masks_cache_size.

Fixes: 9bf24f594c ("net: openvswitch: make masks cache size configurable")
Cc: Eelco Chaudron <echaudro@redhat.com>
Reported-by: syzbot+c0eb9e7cdde04e4eb4be@syzkaller.appspotmail.com
Reported-by: syzbot+f612c02823acb02ff9bc@syzkaller.appspotmail.com
Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-05 12:11:46 -07:00
Linus Torvalds
2324d50d05 It's been a busy cycle for documentation - hopefully the busiest for a
while to come.  Changes include:
 
  - Some new Chinese translations
 
  - Progress on the battle against double words words and non-HTTPS URLs
 
  - Some block-mq documentation
 
  - More RST conversions from Mauro.  At this point, that task is
    essentially complete, so we shouldn't see this kind of churn again for a
    while.  Unless we decide to switch to asciidoc or something...:)
 
  - Lots of typo fixes, warning fixes, and more.
 -----BEGIN PGP SIGNATURE-----
 
 iQFDBAABCAAtFiEEIw+MvkEiF49krdp9F0NaE2wMflgFAl8oVkwPHGNvcmJldEBs
 d24ubmV0AAoJEBdDWhNsDH5YoW8H/jJ/xnXFn7tkgVPQAlL3k5HCnK7A5nDP9RVR
 cg1pTx1cEFdjzxPlJyExU6/v+AImOvtweHXC+JDK7YcJ6XFUNYXJI3LxL5KwUXbY
 BL/xRFszDSXH2C7SJF5GECcFYp01e/FWSLN3yWAh+g+XwsKiTJ8q9+CoIDkHfPGO
 7oQsHKFu6s36Af0LfSgxk4sVB7EJbo8e4psuPsP5SUrl+oXRO43Put0rXkR4yJoH
 9oOaB51Do5fZp8I4JVAqGXvpXoExyLMO4yw0mASm6YSZ3KyjR8Fae+HD9Cq4ZuwY
 0uzb9K+9NEhqbfwtyBsi99S64/6Zo/MonwKwevZuhtsDTK4l4iU=
 =JQLZ
 -----END PGP SIGNATURE-----

Merge tag 'docs-5.9' of git://git.lwn.net/linux

Pull documentation updates from Jonathan Corbet:
 "It's been a busy cycle for documentation - hopefully the busiest for a
  while to come. Changes include:

   - Some new Chinese translations

   - Progress on the battle against double words words and non-HTTPS
     URLs

   - Some block-mq documentation

   - More RST conversions from Mauro. At this point, that task is
     essentially complete, so we shouldn't see this kind of churn again
     for a while. Unless we decide to switch to asciidoc or
     something...:)

   - Lots of typo fixes, warning fixes, and more"

* tag 'docs-5.9' of git://git.lwn.net/linux: (195 commits)
  scripts/kernel-doc: optionally treat warnings as errors
  docs: ia64: correct typo
  mailmap: add entry for <alobakin@marvell.com>
  doc/zh_CN: add cpu-load Chinese version
  Documentation/admin-guide: tainted-kernels: fix spelling mistake
  MAINTAINERS: adjust kprobes.rst entry to new location
  devices.txt: document rfkill allocation
  PCI: correct flag name
  docs: filesystems: vfs: correct flag name
  docs: filesystems: vfs: correct sync_mode flag names
  docs: path-lookup: markup fixes for emphasis
  docs: path-lookup: more markup fixes
  docs: path-lookup: fix HTML entity mojibake
  CREDITS: Replace HTTP links with HTTPS ones
  docs: process: Add an example for creating a fixes tag
  doc/zh_CN: add Chinese translation prefer section
  doc/zh_CN: add clearing-warn-once Chinese version
  doc/zh_CN: add admin-guide index
  doc:it_IT: process: coding-style.rst: Correct __maybe_unused compiler label
  futex: MAINTAINERS: Re-add selftests directory
  ...
2020-08-04 22:47:54 -07:00
Olga Kornievskaia
7de62bc09f SUNRPC dont update timeout value on connection reset
Current behaviour: every time a v3 operation is re-sent to the server
we update (double) the timeout. There is no distinction between whether
or not the previous timer had expired before the re-sent happened.

Here's the scenario:
1. Client sends a v3 operation
2. Server RST-s the connection (prior to the timeout) (eg., connection
is immediately reset)
3. Client re-sends a v3 operation but the timeout is now 120sec.

As a result, an application sees 2mins pause before a retry in case
server again does not reply.

Instead, this patch proposes to keep track off when the minor timeout
should happen and if it didn't, then don't update the new timeout.
Value is updated based on the previous value to make timeouts
predictable.

Signed-off-by: Olga Kornievskaia <kolga@netapp.com>
Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
2020-08-04 23:17:11 -04:00
Linus Torvalds
3950e97543 Merge branch 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace
Pull execve updates from Eric Biederman:
 "During the development of v5.7 I ran into bugs and quality of
  implementation issues related to exec that could not be easily fixed
  because of the way exec is implemented. So I have been diggin into
  exec and cleaning up what I can.

  This cycle I have been looking at different ideas and different
  implementations to see what is possible to improve exec, and cleaning
  the way exec interfaces with in kernel users. Only cleaning up the
  interfaces of exec with rest of the kernel has managed to stabalize
  and make it through review in time for v5.9-rc1 resulting in 2 sets of
  changes this cycle.

   - Implement kernel_execve

   - Make the user mode driver code a better citizen

  With kernel_execve the code size got a little larger as the copying of
  parameters from userspace and copying of parameters from userspace is
  now separate. The good news is kernel threads no longer need to play
  games with set_fs to use exec. Which when combined with the rest of
  Christophs set_fs changes should security bugs with set_fs much more
  difficult"

* 'exec-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (23 commits)
  exec: Implement kernel_execve
  exec: Factor bprm_stack_limits out of prepare_arg_pages
  exec: Factor bprm_execve out of do_execve_common
  exec: Move bprm_mm_init into alloc_bprm
  exec: Move initialization of bprm->filename into alloc_bprm
  exec: Factor out alloc_bprm
  exec: Remove unnecessary spaces from binfmts.h
  umd: Stop using split_argv
  umd: Remove exit_umh
  bpfilter: Take advantage of the facilities of struct pid
  exit: Factor thread_group_exited out of pidfd_poll
  umd: Track user space drivers with struct pid
  bpfilter: Move bpfilter_umh back into init data
  exec: Remove do_execve_file
  umh: Stop calling do_execve_file
  umd: Transform fork_usermode_blob into fork_usermode_driver
  umd: Rename umd_info.cmdline umd_info.driver_name
  umd: For clarity rename umh_info umd_info
  umh: Separate the user mode driver and the user mode helper support
  umh: Remove call_usermodehelper_setup_file.
  ...
2020-08-04 14:27:25 -07:00
Linus Torvalds
fd76a74d94 audit/stable-5.9 PR 20200803
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEES0KozwfymdVUl37v6iDy2pc3iXMFAl8okpIUHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQ6iDy2pc3iXNqOQ/8D+m9Ykcby3csEKsp8YtsaukEu62U
 lRVaxzRNO9wwB24aFwDFuJnIkmsSi/s/O4nBsy2mw+Apn+uDCvHQ9tBU07vlNn2f
 lu27YaTya7YGlqoe315xijd8tyoX99k8cpQeixvAVr9/jdR09yka7SJ8O7X9mjV7
 +SUVDiKCplPKpiwCCRS9cqD7F64T6y35XKzbrzYqdP0UOF2XelZo/Evt5rDRvWUf
 5qDN2tP+iM/Fvu5lCfczFwAeivfAdxjQ11n783hx8Ms2qyiaKQCzbEwjqAslmkbs
 1k/+ED0NjzXX1ne0JZaz/bk0wsMnmOoa8o+NDcyd7Za/cj5prUZi7kBy+xry4YV8
 qKJ40Lk0flCWgUpm6bkYVOByIYHk0gmfBNvjilqf25NR/eOC/9e9ir8PywvYUW/7
 kvVK37+N/a3LnFj80sZpIeqqnNU8z9PV1i7//5/kDuKvz94Bq83TJDO6pPKvqDtC
 njQfCFoHwdEeF8OalK793lIiYaoODqvbkWKChKMqziODJ4ZP8AW06gXpEbEWn7G3
 TTnJx7hqzR9t90vBQJeO3Fromfn+9TDlZVdX+EGO8gIqUiLGr0r7LPPep4VkDbNw
 LxMYKeC2cgRp8Z+XXPDxfXSDL2psTwg6CXcDrXcYnUyBo/yerpBvbJkeaR0h+UR0
 j6cvMX+T39X2JXM=
 =Xs3M
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20200803' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit updates from Paul Moore:
 "Aside from some smaller bug fixes, here are the highlights:

   - add a new backlog wait metric to the audit status message, this is
     intended to help admins determine how long processes have been
     waiting for the audit backlog queue to clear

   - generate audit records for nftables configuration changes

   - generate CWD audit records for for the relevant LSM audit records"

* tag 'audit-pr-20200803' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: report audit wait metric in audit status reply
  audit: purge audit_log_string from the intra-kernel audit API
  audit: issue CWD record to accompany LSM_AUDIT_DATA_* records
  audit: use the proper gfp flags in the audit_log_nfcfg() calls
  audit: remove unused !CONFIG_AUDITSYSCALL __audit_inode* stubs
  audit: add gfp parameter to audit_log_nfcfg
  audit: log nftables configuration change events
  audit: Use struct_size() helper in alloc_chunk
2020-08-04 14:20:26 -07:00
Linus Torvalds
9ecc6ea491 seccomp updates for v5.9-rc1
- Improved selftest coverage, timeouts, and reporting
 - Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner)
 - Refactor __scm_install_fd() into __receive_fd() and fix buggy callers
 - Introduce "addfd" command for SECCOMP_RET_USER_NOTIF (Sargun Dhillon)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oZcQWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJomDD/4x3j7eXREcXDsHOmlgEaHWGx4l
 JldHFQhV5GjmD7gOkPcoZSG7NfG7F6VpwAJg7ZoR3qUkem7K8DFucxqgo1RldCot
 nigleeLX6JeMS0Z+iwjAVZd+5t4xG4J/7GGDHIIMiG5qvwJ0Yf64o1bkjaB2Q/Bv
 tluBg0WF32kFMG/ZwyY/V2QDbbue97CFPflybOh1o2nWbVzmUlFEEum3UUvZsxc8
 smMsattJyuAV7kcEKzKrs8b010NdFZqwdbub5Np9W3XEXGBYMdIPoNsOQGmB9wby
 j2ui0lzboXRG997jM7TCd1l/XZAv8aAwvPplw3FJRybzkOGs9NDyLMoz87yJpR1T
 xp511vnMyMbyKIGdungkt7cIyzaictHwaYzznsmuNdCPEjTaIQJr1ctsa4GEgtqf
 pnkktZ9YbMCcHU0CtZ8GlOVqA9wE+FUm0/u0zgikzJQsB+HcNItiARTTTHRyco7p
 VJCqK8o4Zx4ELV7QNkSH4nhFkVgRopvrvBiPAGro/qwGOofBg8W8wM8O1+V/MDmp
 zSU22v4SncT1Xb7dtmdJqDEeHfDikhaCAb4Je2hsGQWzbdAqwHGlpa7vpk9x3Q5r
 L+XyP+Z+rPHlXYyypJwUvvOQhXOmP0zYxcEHxByqIBfXiwy+3dN4tDDfatWbccwl
 uTlTDM8kmQn6QzSztA==
 =yb55
 -----END PGP SIGNATURE-----

Merge tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull seccomp updates from Kees Cook:
 "There are a bunch of clean ups and selftest improvements along with
  two major updates to the SECCOMP_RET_USER_NOTIF filter return:
  EPOLLHUP support to more easily detect the death of a monitored
  process, and being able to inject fds when intercepting syscalls that
  expect an fd-opening side-effect (needed by both container folks and
  Chrome). The latter continued the refactoring of __scm_install_fd()
  started by Christoph, and in the process found and fixed a handful of
  bugs in various callers.

   - Improved selftest coverage, timeouts, and reporting

   - Add EPOLLHUP support for SECCOMP_RET_USER_NOTIF (Christian Brauner)

   - Refactor __scm_install_fd() into __receive_fd() and fix buggy
     callers

   - Introduce 'addfd' command for SECCOMP_RET_USER_NOTIF (Sargun
     Dhillon)"

* tag 'seccomp-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: (30 commits)
  selftests/seccomp: Test SECCOMP_IOCTL_NOTIF_ADDFD
  seccomp: Introduce addfd ioctl to seccomp user notifier
  fs: Expand __receive_fd() to accept existing fd
  pidfd: Replace open-coded receive_fd()
  fs: Add receive_fd() wrapper for __receive_fd()
  fs: Move __scm_install_fd() to __receive_fd()
  net/scm: Regularize compat handling of scm_detach_fds()
  pidfd: Add missing sock updates for pidfd_getfd()
  net/compat: Add missing sock updates for SCM_RIGHTS
  selftests/seccomp: Check ENOSYS under tracing
  selftests/seccomp: Refactor to use fixture variants
  selftests/harness: Clean up kern-doc for fixtures
  seccomp: Use -1 marker for end of mode 1 syscall list
  seccomp: Fix ioctl number for SECCOMP_IOCTL_NOTIF_ID_VALID
  selftests/seccomp: Rename user_trap_syscall() to user_notif_syscall()
  selftests/seccomp: Make kcmp() less required
  seccomp: Use pr_fmt
  selftests/seccomp: Improve calibration loop
  selftests/seccomp: use 90s as timeout
  selftests/seccomp: Expand benchmark to per-filter measurements
  ...
2020-08-04 14:11:08 -07:00
Linus Torvalds
99ea1521a0 Remove uninitialized_var() macro for v5.9-rc1
- Clean up non-trivial uses of uninitialized_var()
 - Update documentation and checkpatch for uninitialized_var() removal
 - Treewide removal of uninitialized_var()
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oYLQWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJsfjEACvf0D3WL3H7sLHtZ2HeMwOgAzq
 il08t6vUscINQwiIIK3Be43ok3uQ1Q+bj8sr2gSYTwunV2IYHFferzgzhyMMno3o
 XBIGd1E+v1E4DGBOiRXJvacBivKrfvrdZ7AWiGlVBKfg2E0fL1aQbe9AYJ6eJSbp
 UGqkBkE207dugS5SQcwrlk1tWKUL089lhDAPd7iy/5RK76OsLRCJFzIerLHF2ZK2
 BwvA+NWXVQI6pNZ0aRtEtbbxwEU4X+2J/uaXH5kJDszMwRrgBT2qoedVu5LXFPi8
 +B84IzM2lii1HAFbrFlRyL/EMueVFzieN40EOB6O8wt60Y4iCy5wOUzAdZwFuSTI
 h0xT3JI8BWtpB3W+ryas9cl9GoOHHtPA8dShuV+Y+Q2bWe1Fs6kTl2Z4m4zKq56z
 63wQCdveFOkqiCLZb8s6FhnS11wKtAX4czvXRXaUPgdVQS1Ibyba851CRHIEY+9I
 AbtogoPN8FXzLsJn7pIxHR4ADz+eZ0dQ18f2hhQpP6/co65bYizNP5H3h+t9hGHG
 k3r2k8T+jpFPaddpZMvRvIVD8O2HvJZQTyY6Vvneuv6pnQWtr2DqPFn2YooRnzoa
 dbBMtpon+vYz6OWokC5QNWLqHWqvY9TmMfcVFUXE4AFse8vh4wJ8jJCNOFVp8On+
 drhmmImUr1YylrtVOw==
 =xHmk
 -----END PGP SIGNATURE-----

Merge tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull uninitialized_var() macro removal from Kees Cook:
 "This is long overdue, and has hidden too many bugs over the years. The
  series has several "by hand" fixes, and then a trivial treewide
  replacement.

   - Clean up non-trivial uses of uninitialized_var()

   - Update documentation and checkpatch for uninitialized_var() removal

   - Treewide removal of uninitialized_var()"

* tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  compiler: Remove uninitialized_var() macro
  treewide: Remove uninitialized_var() usage
  checkpatch: Remove awareness of uninitialized_var() macro
  mm/debug_vm_pgtable: Remove uninitialized_var() usage
  f2fs: Eliminate usage of uninitialized_var() macro
  media: sur40: Remove uninitialized_var() usage
  KVM: PPC: Book3S PR: Remove uninitialized_var() usage
  clk: spear: Remove uninitialized_var() usage
  clk: st: Remove uninitialized_var() usage
  spi: davinci: Remove uninitialized_var() usage
  ide: Remove uninitialized_var() usage
  rtlwifi: rtl8192cu: Remove uninitialized_var() usage
  b43: Remove uninitialized_var() usage
  drbd: Remove uninitialized_var() usage
  x86/mm/numa: Remove uninitialized_var() usage
  docs: deprecated.rst: Add uninitialized_var()
2020-08-04 13:49:43 -07:00
Linus Torvalds
427714f258 tasklets API update for v5.9-rc1
- Prepare for tasklet API modernization (Romain Perier, Allen Pais, Kees Cook)
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEEpcP2jyKd1g9yPm4TiXL039xtwCYFAl8oXpMWHGtlZXNjb29r
 QGNocm9taXVtLm9yZwAKCRCJcvTf3G3AJtJgEACVb88nzYwu5mC5ZcfvwSyXeQsR
 eDpCkX5HT6CsxlOn0/YJvxUtkkerQftbRuAXrzoUpQkpyBh82PviVZFKDS7NE9Lc
 6xPqloi2gbZ8EfgMraVynL+9lpLh0+qNCM7LPg4xT+JxMDLut/nWRdrp8d7uBYfQ
 AXV6CV4Tc4ijOMROV6AEVVdSTzkRCbiqUnRDBLETBfiJOdDn5MgJgxicWvN5FTpu
 PiUVF3CtWaKCRfQO/GEAXTG65hOtmql5IbX9n7uooNu/wCCnEFfVUus1uTcsrqxN
 ByrZ56NVPoO7z2jYLt8Lft3myo2e/mn88PKqrzS2p9GPn0VBv7rcO3ePmbbHL/RA
 mp+pg8wdpmKrHv4YGfsF+obT1v8f6VJoTLUt5S/WqZAzl1sVJgEJdAkjmDKythGG
 yYKKCemMceMMzLXxnFAYMzdXzdXZ3YEpiW4UkBb77EhUisDrLxCHSL5t4UzyWnuO
 Gtzw7N69iHPHLsxAk1hESAD8sdlk2EdN6vzJVelOsiW955x1hpR+msvNpwZwBqdq
 A2h8VnnrxLK2APl93T5VW9T6kvhzaTwLhoCH+oKklE+U0XJTAYZ4D/AcRVghBvMg
 bC1+1vDx+t/S+8P308evPQnEygLtL2I+zpPnBA1DZzHRAoY8inCLc5HQOfr6pi/f
 koNTtKkmSSKaFSYITw==
 =hb+e
 -----END PGP SIGNATURE-----

Merge tag 'tasklets-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux

Pull tasklets API update from Kees Cook:
 "These are the infrastructure updates needed to support converting the
  tasklet API to something more modern (and hopefully for removal
  further down the road).

  There is a 300-patch series waiting in the wings to get set out to
  subsystem maintainers, but these changes need to be present in the
  kernel first. Since this has some treewide changes, I carried this
  series for -next instead of paining Thomas with it in -tip, but it's
  got his Ack.

  This is similar to the timer_struct modernization from a while back,
  but not nearly as messy (I hope). :)

   - Prepare for tasklet API modernization (Romain Perier, Allen Pais,
     Kees Cook)"

* tag 'tasklets-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux:
  tasklet: Introduce new initialization API
  treewide: Replace DECLARE_TASKLET() with DECLARE_TASKLET_OLD()
  usb: gadget: udc: Avoid tasklet passing a global
2020-08-04 13:40:35 -07:00
David S. Miller
ee895a30ef Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Flush the cleanup xtables worker to make sure destructors
   have completed, from Florian Westphal.

2) iifgroup is matching erroneously, also from Florian.

3) Add selftest for meta interface matching, from Florian Westphal.

4) Move nf_ct_offload_timeout() to header, from Roi Dayan.

5) Call nf_ct_offload_timeout() from flow_offload_add() to
   make sure garbage collection does not evict offloaded flow,
   from Roi Dayan.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-04 13:32:39 -07:00
Stefano Brivio
4cb47a8644 tunnels: PMTU discovery support for directly bridged IP packets
It's currently possible to bridge Ethernet tunnels carrying IP
packets directly to external interfaces without assigning them
addresses and routes on the bridged network itself: this is the case
for UDP tunnels bridged with a standard bridge or by Open vSwitch.

PMTU discovery is currently broken with those configurations, because
the encapsulation effectively decreases the MTU of the link, and
while we are able to account for this using PMTU discovery on the
lower layer, we don't have a way to relay ICMP or ICMPv6 messages
needed by the sender, because we don't have valid routes to it.

On the other hand, as a tunnel endpoint, we can't fragment packets
as a general approach: this is for instance clearly forbidden for
VXLAN by RFC 7348, section 4.3:

   VTEPs MUST NOT fragment VXLAN packets.  Intermediate routers may
   fragment encapsulated VXLAN packets due to the larger frame size.
   The destination VTEP MAY silently discard such VXLAN fragments.

The same paragraph recommends that the MTU over the physical network
accomodates for encapsulations, but this isn't a practical option for
complex topologies, especially for typical Open vSwitch use cases.

Further, it states that:

   Other techniques like Path MTU discovery (see [RFC1191] and
   [RFC1981]) MAY be used to address this requirement as well.

Now, PMTU discovery already works for routed interfaces, we get
route exceptions created by the encapsulation device as they receive
ICMP Fragmentation Needed and ICMPv6 Packet Too Big messages, and
we already rebuild those messages with the appropriate MTU and route
them back to the sender.

Add the missing bits for bridged cases:

- checks in skb_tunnel_check_pmtu() to understand if it's appropriate
  to trigger a reply according to RFC 1122 section 3.2.2 for ICMP and
  RFC 4443 section 2.4 for ICMPv6. This function is already called by
  UDP tunnels

- a new function generating those ICMP or ICMPv6 replies. We can't
  reuse icmp_send() and icmp6_send() as we don't see the sender as a
  valid destination. This doesn't need to be generic, as we don't
  cover any other type of ICMP errors given that we only provide an
  encapsulation function to the sender

While at it, make the MTU check in skb_tunnel_check_pmtu() accurate:
we might receive GSO buffers here, and the passed headroom already
includes the inner MAC length, so we don't have to account for it
a second time (that would imply three MAC headers on the wire, but
there are just two).

This issue became visible while bridging IPv6 packets with 4500 bytes
of payload over GENEVE using IPv4 with a PMTU of 4000. Given the 50
bytes of encapsulation headroom, we would advertise MTU as 3950, and
we would reject fragmented IPv6 datagrams of 3958 bytes size on the
wire. We're exclusively dealing with network MTU here, though, so we
could get Ethernet frames up to 3964 octets in that case.

v2:
- moved skb_tunnel_check_pmtu() to ip_tunnel_core.c (David Ahern)
- split IPv4/IPv6 functions (David Ahern)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-04 13:01:45 -07:00
Stefano Brivio
df23bb18b4 ipv4: route: Ignore output interface in FIB lookup for PMTU route
Currently, processes sending traffic to a local bridge with an
encapsulation device as a port don't get ICMP errors if they exceed
the PMTU of the encapsulated link.

David Ahern suggested this as a hack, but it actually looks like
the correct solution: when we update the PMTU for a given destination
by means of updating or creating a route exception, the encapsulation
might trigger this because of PMTU discovery happening either on the
encapsulation device itself, or its lower layer. This happens on
bridged encapsulations only.

The output interface shouldn't matter, because we already have a
valid destination. Drop the output interface restriction from the
associated route lookup.

For UDP tunnels, we will now have a route exception created for the
encapsulation itself, with a MTU value reflecting its headroom, which
allows a bridge forwarding IP packets originated locally to deliver
errors back to the sending socket.

The behaviour is now consistent with IPv6 and verified with selftests
pmtu_ipv{4,6}_br_{geneve,vxlan}{4,6}_exception introduced later in
this series.

v2:
- reset output interface only for bridge ports (David Ahern)
- add and use netif_is_any_bridge_port() helper (David Ahern)

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-04 13:01:45 -07:00
David S. Miller
2e7199bd77 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-08-04

The following pull-request contains BPF updates for your *net-next* tree.

We've added 73 non-merge commits during the last 9 day(s) which contain
a total of 135 files changed, 4603 insertions(+), 1013 deletions(-).

The main changes are:

1) Implement bpf_link support for XDP. Also add LINK_DETACH operation for the BPF
   syscall allowing processes with BPF link FD to force-detach, from Andrii Nakryiko.

2) Add BPF iterator for map elements and to iterate all BPF programs for efficient
   in-kernel inspection, from Yonghong Song and Alexei Starovoitov.

3) Separate bpf_get_{stack,stackid}() helpers for perf events in BPF to avoid
   unwinder errors, from Song Liu.

4) Allow cgroup local storage map to be shared between programs on the same
   cgroup. Also extend BPF selftests with coverage, from YiFei Zhu.

5) Add BPF exception tables to ARM64 JIT in order to be able to JIT BPF_PROBE_MEM
   load instructions, from Jean-Philippe Brucker.

6) Follow-up fixes on BPF socket lookup in combination with reuseport group
   handling. Also add related BPF selftests, from Jakub Sitnicki.

7) Allow to use socket storage in BPF_PROG_TYPE_CGROUP_SOCK-typed programs for
   socket create/release as well as bind functions, from Stanislav Fomichev.

8) Fix an info leak in xsk_getsockopt() when retrieving XDP stats via old struct
   xdp_statistics, from Peilin Ye.

9) Fix PT_REGS_RC{,_CORE}() macros in libbpf for MIPS arch, from Jerry Crunchtime.

10) Extend BPF kernel test infra with skb->family and skb->{local,remote}_ip{4,6}
    fields and allow user space to specify skb->dev via ifindex, from Dmitry Yakunin.

11) Fix a bpftool segfault due to missing program type name and make it more robust
    to prevent them in future gaps, from Quentin Monnet.

12) Consolidate cgroup helper functions across selftests and fix a v6 localhost
    resolver issue, from John Fastabend.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-03 18:27:40 -07:00
David S. Miller
76769c38b4 mlx5-updates-2020-08-03
This patchset introduces some updates to mlx5 driver.
 
 1) Jakub converts mlx5 to use the new udp tunnel infrastructure.
    Starting with a hack to allow drivers to request a static configuration
    of the default vxlan port, and then a patch that converts mlx5.
 
 2) Parav implements change_carrier ndo for VF eswitch representors,
    to speedup link state control of representors netdevices.
 
 3) Alex Vesker, makes a simple update to software steering to fix an issue
    with push vlan action sequence
 
 4) Leon removes a redundant dump stack on error flow.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAl8oRdgACgkQSD+KveBX
 +j4/LQgAkSjNzOaS7bVDzhoYL3aBQOMIzgocJUeVi7xXH8IO1uy55mNDrKBqjxbW
 dy9U9VsvV5i2V2qkkQLvHVkoDSg8Buo2Uxu4OrZHOLN0KfbFrra4VvmB1CzEBix8
 FICnQaZZcE7529P04TgZ8Mo9vRb5VdJFhqED5Nvegy+y8FolEsQYbjIoDBE6wa0j
 Meqa/29+XCE5FzTOjbbQWizAnRZMbkxtSSreDNgeHxke9eMSO+fmwKScng63QUfl
 7nfU6dW6A0d1kHhpL5RqAFOcmkpSdqYaA3SA+/8pPT9X3yOAkxE6KTKGIixpB9JX
 zQt+Wkna49jJ/JfDQB5vgww5c0HjAQ==
 =j0fG
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2020-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2020-08-03

This patchset introduces some updates to mlx5 driver.

1) Jakub converts mlx5 to use the new udp tunnel infrastructure.
   Starting with a hack to allow drivers to request a static configuration
   of the default vxlan port, and then a patch that converts mlx5.

2) Parav implements change_carrier ndo for VF eswitch representors,
   to speedup link state control of representors netdevices.

3) Alex Vesker, makes a simple update to software steering to fix an issue
   with push vlan action sequence

4) Leon removes a redundant dump stack on error flow.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-03 18:24:30 -07:00
Paolo Abeni
8555c6bfd5 mptcp: fix bogus sendmsg() return code under pressure
In case of memory pressure, mptcp_sendmsg() may call
sk_stream_wait_memory() after succesfully xmitting some
bytes. If the latter fails we currently return to the
user-space the error code, ignoring the succeful xmit.

Address the issue always checking for the xmitted bytes
before mptcp_sendmsg() completes.

Fixes: f296234c98 ("mptcp: Add handling of incoming MP_JOIN requests")
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-03 18:14:20 -07:00
Ido Schimmel
c88e11e047 devlink: Pass extack when setting trap's action and group's parameters
A later patch will refuse to set the action of certain traps in mlxsw
and also to change the policer binding of certain groups. Pass extack so
that failure could be communicated clearly to user space.

Reviewed-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-08-03 18:06:46 -07:00