Commit Graph

15428 Commits

Author SHA1 Message Date
Christoph Hellwig
9b115749ac ipv6: add ip6_sock_set_v6only
Add a helper to directly set the IPV6_V6ONLY sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
c1f9ec5776 ipv4: add ip_sock_set_pktinfo
Add a helper to directly set the IP_PKTINFO sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
2de569bda2 ipv4: add ip_sock_set_mtu_discover
Add a helper to directly set the IP_MTU_DISCOVER sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com> [rxrpc bits]
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
db45c0ef25 ipv4: add ip_sock_set_recverr
Add a helper to directly set the IP_RECVERR sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
c4e446bf5a ipv4: add ip_sock_set_freebind
Add a helper to directly set the IP_FREEBIND sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
6ebf71bab9 ipv4: add ip_sock_set_tos
Add a helper to directly set the IP_TOS sockopt from kernel space without
going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
fe31a326a4 net: add sock_set_reuseport
Add a helper to directly set the SO_REUSEPORT sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:45 -07:00
Christoph Hellwig
26cfabf9cd net: add sock_set_rcvbuf
Add a helper to directly set the SO_RCVBUFFORCE sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
ce3d9544ce net: add sock_set_keepalive
Add a helper to directly set the SO_KEEPALIVE sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
783da70e83 net: add sock_enable_timestamps
Add a helper to directly enable timestamps instead of setting the
SO_TIMESTAMP* sockopts from kernel space and going through a fake
uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
7594888c78 net: add sock_bindtoindex
Add a helper to directly set the SO_BINDTOIFINDEX sockopt from kernel
space without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
76ee0785f4 net: add sock_set_sndtimeo
Add a helper to directly set the SO_SNDTIMEO_NEW sockopt from kernel
space without going through a fake uaccess.  The interface is
simplified to only pass the seconds value, as that is the only
thing needed at the moment.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
6e43496745 net: add sock_set_priority
Add a helper to directly set the SO_PRIORITY sockopt from kernel space
without going through a fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
c433594c07 net: add sock_no_linger
Add a helper to directly set the SO_LINGER sockopt from kernel space
with onoff set to true and a linger time of 0 without going through a
fake uaccess.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Christoph Hellwig
b58f0e8f38 net: add sock_set_reuseaddr
Add a helper to directly set the SO_REUSEADDR sockopt from kernel space
without going through a fake uaccess.

For this the iscsi target now has to formally depend on inet to avoid
a mostly theoretical compile failure.  For actual operation it already
did depend on having ipv4 or ipv6 support.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Sagi Grimberg <sagi@grimberg.me>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:11:44 -07:00
Eric Dumazet
d29245692a tcp: ipv6: support RFC 6069 (TCP-LD)
Make tcp_ld_RTO_revert() helper available to IPv6, and
implement RFC 6069 :

Quoting this RFC :

3. Connectivity Disruption Indication

   For Internet Protocol version 6 (IPv6) [RFC2460], the counterpart of
   the ICMP destination unreachable message of code 0 (net unreachable)
   and of code 1 (host unreachable) is the ICMPv6 destination
   unreachable message of code 0 (no route to destination) [RFC4443].
   As with IPv4, a router should generate an ICMPv6 destination
   unreachable message of code 0 in response to a packet that cannot be
   delivered to its destination address because it lacks a matching
   entry in its routing table.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-28 11:02:46 -07:00
Jonas Falkevik
50ce4c099b sctp: fix typo sctp_ulpevent_nofity_peer_addr_change
change typo in function name "nofity" to "notify"
sctp_ulpevent_nofity_peer_addr_change ->
sctp_ulpevent_notify_peer_addr_change

Signed-off-by: Jonas Falkevik <jonas.falkevik@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:08:02 -07:00
Tariq Toukan
b3ae2459f8 net/tls: Add force_resync for driver resync
This patch adds a field to the tls rx offload context which enables
drivers to force a send_resync call.

This field can be used by drivers to request a resync at the next
possible tls record. It is beneficial for hardware that provides the
resync sequence number asynchronously. In such cases, the packet that
triggered the resync does not contain the information required for a
resync. Instead, the driver requests resync for all the following
TLS record until the asynchronous notification with the resync request
TCP sequence arrives.

A following series for mlx5e ConnectX-6DX TLS RX offload support will
use this mechanism.

Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-27 15:07:13 -07:00
Pablo Neira Ayuso
abadb2f865 netfilter: nf_tables: delete devices from flowtable
This patch allows users to delete devices from existing flowtables.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Pablo Neira Ayuso
78d9f48f7f netfilter: nf_tables: add devices to existing flowtable
This patch allows users to add devices to an existing flowtable.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Romain Bellan
cb8aa9a3af netfilter: ctnetlink: add kernel side filtering for dump
Conntrack dump does not support kernel side filtering (only get exists,
but it returns only one entry. And user has to give a full valid tuple)

It means that userspace has to implement filtering after receiving many
irrelevant entries, consuming resources (conntrack table is sometimes
very huge, much more than a routing table for example).

This patch adds filtering in kernel side. To achieve this goal, we:

 * Add a new CTA_FILTER netlink attributes, actually a flag list to
   parametize filtering
 * Convert some *nlattr_to_tuple() functions, to allow a partial parsing
   of CTA_TUPLE_ORIG and CTA_TUPLE_REPLY (so nf_conntrack_tuple it not
   fully set)

Filtering is now possible on:
 * IP SRC/DST values
 * Ports for TCP and UDP flows
 * IMCP(v6) codes types and IDs

Filtering is done as an "AND" operator. For example, when flags
PROTO_SRC_PORT, PROTO_NUM and IP_SRC are sets, only entries matching all
values are dumped.

Changes since v1:
  Set NLM_F_DUMP_FILTERED in nlm flags if entries are filtered

Changes since v2:
  Move several constants to nf_internals.h
  Move a fix on netlink values check in a separate patch
  Add a check on not-supported flags
  Return EOPNOTSUPP if CDA_FILTER is set in ctnetlink_flush_conntrack
  (not yet implemented)
  Code style issues

Changes since v3:
  Fix compilation warning reported by kbuild test robot

Changes since v4:
  Fix a regression introduced in v3 (returned EINVAL for valid netlink
  messages without CTA_MARK)

Changes since v5:
  Change definition of CTA_FILTER_F_ALL
  Fix a regression when CTA_TUPLE_ZONE is not set

Signed-off-by: Romain Bellan <romain.bellan@wifirst.fr>
Signed-off-by: Florent Fourcot <florent.fourcot@wifirst.fr>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-27 22:20:34 +02:00
Tamizh Chelvam
9a5f648862 nl80211: Add support to configure TID specific Tx rate configuration
This patch adds support to configure per TID Tx Rate configuration
through NL80211_TID_CONFIG_ATTR_TX_RATE* attributes. And it uses
nl80211_parse_tx_bitrate_mask api to validate the Tx rate mask.

Signed-off-by: Tamizh Chelvam <tamizhr@codeaurora.org>
Link: https://lore.kernel.org/r/1589357504-10175-1-git-send-email-tamizhr@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:03:25 +02:00
Markus Theil
dca9ca2d58 nl80211: add ability to report TX status for control port TX
This adds the necessary capabilities in nl80211 to allow drivers to
assign a cookie to control port TX frames (returned via extack in
the netlink ACK message of the command) and then later report the
frame's status.

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20200508144202.7678-2-markus.theil@tu-ilmenau.de
[use extack cookie instead of explicit message, recombine patches]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:02:04 +02:00
Gustavo A. R. Silva
3c23215ba8 mac80211: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20200507185907.GA15102@embeddedor
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:02:04 +02:00
Gustavo A. R. Silva
396fba0a59 cfg80211: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Link: https://lore.kernel.org/r/20200507183909.GA12993@embeddedor
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:02:03 +02:00
Thomas Pedersen
e76fede8bf cfg80211: add KHz variants of frame RX API
Drivers may wish to report the RX frequency in units of
KHz. Provide cfg80211_rx_mgmt_khz() and wrap it with
cfg80211_rx_mgmt() so exisiting drivers which can't report
KHz anyway don't need to change. Add a similar wrapper for
cfg80211_report_obss_beacon() so the frequency units stay
somewhat consistent.

This doesn't actually change the nl80211 API yet.

Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200430172554.18383-2-thomas@adapt-ip.com
[fix mac80211 calling the non-khz version of obss beacon report,
 drop trace point name changes]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:02:02 +02:00
Sergey Matyukevich
33462e6823 cfg80211: add support for TID specific AMSDU configuration
This patch adds support to control per TID MSDU aggregation
using the NL80211_TID_CONFIG_ATTR_AMSDU_CTRL attribute.

Signed-off-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20200424112905.26770-4-sergey.matyukevich.os@quantenna.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:02:01 +02:00
Sergey Matyukevich
2d5d9b7ff4 cfg80211: fix mask type in cfg80211_tid_cfg structure
TIDs mask type is u64 in wiphy settings and nl80211 processing, see:
- wiphy TIDs mask sizes in tid_config_support structure
- prepare driver command in parse_tid_conf

Use the same type for TIDs mask in cfg80211_tid_cfg.

Signed-off-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20200424112905.26770-2-sergey.matyukevich.os@quantenna.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-05-27 10:02:00 +02:00
David S. Miller
745bd6f44c One batch of changes, containing:
* hwsim improvements from Jouni and myself, to be able to
    test more scenarios easily
  * some more HE (802.11ax) support
  * some initial S1G (sub 1 GHz) work for fractional MHz channels
  * some (action) frame registration updates to help DPP support
  * along with other various improvements/fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl7L1A8ACgkQB8qZga/f
 l8T5RQ/9EUxFqBs0cWojwyad5nkesyl51eOnbvSCJJF14W93s2oMeikCynTPe8Vg
 km36041QZqGbwmU0yWC9Lmm4y3ja5qQGI+QW+vT6tutGQx6FgK5TzUfYXqiFZqf6
 asqkvHpH4VqmbG1KEp0PZjIpW/OVK96pbvtXVnkrcMmjl2JjbRtAhyZQVNtt9ufJ
 6wqKf8e6iYqMIInMFPLX+rl7UEknxDKVcqPbMMJmY8/iM1z9Elkg3rkRSMehC+mE
 8cznZ6BsjAGCbMiA8K9fUo15lcMfZCJ1hAPzkD4TsJtMEJ0gYDo5jDB8TIpr5uoL
 95OnlF8jokJIsO+1g4CyaNSQsmFIuDo84vW8LtGRu9qzTP0UwelxhjZLgE3xlP6b
 W+z5HomxfWkYhJhaNywLP3B1VPtJwX8dL/wpECOWHzNKXG7Rb6GqzUwaCRFb6Jjo
 TmFJ5wLoEZHhsXYO2dvcyTzCUCXviXvfq60a56IyCJN8wDqmcubePv0+NOHUmj3c
 E71NTYymM3j9agdSpXdCFLBXA1OgyIydeSNHuBlaPA4sK6tr4ikUtbOrABjYTaQz
 2BB5fHEi8gs4EiHbSXqLFBot3JHljKJPsSN0wAgzQffN+6Kts9FG6HcrLsL+duDg
 lRdAzRrunE85S0QhsxeVIX216rX4W08sl0B1rJR+dTMX9ByblAk=
 =MVBJ
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-net-next-2020-04-25' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
One batch of changes, containing:
 * hwsim improvements from Jouni and myself, to be able to
   test more scenarios easily
 * some more HE (802.11ax) support
 * some initial S1G (sub 1 GHz) work for fractional MHz channels
 * some (action) frame registration updates to help DPP support
 * along with other various improvements/fixes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 20:17:35 -07:00
David Ahern
1fd1c768f3 ipv4: nexthop version of fib_info_nh_uses_dev
Similar to the last path, need to fix fib_info_nh_uses_dev for
external nexthops to avoid referencing multiple nh_grp structs.
Move the device check in fib_info_nh_uses_dev to a helper and
create a nexthop version that is called if the fib_info uses an
external nexthop.

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:06:07 -07:00
David Ahern
af7888ad9e ipv4: Refactor nhc evaluation in fib_table_lookup
FIB lookups can return an entry that references an external nexthop.
While walking the nexthop struct we do not want to make multiple calls
into the nexthop code which can result in 2 different structs getting
accessed - one returning the number of paths the rest of the loop
seeing a different nh_grp struct. If the nexthop group shrunk, the
result is an attempt to access a fib_nh_common that does not exist for
the new nh_grp struct but did for the old one.

To fix that move the device evaluation code to a helper that can be
used for inline fib_nh path as well as external nexthops.

Update the existing check for fi->nh in fib_table_lookup to call a
new helper, nexthop_get_nhc_lookup, which walks the external nexthop
with a single rcu dereference.

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:06:07 -07:00
David Ahern
0b5e2e3973 nexthop: Expand nexthop_is_multipath in a few places
I got too fancy consolidating checks on multipath type. The result
is that path lookups can access 2 different nh_grp structs as exposed
by Nik's torture tests. Expand nexthop_is_multipath within nexthop.h to
avoid multiple, nh_grp dereferences and make decisions based on the
consistent struct.

Only 2 places left using nexthop_is_multipath are within IPv6, both
only check that the nexthop is a multipath for a branching decision
which are acceptable.

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:06:07 -07:00
Nikolay Aleksandrov
90f33bffa3 nexthops: don't modify published nexthop groups
We must avoid modifying published nexthop groups while they might be
in use, otherwise we might see NULL ptr dereferences. In order to do
that we allocate 2 nexthoup group structures upon nexthop creation
and swap between them when we have to delete an entry. The reason is
that we can't fail nexthop group removal, so we can't handle allocation
failure thus we move the extra allocation on creation where we can
safely fail and return ENOMEM.

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 16:06:06 -07:00
Guillaume Nault
58cff782cc flow_dissector: Parse multiple MPLS Label Stack Entries
The current MPLS dissector only parses the first MPLS Label Stack
Entry (second LSE can be parsed too, but only to set a key_id).

This patch adds the possibility to parse several LSEs by making
__skb_flow_dissect_mpls() return FLOW_DISSECT_RET_PROTO_AGAIN as long
as the Bottom Of Stack bit hasn't been seen, up to a maximum of
FLOW_DIS_MPLS_MAX entries.

FLOW_DIS_MPLS_MAX is arbitrarily set to 7. This should be enough for
many practical purposes, without wasting too much space.

To record the parsed values, flow_dissector_key_mpls is modified to
store an array of stack entries, instead of just the values of the
first one. A bit field, "used_lses", is also added to keep track of
the LSEs that have been set. The objective is to avoid defining a
new FLOW_DISSECTOR_KEY_MPLS_XX for each level of the MPLS stack.

TC flower is adapted for the new struct flow_dissector_key_mpls layout.
Matching on several MPLS Label Stack Entries will be added in the next
patch.

The NFP and MLX5 drivers are also adapted: nfp_flower_compile_mac() and
mlx5's parse_tunnel() now verify that the rule only uses the first LSE
and fail if it doesn't.

Finally, the behaviour of the FLOW_DISSECTOR_KEY_MPLS_ENTROPY key is
slightly modified. Instead of recording the first Entropy Label, it
now records the last one. This shouldn't have any consequences since
there doesn't seem to have any user of FLOW_DISSECTOR_KEY_MPLS_ENTROPY
in the tree. We'd probably better do a hash of all parsed MPLS labels
instead (excluding reserved labels) anyway. That'd give better entropy
and would probably also simplify the code. But that's not the purpose
of this patch, so I'm keeping that as a future possible improvement.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-26 15:22:58 -07:00
Vinay Kumar Yadav
0cada33241 net/tls: fix race condition causing kernel panic
tls_sw_recvmsg() and tls_decrypt_done() can be run concurrently.
// tls_sw_recvmsg()
	if (atomic_read(&ctx->decrypt_pending))
		crypto_wait_req(-EINPROGRESS, &ctx->async_wait);
	else
		reinit_completion(&ctx->async_wait.completion);

//tls_decrypt_done()
  	pending = atomic_dec_return(&ctx->decrypt_pending);

  	if (!pending && READ_ONCE(ctx->async_notify))
  		complete(&ctx->async_wait.completion);

Consider the scenario tls_decrypt_done() is about to run complete()

	if (!pending && READ_ONCE(ctx->async_notify))

and tls_sw_recvmsg() reads decrypt_pending == 0, does reinit_completion(),
then tls_decrypt_done() runs complete(). This sequence of execution
results in wrong completion. Consequently, for next decrypt request,
it will not wait for completion, eventually on connection close, crypto
resources freed, there is no way to handle pending decrypt response.

This race condition can be avoided by having atomic_read() mutually
exclusive with atomic_dec_return(),complete().Intoduced spin lock to
ensure the mutual exclution.

Addressed similar problem in tx direction.

v1->v2:
- More readable commit message.
- Corrected the lock to fix new race scenario.
- Removed barrier which is not needed now.

Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Signed-off-by: Vinay Kumar Yadav <vinay.yadav@chelsio.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-25 17:41:40 -07:00
David S. Miller
13209a8f73 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The MSCC bug fix in 'net' had to be slightly adjusted because the
register accesses are done slightly differently in net-next.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-24 13:47:27 -07:00
David S. Miller
46c54f9500 mlx5-updates-2020-05-22
This series includes two updates and one cleanup patch
 
 1) Tang Bim, clean-up with IS_ERR() usage
 
 2) Vlad introduces a new mlx5 kconfig flag for TC support
 
    This is required due to the high volume of current and upcoming
    development in the eswitch and representors areas where some of the
    feature are TC based such as the downstream patches of MPLSoUDP and
    the following representor bonding support for VF live migration and
    uplink representor dynamic loading.
    For this Vlad kept TC specific code in tc.c and rep/tc.c and
    organized non TC code in representors specific files.
 
 3) Eli Cohen adds support for MPLS over UPD encap and decap TC offloads.
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAl7IZFEACgkQSD+KveBX
 +j7IEQf/RFv633bWTlL63fEJjViRv1rjfkbyaXrGVL3gzr/Er01DeAPR22CNOlC3
 bu1jHLKqVn0Mg0g5g2B4/H/7JoFbMBRTy4MXpM5VrQCIqwMuXG4zhWuoUj7ncQ5w
 kXHAU6DUuZRn8/x1JLQOHDRTzKhav7ldT+nvvoKEMrad/DEMGz+bq67xh4l8nfi+
 ktSFAO0UFi9ysb25CMfdqIqAL0J5nAJ7DNhw5x7IvtwUxNxate7HtBaBhBgZ9NWv
 jYf8R3p+7JdgvVW18pZhmjbaBqaApXcZrC7rI07PR6rCOAHfToX6miR8gUtpIEno
 itQkzYt9UF2dgNwMmxoJLqnUNiy/Cg==
 =wkSR
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2020-05-22' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2020-05-22

This series includes two updates and one cleanup patch

1) Tang Bim, clean-up with IS_ERR() usage

2) Vlad introduces a new mlx5 kconfig flag for TC support

   This is required due to the high volume of current and upcoming
   development in the eswitch and representors areas where some of the
   feature are TC based such as the downstream patches of MPLSoUDP and
   the following representor bonding support for VF live migration and
   uplink representor dynamic loading.
   For this Vlad kept TC specific code in tc.c and rep/tc.c and
   organized non TC code in representors specific files.

3) Eli Cohen adds support for MPLS over UPD encap and decap TC offloads.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-23 16:37:00 -07:00
David S. Miller
a152b85984 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-05-23

The following pull-request contains BPF updates for your *net-next* tree.

We've added 50 non-merge commits during the last 8 day(s) which contain
a total of 109 files changed, 2776 insertions(+), 2887 deletions(-).

The main changes are:

1) Add a new AF_XDP buffer allocation API to the core in order to help
   lowering the bar for drivers adopting AF_XDP support. i40e, ice, ixgbe
   as well as mlx5 have been moved over to the new API and also gained a
   small improvement in performance, from Björn Töpel and Magnus Karlsson.

2) Add getpeername()/getsockname() attach types for BPF sock_addr programs
   in order to allow for e.g. reverse translation of load-balancer backend
   to service address/port tuple from a connected peer, from Daniel Borkmann.

3) Improve the BPF verifier is_branch_taken() logic to evaluate pointers
   being non-NULL, e.g. if after an initial test another non-NULL test on
   that pointer follows in a given path, then it can be pruned right away,
   from John Fastabend.

4) Larger rework of BPF sockmap selftests to make output easier to understand
   and to reduce overall runtime as well as adding new BPF kTLS selftests
   that run in combination with sockmap, also from John Fastabend.

5) Batch of misc updates to BPF selftests including fixing up test_align
   to match verifier output again and moving it under test_progs, allowing
   bpf_iter selftest to compile on machines with older vmlinux.h, and
   updating config options for lirc and v6 segment routing helpers, from
   Stanislav Fomichev, Andrii Nakryiko and Alan Maguire.

6) Conversion of BPF tracing samples outdated internal BPF loader to use
   libbpf API instead, from Daniel T. Lee.

7) Follow-up to BPF kernel test infrastructure in order to fix a flake in
   the XDP selftests, from Jesper Dangaard Brouer.

8) Minor improvements to libbpf's internal hashmap implementation, from
   Ian Rogers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 18:30:34 -07:00
Eli Cohen
d43600c01c net: Add netif_is_bareudp() API to identify bareudp devices
Add netif_is_bareudp() so the device can be identified as a bareudp one.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-22 16:46:16 -07:00
Horatiu Vultur
89c1e111cb switchdev: mrp: Remove the variable mrp_ring_state
Remove the variable mrp_ring_state from switchdev_attr because is not
used anywhere.
The ring state is set using SWITCHDEV_OBJ_ID_RING_STATE_MRP.

Fixes: c284b54590 ("switchdev: mrp: Extend switchdev API to offload MRP")
Acked-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 16:17:15 -07:00
David S. Miller
4629ed2e48 RxRPC fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEqG5UsNXhtOCrfGQP+7dXa6fLC2sFAl7FQH4ACgkQ+7dXa6fL
 C2u05BAAlilXCYGZU23/yQmMxxkmcG+jW+oDV9ySNDl6iJT+OtKxHEGReDD4D2f0
 rhaBivgGcOnZy89AGjNrSVROGlwSBXl7ArJcjsfsx8AuNzxHUHQKWlW/k8n87qEt
 NCTze7f65IT6NowgYAFgJn5kIpY/9iKuNiCf6NGL3Z35wqxPvwNs6AQSGM495uvB
 el/ddkr8QzzjI9Ejsgzj94x4DAOjk4T4WzfWMAgyr1OEqz6vKNKkCwSKPySOsQAK
 72JRaGhWA9rfAOkA7nAZpnjHdfFYnkFBOVQzmswOJYRYe3D/QY5D9PUlGIQ5OSjL
 yV5YOi/+AUrSif79NfEYXga0r/NFJMFqBg2zo/eiSrhfZZFZMDcagnGhzpGjbYF1
 IaeIu4q/MQOQybi8m1GJhvFfPOhdKRn731jlsUvEoxK0TonSu/u64eus+qelQxOd
 uiIcu/kLxfPZSznUd8cXZ+Pffce0uBIRWq0nRQZ703TyHY+/gYo7ZGHr/FZNKaK4
 lRNP4Nu3goLQCI40R7y7USnpX+kWfd4mYC9zl+VBSXG1JymYbOezXYrNATBCqpo6
 9VoYtqDdo8ESksFUBqM7fGRDZ20nah6KdRGmnrPU+rpODHZEZmNN7D/rayU31wua
 VIbVw1WluvSVnQ8+b1BwJwvhJQ4CazyGcnbxDqx1zd07EjUiiJI=
 =jJvy
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-fixes-20200520' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Fix retransmission timeout and ACK discard

Here are a couple of fixes and an extra tracepoint for AF_RXRPC:

 (1) Calculate the RTO pretty much as TCP does, rather than making
     something up, including an initial 4s timeout (which causes return
     probes from the fileserver to fail if a packet goes missing), and add
     backoff.

 (2) Fix the discarding of out-of-order received ACKs.  We mustn't let the
     hard-ACK point regress, nor do we want to do unnecessary
     retransmission because the soft-ACK list regresses.  This is not
     trivial, however, due to some loose wording in various old protocol
     specs, the ACK field that should be used for this sometimes has the
     wrong information in it.

 (3) Add a tracepoint to log a discarded ACK.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 15:53:30 -07:00
Edward Cree
060b6381ef net: flow_offload: simplify hw stats check handling
Make FLOW_ACTION_HW_STATS_DONT_CARE be all bits, rather than none, so that
 drivers and __flow_action_hw_stats_check can use simple bitwise checks.

Pre-fill all actions with DONT_CARE in flow_rule_alloc(), rather than
 relying on implicit semantics of zero from kzalloc, so that callers which
 don't configure action stats themselves (i.e. netfilter) get the correct
 behaviour by default.

Only the kernel's internal API semantics change; the TC uAPI is unaffected.

v4: move DONT_CARE setting to flow_rule_alloc() for robustness and simplicity.

v3: set DONT_CARE in nft and ct offload.

v2: rebased on net-next, removed RFC tags.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 15:52:08 -07:00
Roopa Prabhu
8590ceedb7 nexthop: add support for notifiers
This patch adds nexthop add/del notifiers. To be used by
vxlan driver in a later patch. Could possibly be used by
switchdev drivers in the future.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 14:00:38 -07:00
Roopa Prabhu
1274e1cc42 vxlan: ecmp support for mac fdb entries
Todays vxlan mac fdb entries can point to multiple remote
ips (rdsts) with the sole purpose of replicating
broadcast-multicast and unknown unicast packets to those remote ips.

E-VPN multihoming [1,2,3] requires bridged vxlan traffic to be
load balanced to remote switches (vteps) belonging to the
same multi-homed ethernet segment (E-VPN multihoming is analogous
to multi-homed LAG implementations, but with the inter-switch
peerlink replaced with a vxlan tunnel). In other words it needs
support for mac ecmp. Furthermore, for faster convergence, E-VPN
multihoming needs the ability to update fdb ecmp nexthops independent
of the fdb entries.

New route nexthop API is perfect for this usecase.
This patch extends the vxlan fdb code to take a nexthop id
pointing to an ecmp nexthop group.

Changes include:
- New NDA_NH_ID attribute for fdbs
- Use the newly added fdb nexthop groups
- makes vxlan rdsts and nexthop handling code mutually
  exclusive
- since this is a new use-case and the requirement is for ecmp
nexthop groups, the fdb add and update path checks that the
nexthop is really an ecmp nexthop group. This check can be relaxed
in the future, if we want to introduce replication fdb nexthop groups
and allow its use in lieu of current rdst lists.
- fdb update requests with nexthop id's only allowed for existing
fdb's that have nexthop id's
- learning will not override an existing fdb entry with nexthop
group
- I have wrapped the switchdev offload code around the presence of
rdst

[1] E-VPN RFC https://tools.ietf.org/html/rfc7432
[2] E-VPN with vxlan https://tools.ietf.org/html/rfc8365
[3] http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf

Includes a null check fix in vxlan_xmit from Nikolay

v2 - Fixed build issue:
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 14:00:38 -07:00
Roopa Prabhu
38428d6871 nexthop: support for fdb ecmp nexthops
This patch introduces ecmp nexthops and nexthop groups
for mac fdb entries. In subsequent patches this is used
by the vxlan driver fdb entries. The use case is
E-VPN multihoming [1,2,3] which requires bridged vxlan traffic
to be load balanced to remote switches (vteps) belonging to
the same multi-homed ethernet segment (This is analogous to
a multi-homed LAG but over vxlan).

Changes include new nexthop flag NHA_FDB for nexthops
referenced by fdb entries. These nexthops only have ip.
This patch includes appropriate checks to avoid routes
referencing such nexthops.

example:
$ip nexthop add id 12 via 172.16.1.2 fdb
$ip nexthop add id 13 via 172.16.1.3 fdb
$ip nexthop add id 102 group 12/13 fdb

$bridge fdb add 02:02:00:00:00:13 dev vxlan1000 nhid 101 self

[1] E-VPN https://tools.ietf.org/html/rfc7432
[2] E-VPN VxLAN: https://tools.ietf.org/html/rfc8365
[3] LPC talk with mention of nexthop groups for L2 ecmp
http://vger.kernel.org/lpc_net2018_talks/scaling_bridge_fdb_database_slidesV3.pdf

v4 - fixed uninitialized variable reported by kernel test robot
Reported-by: kernel test robot <rong.a.chen@intel.com>

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-22 14:00:38 -07:00
Björn Töpel
26062b185e xsk: Explicitly inline functions and move definitions
In order to reduce the number of function calls, the struct
xsk_buff_pool definition is moved to xsk_buff_pool.h. The functions
xp_get_dma(), xp_dma_sync_for_cpu(), xp_dma_sync_for_device(),
xp_validate_desc() and various helper functions are explicitly
inlined.

Further, move xp_get_handle() and xp_release() to xsk.c, to allow for
the compiler to perform inlining.

rfc->v1: Make sure xp_validate_desc() is inlined for Tx perf. (Maxim)

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-15-bjorn.topel@gmail.com
2020-05-21 17:31:27 -07:00
Björn Töpel
0807892ecb xsk: Remove MEM_TYPE_ZERO_COPY and corresponding code
There are no users of MEM_TYPE_ZERO_COPY. Remove all corresponding
code, including the "handle" member of struct xdp_buff.

rfc->v1: Fixed spelling in commit message. (Björn)

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-13-bjorn.topel@gmail.com
2020-05-21 17:31:27 -07:00
Björn Töpel
2b43470add xsk: Introduce AF_XDP buffer allocation API
In order to simplify AF_XDP zero-copy enablement for NIC driver
developers, a new AF_XDP buffer allocation API is added. The
implementation is based on a single core (single producer/consumer)
buffer pool for the AF_XDP UMEM.

A buffer is allocated using the xsk_buff_alloc() function, and
returned using xsk_buff_free(). If a buffer is disassociated with the
pool, e.g. when a buffer is passed to an AF_XDP socket, a buffer is
said to be released. Currently, the release function is only used by
the AF_XDP internals and not visible to the driver.

Drivers using this API should register the XDP memory model with the
new MEM_TYPE_XSK_BUFF_POOL type.

The API is defined in net/xdp_sock_drv.h.

The buffer type is struct xdp_buff, and follows the lifetime of
regular xdp_buffs, i.e.  the lifetime of an xdp_buff is restricted to
a NAPI context. In other words, the API is not replacing xdp_frames.

In addition to introducing the API and implementations, the AF_XDP
core is migrated to use the new APIs.

rfc->v1: Fixed build errors/warnings for m68k and riscv. (kbuild test
         robot)
         Added headroom/chunk size getter. (Maxim/Björn)

v1->v2: Swapped SoBs. (Maxim)

v2->v3: Initialize struct xdp_buff member frame_sz. (Björn)
        Add API to query the DMA address of a frame. (Maxim)
        Do DMA sync for CPU till the end of the frame to handle
        possible growth (frame_sz). (Maxim)

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-6-bjorn.topel@gmail.com
2020-05-21 17:31:26 -07:00
Björn Töpel
89e4a376e3 xsk: Move defines only used by AF_XDP internals to xsk.h
Move the XSK_NEXT_PG_CONTIG_{MASK,SHIFT}, and
XDP_UMEM_USES_NEED_WAKEUP defines from xdp_sock.h to the AF_XDP
internal xsk.h file. Also, start using the BIT{,_ULL} macro instead of
explicit shifts.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-5-bjorn.topel@gmail.com
2020-05-21 17:31:26 -07:00
Magnus Karlsson
a71506a4fd xsk: Move driver interface to xdp_sock_drv.h
Move the AF_XDP zero-copy driver interface to its own include file
called xdp_sock_drv.h. This, hopefully, will make it more clear for
NIC driver implementors to know what functions to use for zero-copy
support.

v4->v5: Fix -Wmissing-prototypes by include header file. (Jakub)

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-4-bjorn.topel@gmail.com
2020-05-21 17:31:26 -07:00
Björn Töpel
d20a1676df xsk: Move xskmap.c to net/xdp/
The XSKMAP is partly implemented by net/xdp/xsk.c. Move xskmap.c from
kernel/bpf/ to net/xdp/, which is the logical place for AF_XDP related
code. Also, move AF_XDP struct definitions, and function declarations
only used by AF_XDP internals into net/xdp/xsk.h.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-3-bjorn.topel@gmail.com
2020-05-21 17:31:26 -07:00
Björn Töpel
44ac082b30 xsk: Fix xsk_umem_xdp_frame_sz()
Calculating the "data_hard_end" for an XDP buffer coming from AF_XDP
zero-copy mode, the return value of xsk_umem_xdp_frame_sz() is added
to "data_hard_start".

Currently, the chunk size of the UMEM is returned by
xsk_umem_xdp_frame_sz(). This is not correct, if the fixed UMEM
headroom is non-zero. Fix this by returning the chunk_size without the
UMEM headroom.

Fixes: 2a637c5b1a ("xdp: For Intel AF_XDP drivers add XDP frame_sz")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/20200520192103.355233-2-bjorn.topel@gmail.com
2020-05-21 17:31:26 -07:00
Sabrina Dubroca
41b4bd986f net: don't return invalid table id error when we fall back to PF_UNSPEC
In case we can't find a ->dumpit callback for the requested
(family,type) pair, we fall back to (PF_UNSPEC,type). In effect, we're
in the same situation as if userspace had requested a PF_UNSPEC
dump. For RTM_GETROUTE, that handler is rtnl_dump_all, which calls all
the registered RTM_GETROUTE handlers.

The requested table id may or may not exist for all of those
families. commit ae677bbb44 ("net: Don't return invalid table id
error when dumping all families") fixed the problem when userspace
explicitly requests a PF_UNSPEC dump, but missed the fallback case.

For example, when we pass ipv6.disable=1 to a kernel with
CONFIG_IP_MROUTE=y and CONFIG_IP_MROUTE_MULTIPLE_TABLES=y,
the (PF_INET6, RTM_GETROUTE) handler isn't registered, so we end up in
rtnl_dump_all, and listing IPv6 routes will unexpectedly print:

  # ip -6 r
  Error: ipv4: MR table does not exist.
  Dump terminated

commit ae677bbb44 introduced the dump_all_families variable, which
gets set when userspace requests a PF_UNSPEC dump. However, we can't
simply set the family to PF_UNSPEC in rtnetlink_rcv_msg in the
fallback case to get dump_all_families == true, because some messages
types (for example RTM_GETRULE and RTM_GETNEIGH) only register the
PF_UNSPEC handler and use the family to filter in the kernel what is
dumped to userspace. We would then export more entries, that userspace
would have to filter. iproute does that, but other programs may not.

Instead, this patch removes dump_all_families and updates the
RTM_GETROUTE handlers to check if the family that is being dumped is
their own. When it's not, which covers both the intentional PF_UNSPEC
dumps (as dump_all_families did) and the fallback case, ignore the
missing table id error.

Fixes: cb167893f4 ("net: Plumb support for filtering ipv4 and ipv6 multicast route dumps")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-21 17:25:50 -07:00
Al Viro
bbced07d99 get rid of compat_mc_setsockopt()
not used anymore

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:32 -04:00
Al Viro
d59eb177c8 ip6_mc_msfilter(): pass the address list separately
that way we'll be able to reuse it for compat case

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:29 -04:00
Al Viro
0dfe6581a7 get rid of compat_mc_getsockopt()
now we can do MCAST_MSFILTER in compat ->getsockopt() without
playing silly buggers with copying things back and forth.
We can form a native struct group_filter (sans the variable-length
tail) on stack, pass that + pointer to the tail of original request
to the helper doing the bulk of the work, then do the rest of
copyout - same as the native getsockopt() does.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:27 -04:00
Al Viro
931ca7ab7f ip*_mc_gsfget(): lift copyout of struct group_filter into callers
pass the userland pointer to the array in its tail, so that part
gets copied out by our functions; copyout of everything else is
done in the callers.  Rationale: reuse for compat; the array
is the same in native and compat, the layout of parts before it
is different for compat.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:27 -04:00
Al Viro
63287de66d lift compat definitions of mcast [sg]etsockopt requests into net/compat.h
We want to get rid of compat_mc_[sg]etsockopt() and to have that stuff
handled without compat_alloc_user_space(), extra copying through
userland, etc.  To do that we'll need ipv4 and ipv6 instances of
->compat_[sg]etsockopt() to manipulate the 32bit variants of mcast
requests, so we need to move the definitions of those out of net/compat.c
and into a public header.

This patch just does a mechanical move to include/net/compat.h

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2020-05-20 20:31:25 -04:00
Luiz Augusto von Dentz
3ca44c16b0 Bluetooth: Consolidate encryption handling in hci_encrypt_cfm
This makes hci_encrypt_cfm calls hci_connect_cfm in case the connection
state is BT_CONFIG so callers don't have to check the state.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-20 16:30:33 +02:00
Christoph Hellwig
607259a695 net: add a new ndo_tunnel_ioctl method
This method is used to properly allow kernel callers of the IPv4 route
management ioctls.  The exsting ip_tunnel_ioctl helper is renamed to
ip_tunnel_ctl to better reflect that it doesn't directly implement ioctls
touching user memory, and is used for the guts of ndo_tunnel_ctl
implementations. A new ip_tunnel_ioctl helper is added that can be wired
up directly to the ndo_do_ioctl method and takes care of the copy to and
from userspace.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 15:45:11 -07:00
Julian Wiedmann
e9a36ca5f6 net/af_iucv: clean up function prototypes
Remove a bunch of forward declarations (trivially shifting code around
where needed), and make a few functions static.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-19 12:50:14 -07:00
Christoph Hellwig
dc13c8761c ipv4,appletalk: move SIOCADDRT and SIOCDELRT handling into ->compat_ioctl
To prepare removing the global routing_ioctl hack start lifting the code
into the ipv4 and appletalk ->compat_ioctl handlers.  Unlike the existing
handler we don't bother copying in the name - there are no compat issues for
char arrays.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-18 17:35:02 -07:00
Christoph Hellwig
3986912f6a ipv6: move SIOCADDRT and SIOCDELRT handling into ->compat_ioctl
To prepare removing the global routing_ioctl hack start lifting the code
into a newly added ipv6 ->compat_ioctl handler.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-18 17:35:02 -07:00
Christoph Hellwig
7c1552da90 ipv6: lift copy_from_user out of ipv6_route_ioctl
Prepare for better compat ioctl handling by moving the user copy out
of ipv6_route_ioctl.

Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-18 17:35:02 -07:00
Roman Mashak
b15e62631c net sched: fix reporting the first-time use timestamp
When a new action is installed, firstuse field of 'tcf_t' is explicitly set
to 0. Value of zero means "new action, not yet used"; as a packet hits the
action, 'firstuse' is stamped with the current jiffies value.

tcf_tm_dump() should return 0 for firstuse if action has not yet been hit.

Fixes: 48d8ee1694 ("net sched actions: aggregate dumping of actions timeinfo")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Roman Mashak <mrv@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-18 17:32:19 -07:00
Gustavo A. R. Silva
631512f868 Bluetooth: L2CAP: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-18 09:57:41 +02:00
Christoph Paasch
a0c1d0eafd mptcp: Use 32-bit DATA_ACK when possible
RFC8684 allows to send 32-bit DATA_ACKs as long as the peer is not
sending 64-bit data-sequence numbers. The 64-bit DSN is only there for
extreme scenarios when a very high throughput subflow is combined with a
long-RTT subflow such that the high-throughput subflow wraps around the
32-bit sequence number space within an RTT of the high-RTT subflow.

It is thus a rare scenario and we should try to use the 32-bit DATA_ACK
instead as long as possible. It allows to reduce the TCP-option overhead
by 4 bytes, thus makes space for an additional SACK-block. It also makes
tcpdumps much easier to read when the DSN and DATA_ACK are both either
32 or 64-bit.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-16 13:51:10 -07:00
David S. Miller
da07f52d3c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Move the bpf verifier trace check into the new switch statement in
HEAD.

Resolve the overlapping changes in hinic, where bug fixes overlap
the addition of VF support.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 13:48:59 -07:00
Paolo Abeni
2f8a397d0a inet_connection_sock: factor out destroy helper.
Move the steps to prepare an inet_connection_sock for
forced disposal inside a separate helper. No functional
changes inteded, this will just simplify the next patch.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 12:30:13 -07:00
Paolo Abeni
90bf45134d mptcp: add new sock flag to deal with join subflows
MP_JOIN subflows must not land into the accept queue.
Currently tcp_check_req() calls an mptcp specific helper
to detect such scenario.

Such helper leverages the subflow context to check for
MP_JOIN subflows. We need to deal also with MP JOIN
failures, even when the subflow context is not available
due allocation failure.

A possible solution would be changing the syn_recv_sock()
signature to allow returning a more descriptive action/
error code and deal with that in tcp_check_req().

Since the above need is MPTCP specific, this patch instead
uses a TCP request socket hole to add a MPTCP specific flag.
Such flag is used by the MPTCP syn_recv_sock() to tell
tcp_check_req() how to deal with the request socket.

This change is a no-op for !MPTCP build, and makes the
MPTCP code simpler. It allows also the next patch to deal
correctly with MP JOIN failure.

v1 -> v2:
 - be more conservative on drop_req initialization (Mat)

RFC -> v1:
 - move the drop_req bit inside tcp_request_sock (Eric)

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Reviewed-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 12:30:13 -07:00
David S. Miller
3430223d39 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-05-15

The following pull-request contains BPF updates for your *net-next* tree.

We've added 37 non-merge commits during the last 1 day(s) which contain
a total of 67 files changed, 741 insertions(+), 252 deletions(-).

The main changes are:

1) bpf_xdp_adjust_tail() now allows to grow the tail as well, from Jesper.

2) bpftool can probe CONFIG_HZ, from Daniel.

3) CAP_BPF is introduced to isolate user processes that use BPF infra and
   to secure BPF networking services by dropping CAP_SYS_ADMIN requirement
   in certain cases, from Alexei.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 10:43:52 -07:00
Vlad Buslov
ca44b738e5 net: sched: implement terse dump support in act
Extend tcf_action_dump() with boolean argument 'terse' that is used to
request terse-mode action dump. In terse mode only essential data needed to
identify particular action (action kind, cookie, etc.) and its stats is put
to resulting skb and everything else is omitted. Implement
tcf_exts_terse_dump() helper in cls API that is intended to be used to
request terse dump of all exts (actions) attached to the filter.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 10:23:11 -07:00
Vlad Buslov
f8ab1807a9 net: sched: introduce terse dump flag
Add new TCA_DUMP_FLAGS attribute and use it in cls API to request terse
filter output from classifiers with TCA_DUMP_FLAGS_TERSE flag. This option
is intended to be used to improve performance of TC filter dump when
userland only needs to obtain stats and not the whole classifier/action
data. Extend struct tcf_proto_ops with new terse_dump() callback that must
be defined by supporting classifier implementations.

Support of the options in specific classifiers and actions is
implemented in following patches in the series.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-15 10:23:11 -07:00
Jesper Dangaard Brouer
2a637c5b1a xdp: For Intel AF_XDP drivers add XDP frame_sz
Intel drivers implement native AF_XDP zerocopy in separate C-files,
that have its own invocation of bpf_prog_run_xdp(). The setup of
xdp_buff is also handled in separately from normal code path.

This patch update XDP frame_sz for AF_XDP zerocopy drivers i40e, ice
and ixgbe, as the code changes needed are very similar.  Introduce a
helper function xsk_umem_xdp_frame_sz() for calculating frame size.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Cc: intel-wired-lan@lists.osuosl.org
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Link: https://lore.kernel.org/bpf/158945347511.97035.8536753731329475655.stgit@firesoul
2020-05-14 21:21:56 -07:00
Jesper Dangaard Brouer
34cc0b338a xdp: Xdp_frame add member frame_sz and handle in convert_to_xdp_frame
Use hole in struct xdp_frame, when adding member frame_sz, which keeps
same sizeof struct (32 bytes)

Drivers ixgbe and sfc had bug cases where the necessary/expected
tailroom was not reserved. This can lead to some hard to catch memory
corruption issues. Having the drivers frame_sz this can be detected when
packet length/end via xdp->data_end exceed the xdp_data_hard_end
pointer, which accounts for the reserved the tailroom.

When detecting this driver issue, simply fail the conversion with NULL,
which results in feedback to driver (failing xdp_do_redirect()) causing
driver to drop packet. Given the lack of consistent XDP stats, this can
be hard to troubleshoot. And given this is a driver bug, we want to
generate some more noise in form of a WARN stack dump (to ID the driver
code that inlined convert_to_xdp_frame).

Inlining the WARN macro is problematic, because it adds an asm
instruction (on Intel CPUs ud2) what influence instruction cache
prefetching. Thus, introduce xdp_warn and macro XDP_WARN, to avoid this
and at the same time make identifying the function and line of this
inlined function easier.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945337313.97035.10015729316710496600.stgit@firesoul
2020-05-14 21:21:54 -07:00
Jesper Dangaard Brouer
f95f0f95cf xdp: Add frame size to xdp_buff
XDP have evolved to support several frame sizes, but xdp_buff was not
updated with this information. The frame size (frame_sz) member of
xdp_buff is introduced to know the real size of the memory the frame is
delivered in.

When introducing this also make it clear that some tailroom is
reserved/required when creating SKBs using build_skb().

It would also have been an option to introduce a pointer to
data_hard_end (with reserved offset). The advantage with frame_sz is
that (like rxq) drivers only need to setup/assign this value once per
NAPI cycle. Due to XDP-generic (and some drivers) it's not possible to
store frame_sz inside xdp_rxq_info, because it's varies per packet as it
can be based/depend on packet length.

V2: nitpick: deduct -> deduce

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/158945334261.97035.555255657490688547.stgit@firesoul
2020-05-14 21:21:54 -07:00
David S. Miller
d00f26b623 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-05-14

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Merged tag 'perf-for-bpf-2020-05-06' from tip tree that includes CAP_PERFMON.

2) support for narrow loads in bpf_sock_addr progs and additional
   helpers in cg-skb progs, from Andrey.

3) bpf benchmark runner, from Andrii.

4) arm and riscv JIT optimizations, from Luke.

5) bpf iterator infrastructure, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14 20:31:21 -07:00
David S. Miller
1b54f4fa4d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Fix gcc-10 compilation warning in nf_conntrack, from Arnd Bergmann.

2) Add NF_FLOW_HW_PENDING to avoid races between stats and deletion
   commands, from Paul Blakey.

3) Remove WQ_MEM_RECLAIM from the offload workqueue, from Roi Dayan.

4) Infinite loop when removing nf_conntrack module, from Florian Westphal.

5) Set NF_FLOW_TEARDOWN bit on expiration to avoid races when refreshing
   the timeout from the software path.

6) Missing nft_set_elem_expired() check in the rbtree, from Phil Sutter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-14 13:15:02 -07:00
Yonghong Song
3c32cc1bce bpf: Enable bpf_iter targets registering ctx argument types
Commit b121b341e5 ("bpf: Add PTR_TO_BTF_ID_OR_NULL
support") adds a field btf_id_or_null_non0_off to
bpf_prog->aux structure to indicate that the
first ctx argument is PTR_TO_BTF_ID reg_type and
all others are PTR_TO_BTF_ID_OR_NULL.
This approach does not really scale if we have
other different reg types in the future, e.g.,
a pointer to a buffer.

This patch enables bpf_iter targets registering ctx argument
reg types which may be different from the default one.
For example, for pointers to structures, the default reg_type
is PTR_TO_BTF_ID for tracing program. The target can register
a particular pointer type as PTR_TO_BTF_ID_OR_NULL which can
be used by the verifier to enforce accesses.

Signed-off-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Link: https://lore.kernel.org/bpf/20200513180221.2949882-1-yhs@fb.com
2020-05-13 12:30:50 -07:00
David S. Miller
6cd35888a0 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2020-05-13

Here's a second attempt at a bluetooth-next pull request which
supercedes the one dated 2020-05-09. This should have the issues
discovered by Jakub fixed.

 - Add support for Intel Typhoon Peak device (8087:0032)
 - Add device tree bindings for Realtek RTL8723BS device
 - Add device tree bindings for Qualcomm QCA9377 device
 - Add support for experimental features configuration through mgmt
 - Add driver hook to prevent wake from suspend
 - Add support for waiting for L2CAP disconnection response
 - Multiple fixes & cleanups to the btbcm driver
 - Add support for LE scatternet topology for selected devices
 - A few other smaller fixes & cleanups

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-13 12:20:12 -07:00
Abhishek Pandit-Subedi
81dafad53c Bluetooth: Add hook for driver to prevent wake from suspend
Let drivers have a hook to disable configuring scanning during suspend.
Drivers should use the device_may_wakeup function call to determine
whether hci should be configured for wakeup.

For example, an implementation for btusb may look like the following:

  bool btusb_prevent_wake(struct hci_dev *hdev)
  {
        struct btusb_data *data = hci_get_drvdata(hdev);
        return !device_may_wakeup(&data->udev->dev);
  }

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-13 09:12:04 +02:00
Abhishek Pandit-Subedi
0d2c9825e4 Bluetooth: Rename BT_SUSPEND_COMPLETE
Renamed BT_SUSPEND_COMPLETE to BT_SUSPEND_CONFIGURE_WAKE since it sets
up the event filter and whitelist for wake-up.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-05-13 09:12:04 +02:00
Russell King
54a0ed0df4 net: dsa: provide an option for drivers to always receive bridge VLANs
DSA assumes that a bridge which has vlan filtering disabled is not
vlan aware, and ignores all vlan configuration. However, the kernel
software bridge code allows configuration in this state.

This causes the kernel's idea of the bridge vlan state and the
hardware state to disagree, so "bridge vlan show" indicates a correct
configuration but the hardware lacks all configuration. Even worse,
enabling vlan filtering on a DSA bridge immediately blocks all traffic
which, given the output of "bridge vlan show", is very confusing.

Provide an option that drivers can set to indicate they want to receive
vlan configuration even when vlan filtering is disabled. At the very
least, this is safe for Marvell DSA bridges, which do not look up
ingress traffic in the VTU if the port is in 8021Q disabled state. It is
also safe for the Ocelot switch family. Whether this change is suitable
for all DSA bridges is not known.

Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-12 13:08:07 -07:00
Eric Dumazet
24adbc1676 tcp: fix SO_RCVLOWAT hangs with fat skbs
We autotune rcvbuf whenever SO_RCVLOWAT is set to account for 100%
overhead in tcp_set_rcvlowat()

This works well when skb->len/skb->truesize ratio is bigger than 0.5

But if we receive packets with small MSS, we can end up in a situation
where not enough bytes are available in the receive queue to satisfy
RCVLOWAT setting.
As our sk_rcvbuf limit is hit, we send zero windows in ACK packets,
preventing remote peer from sending more data.

Even autotuning does not help, because it only triggers at the time
user process drains the queue. If no EPOLLIN is generated, this
can not happen.

Note poll() has a similar issue, after commit
c7004482e8 ("tcp: Respect SO_RCVLOWAT in tcp_poll().")

Fixes: 03f45c883c ("tcp: avoid extra wakeups for SO_RCVLOWAT users")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-12 12:49:47 -07:00
Gustavo A. R. Silva
0fa39d6dd0 ipv6: Replace zero-length array with flexible-array
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

sizeof(flexible-array-member) triggers a warning because flexible array
members have incomplete type[1]. There are some instances of code in
which the sizeof operator is being incorrectly/erroneously applied to
zero-length arrays and the result is zero. Such instances may be hiding
some bugs. So, this work (flexible-array member conversions) will also
help to get completely rid of those sorts of issues.

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-11 13:18:54 -07:00
David Howells
c410bf0193 rxrpc: Fix the excessive initial retransmission timeout
rxrpc currently uses a fixed 4s retransmission timeout until the RTT is
sufficiently sampled.  This can cause problems with some fileservers with
calls to the cache manager in the afs filesystem being dropped from the
fileserver because a packet goes missing and the retransmission timeout is
greater than the call expiry timeout.

Fix this by:

 (1) Copying the RTT/RTO calculation code from Linux's TCP implementation
     and altering it to fit rxrpc.

 (2) Altering the various users of the RTT to make use of the new SRTT
     value.

 (3) Replacing the use of rxrpc_resend_timeout to use the calculated RTO
     value instead (which is needed in jiffies), along with a backoff.

Notes:

 (1) rxrpc provides RTT samples by matching the serial numbers on outgoing
     DATA packets that have the RXRPC_REQUEST_ACK set and PING ACK packets
     against the reference serial number in incoming REQUESTED ACK and
     PING-RESPONSE ACK packets.

 (2) Each packet that is transmitted on an rxrpc connection gets a new
     per-connection serial number, even for retransmissions, so an ACK can
     be cross-referenced to a specific trigger packet.  This allows RTT
     information to be drawn from retransmitted DATA packets also.

 (3) rxrpc maintains the RTT/RTO state on the rxrpc_peer record rather than
     on an rxrpc_call because many RPC calls won't live long enough to
     generate more than one sample.

 (4) The calculated SRTT value is in units of 8ths of a microsecond rather
     than nanoseconds.

The (S)RTT and RTO values are displayed in /proc/net/rxrpc/peers.

Fixes: 17926a7932 ([AF_RXRPC]: Provide secure RxRPC sockets for use by userspace and kernel both"")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-05-11 16:42:28 +01:00
Paul Blakey
2c8897953f netfilter: flowtable: Add pending bit for offload work
Gc step can queue offloaded flow del work or stats work.
Those work items can race each other and a flow could be freed
before the stats work is executed and querying it.
To avoid that, add a pending bit that if a work exists for a flow
don't queue another work for it.
This will also avoid adding multiple stats works in case stats work
didn't complete but gc step started again.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-11 16:26:33 +02:00
Marcel Holtmann
e625e50cee Bluetooth: Introduce debug feature when dynamic debug is disabled
In case dynamic debug is disabled, this feature allows a vendor platform
to provide debug statement printing.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-05-11 12:16:27 +02:00
Marcel Holtmann
a10c907ce0 Bluetooth: Add support for experimental features configuration
To enable platform specific experimental features, introduce this new set of
management commands and events.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-05-11 12:13:38 +02:00
Marcel Holtmann
d5cc6626b3 Bluetooth: Introduce HCI_MGMT_HDEV_OPTIONAL option
When setting HCI_MGMT_HDEV_OPTIONAL it is possible to target a specific
conntroller or a global interface.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-05-11 12:13:38 +02:00
Marcel Holtmann
14a81bf021 Bluetooth: replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language extension
to the C90 standard, but the preferred mechanism to declare variable-length
types such as these ones is a flexible array member.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-05-11 12:13:38 +02:00
Vladimir Oltean
3b7bc1f091 net: dsa: introduce a dsa_switch_find function
Somewhat similar to dsa_tree_find, dsa_switch_find returns a dsa_switch
structure pointer by searching for its tree index and switch index (the
parameters from dsa,member). To be used, for example, by drivers who
implement .crosschip_bridge_join and need a reference to the other
switch indicated to by the tree_index and sw_index arguments.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-10 19:52:33 -07:00
Vladimir Oltean
f66a6a69f9 net: dsa: permit cross-chip bridging between all trees in the system
One way of utilizing DSA is by cascading switches which do not all have
compatible taggers. Consider the following real-life topology:

      +---------------------------------------------------------------+
      | LS1028A                                                       |
      |               +------------------------------+                |
      |               |      DSA master for Felix    |                |
      |               |(internal ENETC port 2: eno2))|                |
      |  +------------+------------------------------+-------------+  |
      |  | Felix embedded L2 switch                                |  |
      |  |                                                         |  |
      |  | +--------------+   +--------------+   +--------------+  |  |
      |  | |DSA master for|   |DSA master for|   |DSA master for|  |  |
      |  | |  SJA1105 1   |   |  SJA1105 2   |   |  SJA1105 3   |  |  |
      |  | |(Felix port 1)|   |(Felix port 2)|   |(Felix port 3)|  |  |
      +--+-+--------------+---+--------------+---+--------------+--+--+

+-----------------------+ +-----------------------+ +-----------------------+
|   SJA1105 switch 1    | |   SJA1105 switch 2    | |   SJA1105 switch 3    |
+-----+-----+-----+-----+ +-----+-----+-----+-----+ +-----+-----+-----+-----+
|sw1p0|sw1p1|sw1p2|sw1p3| |sw2p0|sw2p1|sw2p2|sw2p3| |sw3p0|sw3p1|sw3p2|sw3p3|
+-----+-----+-----+-----+ +-----+-----+-----+-----+ +-----+-----+-----+-----+

The above can be described in the device tree as follows (obviously not
complete):

mscc_felix {
	dsa,member = <0 0>;
	ports {
		port@4 {
			ethernet = <&enetc_port2>;
		};
	};
};

sja1105_switch1 {
	dsa,member = <1 1>;
	ports {
		port@4 {
			ethernet = <&mscc_felix_port1>;
		};
	};
};

sja1105_switch2 {
	dsa,member = <2 2>;
	ports {
		port@4 {
			ethernet = <&mscc_felix_port2>;
		};
	};
};

sja1105_switch3 {
	dsa,member = <3 3>;
	ports {
		port@4 {
			ethernet = <&mscc_felix_port3>;
		};
	};
};

Basically we instantiate one DSA switch tree for every hardware switch
in the system, but we still give them globally unique switch IDs (will
come back to that later). Having 3 disjoint switch trees makes the
tagger drivers "just work", because net devices are registered for the
3 Felix DSA master ports, and they are also DSA slave ports to the ENETC
port. So packets received on the ENETC port are stripped of their
stacked DSA tags one by one.

Currently, hardware bridging between ports on the same sja1105 chip is
possible, but switching between sja1105 ports on different chips is
handled by the software bridge. This is fine, but we can do better.

In fact, the dsa_8021q tag used by sja1105 is compatible with cascading.
In other words, a sja1105 switch can correctly parse and route a packet
containing a dsa_8021q tag. So if we could enable hardware bridging on
the Felix DSA master ports, cross-chip bridging could be completely
offloaded.

Such as system would be used as follows:

ip link add dev br0 type bridge && ip link set dev br0 up
for port in sw0p0 sw0p1 sw0p2 sw0p3 \
	    sw1p0 sw1p1 sw1p2 sw1p3 \
	    sw2p0 sw2p1 sw2p2 sw2p3; do
	ip link set dev $port master br0
done

The above makes switching between ports on the same row be performed in
hardware, and between ports on different rows in software. Now assume
the Felix switch ports are called swp0, swp1, swp2. By running the
following extra commands:

ip link add dev br1 type bridge && ip link set dev br1 up
for port in swp0 swp1 swp2; do
	ip link set dev $port master br1
done

the CPU no longer sees packets which traverse sja1105 switch boundaries
and can be forwarded directly by Felix. The br1 bridge would not be used
for any sort of traffic termination.

For this to work, we need to give drivers an opportunity to listen for
bridging events on DSA trees other than their own, and pass that other
tree index as argument. I have made the assumption, for the moment, that
the other existing DSA notifiers don't need to be broadcast to other
trees. That assumption might turn out to be incorrect. But in the
meantime, introduce a dsa_broadcast function, similar in purpose to
dsa_port_notify, which is used only by the bridging notifiers.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-10 19:52:33 -07:00
Vladimir Oltean
9eb8eff0cf net: bridge: allow enslaving some DSA master network devices
Commit 8db0a2ee2c ("net: bridge: reject DSA-enabled master netdevices
as bridge members") added a special check in br_if.c in order to check
for a DSA master network device with a tagging protocol configured. This
was done because back then, such devices, once enslaved in a bridge
would become inoperative and would not pass DSA tagged traffic anymore
due to br_handle_frame returning RX_HANDLER_CONSUMED.

But right now we have valid use cases which do require bridging of DSA
masters. One such example is when the DSA master ports are DSA switch
ports themselves (in a disjoint tree setup). This should be completely
equivalent, functionally speaking, from having multiple DSA switches
hanging off of the ports of a switchdev driver. So we should allow the
enslaving of DSA tagged master network devices.

Instead of the regular br_handle_frame(), install a new function
br_handle_frame_dummy() on these DSA masters, which returns
RX_HANDLER_PASS in order to call into the DSA specific tagging protocol
handlers, and lift the restriction from br_add_if.

Suggested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Tested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-10 19:52:33 -07:00
Arnd Bergmann
2c407aca64 netfilter: conntrack: avoid gcc-10 zero-length-bounds warning
gcc-10 warns around a suspicious access to an empty struct member:

net/netfilter/nf_conntrack_core.c: In function '__nf_conntrack_alloc':
net/netfilter/nf_conntrack_core.c:1522:9: warning: array subscript 0 is outside the bounds of an interior zero-length array 'u8[0]' {aka 'unsigned char[0]'} [-Wzero-length-bounds]
 1522 |  memset(&ct->__nfct_init_offset[0], 0,
      |         ^~~~~~~~~~~~~~~~~~~~~~~~~~
In file included from net/netfilter/nf_conntrack_core.c:37:
include/net/netfilter/nf_conntrack.h:90:5: note: while referencing '__nfct_init_offset'
   90 |  u8 __nfct_init_offset[0];
      |     ^~~~~~~~~~~~~~~~~~

The code is correct but a bit unusual. Rework it slightly in a way that
does not trigger the warning, using an empty struct instead of an empty
array. There are probably more elegant ways to do this, but this is the
smallest change.

Fixes: c41884ce05 ("netfilter: conntrack: avoid zeroing timer")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-05-10 23:20:24 +02:00
Saeed Mahameed
76cd622fe2 Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
This merge includes updates to bonding driver needed for the rdma stack,
to avoid conflicts with the RDMA branch.

Maor Gottlieb Says:

====================
Bonding: Add support to get xmit slave

The following series adds support to get the LAG master xmit slave by
introducing new .ndo - ndo_get_xmit_slave. Every LAG module can
implement it and it first implemented in the bond driver.
This is follow-up to the RFC discussion [1].

The main motivation for doing this is for drivers that offload part
of the LAG functionality. For example, Mellanox Connect-X hardware
implements RoCE LAG which selects the TX affinity when the resources
are created and port is remapped when it goes down.

The first part of this patchset introduces the new .ndo and add the
support to the bonding module.

The second part adds support to get the RoCE LAG xmit slave by building
skb of the RoCE packet based on the AH attributes and call to the new
.ndo.

The third part change the mlx5 driver driver to set the QP's affinity
port according to the slave which found by the .ndo.
====================

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-09 01:05:30 -07:00
Eric Dumazet
cf86a086a1 net/dst: use a smaller percpu_counter batch for dst entries accounting
percpu_counter_add() uses a default batch size which is quite big
on platforms with 256 cpus. (2*256 -> 512)

This means dst_entries_get_fast() can be off by +/- 2*(nr_cpus^2)
(131072 on servers with 256 cpus)

Reduce the batch size to something more reasonable, and
add logic to ip6_dst_gc() to call dst_entries_get_slow()
before calling the _very_ expensive fib6_run_gc() function.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2020-05-08 21:33:33 -07:00
Stanislav Fomichev
8086fbaf49 bpf: Allow any port in bpf_bind helper
We want to have a tighter control on what ports we bind to in
the BPF_CGROUP_INET{4,6}_CONNECT hooks even if it means
connect() becomes slightly more expensive. The expensive part
comes from the fact that we now need to call inet_csk_get_port()
that verifies that the port is not used and allocates an entry
in the hash table for it.

Since we can't rely on "snum || !bind_address_no_port" to prevent
us from calling POST_BIND hook anymore, let's add another bind flag
to indicate that the call site is BPF program.

v5:
* fix wrong AF_INET (should be AF_INET6) in the bpf program for v6

v3:
* More bpf_bind documentation refinements (Martin KaFai Lau)
* Add UDP tests as well (Martin KaFai Lau)
* Don't start the thread, just do socket+bind+listen (Martin KaFai Lau)

v2:
* Update documentation (Andrey Ignatov)
* Pass BIND_FORCE_ADDRESS_NO_PORT conditionally (Andrey Ignatov)

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200508174611.228805-5-sdf@google.com
2020-05-09 00:48:20 +02:00
Stanislav Fomichev
cb0721c7e2 net: Refactor arguments of inet{,6}_bind
The intent is to add an additional bind parameter in the next commit.
Instead of adding another argument, let's convert all existing
flag arguments into an extendable bit field.

No functional changes.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Andrey Ignatov <rdna@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200508174611.228805-4-sdf@google.com
2020-05-09 00:48:20 +02:00
Eric Biggers
2aaba014b5 crypto: lib/sha1 - remove unnecessary includes of linux/cryptohash.h
<linux/cryptohash.h> sounds very generic and important, like it's the
header to include if you're doing cryptographic hashing in the kernel.
But actually it only includes the library implementation of the SHA-1
compression function (not even the full SHA-1).  This should basically
never be used anymore; SHA-1 is no longer considered secure, and there
are much better ways to do cryptographic hashing in the kernel.

Most files that include this header don't actually need it.  So in
preparation for removing it, remove all these unneeded includes of it.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2020-05-08 15:32:17 +10:00
Eric Dumazet
ae46f184bc bonding: propagate transmit status
Currently, bonding always returns NETDEV_TX_OK to its caller.

It is worth trying to be more accurate : TCP for instance
can have different recovery strategies if it can have more
precise status, if packet was dropped by slave qdisc.

This is especially important when host is under stress.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 18:11:07 -07:00
Eric Dumazet
f78ed2204d netpoll: accept NULL np argument in netpoll_send_skb()
netpoll_send_skb() callers seem to leak skb if
the np pointer is NULL. While this should not happen, we
can make the code more robust.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 18:11:07 -07:00
Vladimir Oltean
e1eea81120 net: dsa: introduce a dsa_port_from_netdev public helper
As its implementation shows, this is synonimous with calling
dsa_slave_dev_check followed by dsa_slave_to_port, so it is quite simple
already and provides functionality which is already there.

However there is now a need for these functions outside dsa_priv.h, for
example in drivers that perform mirroring and redirection through
tc-flower offloads (they are given raw access to the flow_cls_offload
structure), where they need to call this function on act->dev.

But simply exporting dsa_slave_to_port would make it non-inline and
would result in an extra function call in the hotpath, as can be seen
for example in sja1105:

Before:

000006dc <sja1105_xmit>:
{
 6dc:	e92d4ff0 	push	{r4, r5, r6, r7, r8, r9, sl, fp, lr}
 6e0:	e1a04000 	mov	r4, r0
 6e4:	e591958c 	ldr	r9, [r1, #1420]	; 0x58c <- Inline dsa_slave_to_port
 6e8:	e1a05001 	mov	r5, r1
 6ec:	e24dd004 	sub	sp, sp, #4
	u16 tx_vid = dsa_8021q_tx_vid(dp->ds, dp->index);
 6f0:	e1c901d8 	ldrd	r0, [r9, #24]
 6f4:	ebfffffe 	bl	0 <dsa_8021q_tx_vid>
			6f4: R_ARM_CALL	dsa_8021q_tx_vid
	u8 pcp = netdev_txq_to_tc(netdev, queue_mapping);
 6f8:	e1d416b0 	ldrh	r1, [r4, #96]	; 0x60
	u16 tx_vid = dsa_8021q_tx_vid(dp->ds, dp->index);
 6fc:	e1a08000 	mov	r8, r0

After:

000006e4 <sja1105_xmit>:
{
 6e4:	e92d4ff0 	push	{r4, r5, r6, r7, r8, r9, sl, fp, lr}
 6e8:	e1a04000 	mov	r4, r0
 6ec:	e24dd004 	sub	sp, sp, #4
	struct dsa_port *dp = dsa_slave_to_port(netdev);
 6f0:	e1a00001 	mov	r0, r1
{
 6f4:	e1a05001 	mov	r5, r1
	struct dsa_port *dp = dsa_slave_to_port(netdev);
 6f8:	ebfffffe 	bl	0 <dsa_slave_to_port>
			6f8: R_ARM_CALL	dsa_slave_to_port
 6fc:	e1a09000 	mov	r9, r0
	u16 tx_vid = dsa_8021q_tx_vid(dp->ds, dp->index);
 700:	e1c001d8 	ldrd	r0, [r0, #24]
 704:	ebfffffe 	bl	0 <dsa_8021q_tx_vid>
			704: R_ARM_CALL	dsa_8021q_tx_vid

Because we want to avoid possible performance regressions, introduce
this new function which is designed to be public.

Suggested-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 17:31:57 -07:00
Arnd Bergmann
ee28755668 net: bareudp: avoid uninitialized variable warning
clang points out that building without IPv6 would lead to returning
an uninitialized variable if a packet with family!=AF_INET is
passed into bareudp_udp_encap_recv():

drivers/net/bareudp.c:139:6: error: variable 'err' is used uninitialized whenever 'if' condition is false [-Werror,-Wsometimes-uninitialized]
        if (family == AF_INET)
            ^~~~~~~~~~~~~~~~~
drivers/net/bareudp.c:146:15: note: uninitialized use occurs here
        if (unlikely(err)) {
                     ^~~
include/linux/compiler.h:78:42: note: expanded from macro 'unlikely'
 # define unlikely(x)    __builtin_expect(!!(x), 0)
                                            ^
drivers/net/bareudp.c:139:2: note: remove the 'if' if its condition is always true
        if (family == AF_INET)
        ^~~~~~~~~~~~~~~~~~~~~~

This cannot happen in practice, so change the condition in a way that
gcc sees the IPv4 case as unconditionally true here.
For consistency, change all the similar constructs in this file the
same way, using "if(IS_ENABLED())" instead of #if IS_ENABLED()".

Fixes: 571912c69f ("net: UDP tunnel encapsulation module for tunnelling different protocols like MPLS, IP, NSH etc.")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 17:28:18 -07:00
Maciej Żenczykowski
64082b67ba net: remove spurious declaration of tcp_default_init_rwnd()
it doesn't actually exist...

Test: builds and 'git grep tcp_default_init_rwnd' comes up empty
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-07 07:58:55 -07:00
David S. Miller
3793faad7b Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflicts were all overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 22:10:13 -07:00
Pablo Neira Ayuso
16f8036086 net: flow_offload: skip hw stats check for FLOW_ACTION_HW_STATS_DONT_CARE
This patch adds FLOW_ACTION_HW_STATS_DONT_CARE which tells the driver
that the frontend does not need counters, this hw stats type request
never fails. The FLOW_ACTION_HW_STATS_DISABLED type explicitly requests
the driver to disable the stats, however, if the driver cannot disable
counters, it bails out.

TCA_ACT_HW_STATS_* maintains the 1:1 mapping with FLOW_ACTION_HW_STATS_*
except by disabled which is mapped to FLOW_ACTION_HW_STATS_DISABLED
(this is 0 in tc). Add tc_act_hw_stats() to perform the mapping between
TCA_ACT_HW_STATS_* and FLOW_ACTION_HW_STATS_*.

Fixes: 319a1d1947 ("flow_offload: check for basic action hw stats type")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 20:13:10 -07:00
Eric Dumazet
8dc242ad66 tcp: refine tcp_pacing_delay() for very low pacing rates
With the addition of horizon feature to sch_fq, we noticed some
suboptimal behavior of extremely low pacing rate TCP flows, especially
when TCP is not aware of a drop happening in lower stacks.

Back in commit 3f80e08f40 ("tcp: add tcp_reset_xmit_timer() helper"),
tcp_pacing_delay() was added to estimate an extra delay to add to standard
rto timers.

This patch removes the skb argument from this helper and
tcp_reset_xmit_timer() because it makes more sense to simply
consider the time at which next packet is allowed to be sent,
instead of the time of whatever packet has been sent.

This avoids arming RTO timer too soon and removes
spurious horizon drops.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 17:29:38 -07:00
Fernando Gont
969c54646a ipv6: Implement draft-ietf-6man-rfc4941bis
Implement the upcoming rev of RFC4941 (IPv6 temporary addresses):
https://tools.ietf.org/html/draft-ietf-6man-rfc4941bis-09

* Reduces the default Valid Lifetime to 2 days
  The number of extra addresses employed when Valid Lifetime was
  7 days exacerbated the stress caused on network
  elements/devices. Additionally, the motivation for temporary
  addresses is indeed privacy and reduced exposure. With a
  default Valid Lifetime of 7 days, an address that becomes
  revealed by active communication is reachable and exposed for
  one whole week. The only use case for a Valid Lifetime of 7
  days could be some application that is expecting to have long
  lived connections. But if you want to have a long lived
  connections, you shouldn't be using a temporary address in the
  first place. Additionally, in the era of mobile devices, general
  applications should nevertheless be prepared and robust to
  address changes (e.g. nodes swap wifi <-> 4G, etc.)

* Employs different IIDs for different prefixes
  To avoid network activity correlation among addresses configured
  for different prefixes

* Uses a simpler algorithm for IID generation
  No need to store "history" anywhere

Signed-off-by: Fernando Gont <fgont@si6networks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-06 17:00:02 -07:00
Florian Westphal
2ab6096db2 xfrm: remove output_finish indirection from xfrm_state_afinfo
There are only two implementaions, one for ipv4 and one for ipv6.

Both are almost identical, they clear skb->cb[], set the TRANSFORMED flag
in IP(6)CB and then call the common xfrm_output() function.

By placing the IPCB handling into the common function, we avoid the need
for the output_finish indirection as the output functions can simply
use xfrm_output().

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-06 09:40:08 +02:00
Florian Westphal
f3075f48dd xfrm: remove extract_output indirection from xfrm_state_afinfo
Move this to xfrm_output.c.  This avoids the state->extract_output
indirection.

This patch also removes the duplicated __xfrm6_extract_header helper
added in an earlier patch, we can now use the one from xfrm_inout.h .

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-06 09:40:08 +02:00
Florian Westphal
ede64dd2bf xfrm: place xfrm6_local_dontfrag in xfrm.h
so next patch can re-use it from net/xfrm/xfrm_output.c without
causing a linker error when IPV6 is a module.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-06 09:40:08 +02:00
Florian Westphal
3e50ddd8b8 xfrm: expose local_rxpmtu via ipv6_stubs
We cannot call this function from the core kernel unless we would force
CONFIG_IPV6=y.

Therefore expose this via ipv6_stubs so we can call it from net/xfrm
in the followup patch.

Since the call is expected to be unlikely, no extra code for the IPV6=y
case is added and we will always eat the indirection cost.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-06 09:40:08 +02:00
Florian Westphal
171916cbd5 xfrm: move xfrm4_extract_header to common helper
The function only initializes the XFRM CB in the skb.

After previous patch xfrm4_extract_header is only called from
net/xfrm/xfrm_{input,output}.c.

Because of IPV6=m linker errors the ipv6 equivalent
(xfrm6_extract_header) was already placed in xfrm_inout.h because
we can't call functions residing in a module from the core.

So do the same for the ipv4 helper and place it next to the ipv6 one.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-06 09:40:08 +02:00
Florian Westphal
a269fbfc4e xfrm: state: remove extract_input indirection from xfrm_state_afinfo
In order to keep CONFIG_IPV6=m working, xfrm6_extract_header needs to be
duplicated.  It will be removed again in a followup change when the
remaining caller is moved to net/xfrm as well.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-06 09:40:08 +02:00
Florian Westphal
6d64be3da2 xfrm: avoid extract_output indirection for ipv4
We can use a direct call for ipv4, so move the needed functions
to net/xfrm/xfrm_output.c and call them directly.

For ipv6 the indirection can be avoided as well but it will need
a bit more work -- to ease review it will be done in another patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-05-06 09:40:08 +02:00
William Tu
f989d546a2 erspan: Add type I version 0 support.
The Type I ERSPAN frame format is based on the barebones
IP + GRE(4-byte) encapsulation on top of the raw mirrored frame.
Both type I and II use 0x88BE as protocol type. Unlike type II
and III, no sequence number or key is required.
To creat a type I erspan tunnel device:
  $ ip link add dev erspan11 type erspan \
            local 172.16.1.100 remote 172.16.1.200 \
            erspan_ver 0

Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-05 13:23:29 -07:00
Magnus Karlsson
07bf2d97d1 xsk: Remove unnecessary member in xdp_umem
Remove the unnecessary member of address in struct xdp_umem as it is
only used during the umem registration. No need to carry this around
as it is not used during run-time nor when unregistering the umem.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/bpf/1588599232-24897-3-git-send-email-magnus.karlsson@intel.com
2020-05-04 22:56:26 +02:00
Magnus Karlsson
e4e5aefc11 xsk: Change two variable names for increased clarity
Change two variables names so that it is clearer what they
represent. The first one is xsk_list that in fact only contains the
list of AF_XDP sockets with a Tx component. Change this to xsk_tx_list
for improved clarity. The second variable is size in the ring
structure. One might think that this is the size of the ring, but it
is in fact the size of the umem, copied into the ring structure to
improve performance. Rename this variable umem_size to avoid any
confusion.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Link: https://lore.kernel.org/bpf/1588599232-24897-2-git-send-email-magnus.karlsson@intel.com
2020-05-04 22:56:26 +02:00
Cong Wang
e7511f560f bonding: remove useless stats_lock_key
After commit b3e80d44f5
("bonding: fix lockdep warning in bond_get_stats()") the dynamic
key is no longer necessary, as we compute nest level at run-time.
So, we can just remove it to save some lockdep key entries.

Test commands:
 ip link add bond0 type bond
 ip link add bond1 type bond
 ip link set bond0 master bond1
 ip link set bond0 nomaster
 ip link set bond1 master bond0

Reported-and-tested-by: syzbot+aaa6fa4949cc5d9b7b25@syzkaller.appspotmail.com
Cc: Dmitry Vyukov <dvyukov@google.com>
Acked-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 12:05:56 -07:00
Cong Wang
a7df4870d7 net_sched: fix tcm_parent in tc filter dump
When we tell kernel to dump filters from root (ffff:ffff),
those filters on ingress (ffff:0000) are matched, but their
true parents must be dumped as they are. However, kernel
dumps just whatever we tell it, that is either ffff:ffff
or ffff:0000:

 $ nl-cls-list --dev=dummy0 --parent=root
 cls basic dev dummy0 id none parent root prio 49152 protocol ip match-all
 cls basic dev dummy0 id :1 parent root prio 49152 protocol ip match-all
 $ nl-cls-list --dev=dummy0 --parent=ffff:
 cls basic dev dummy0 id none parent ffff: prio 49152 protocol ip match-all
 cls basic dev dummy0 id :1 parent ffff: prio 49152 protocol ip match-all

This is confusing and misleading, more importantly this is
a regression since 4.15, so the old behavior must be restored.

And, when tc filters are installed on a tc class, the parent
should be the classid, rather than the qdisc handle. Commit
edf6711c98 ("net: sched: remove classid and q fields from tcf_proto")
removed the classid we save for filters, we can just restore
this classid in tcf_block.

Steps to reproduce this:
 ip li set dev dummy0 up
 tc qd add dev dummy0 ingress
 tc filter add dev dummy0 parent ffff: protocol arp basic action pass
 tc filter show dev dummy0 root

Before this patch:
 filter protocol arp pref 49152 basic
 filter protocol arp pref 49152 basic handle 0x1
	action order 1: gact action pass
	 random type none pass val 0
	 index 1 ref 1 bind 1

After this patch:
 filter parent ffff: protocol arp pref 49152 basic
 filter parent ffff: protocol arp pref 49152 basic handle 0x1
 	action order 1: gact action pass
 	 random type none pass val 0
	 index 1 ref 1 bind 1

Fixes: a10fa20101 ("net: sched: propagate q and parent from caller down to tcf_fill_node")
Fixes: edf6711c98 ("net: sched: remove classid and q fields from tcf_proto")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-04 11:53:33 -07:00
Po Liu
d29bdd69ec net: schedule: add action gate offloading
Add the gate action to the flow action entry. Add the gate parameters to
the tc_setup_flow_action() queueing to the entries of flow_action_entry
array provide to the driver.

Signed-off-by: Po Liu <Po.Liu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:08:19 -07:00
Po Liu
a51c328df3 net: qos: introduce a gate control flow action
Introduce a ingress frame gate control flow action.
Tc gate action does the work like this:
Assume there is a gate allow specified ingress frames can be passed at
specific time slot, and be dropped at specific time slot. Tc filter
chooses the ingress frames, and tc gate action would specify what slot
does these frames can be passed to device and what time slot would be
dropped.
Tc gate action would provide an entry list to tell how much time gate
keep open and how much time gate keep state close. Gate action also
assign a start time to tell when the entry list start. Then driver would
repeat the gate entry list cyclically.
For the software simulation, gate action requires the user assign a time
clock type.

Below is the setting example in user space. Tc filter a stream source ip
address is 192.168.0.20 and gate action own two time slots. One is last
200ms gate open let frame pass another is last 100ms gate close let
frames dropped. When the ingress frames have reach total frames over
8000000 bytes, the excessive frames will be dropped in that 200000000ns
time slot.

> tc qdisc add dev eth0 ingress

> tc filter add dev eth0 parent ffff: protocol ip \
	   flower src_ip 192.168.0.20 \
	   action gate index 2 clockid CLOCK_TAI \
	   sched-entry open 200000000 -1 8000000 \
	   sched-entry close 100000000 -1 -1

> tc chain del dev eth0 ingress chain 0

"sched-entry" follow the name taprio style. Gate state is
"open"/"close". Follow with period nanosecond. Then next item is internal
priority value means which ingress queue should put. "-1" means
wildcard. The last value optional specifies the maximum number of
MSDU octets that are permitted to pass the gate during the specified
time interval.
Base-time is not set will be 0 as default, as result start time would
be ((N + 1) * cycletime) which is the minimal of future time.

Below example shows filtering a stream with destination mac address is
10:00:80:00:00:00 and ip type is ICMP, follow the action gate. The gate
action would run with one close time slot which means always keep close.
The time cycle is total 200000000ns. The base-time would calculate by:

 1357000000000 + (N + 1) * cycletime

When the total value is the future time, it will be the start time.
The cycletime here would be 200000000ns for this case.

> tc filter add dev eth0 parent ffff:  protocol ip \
	   flower skip_hw ip_proto icmp dst_mac 10:00:80:00:00:00 \
	   action gate index 12 base-time 1357000000000 \
	   sched-entry close 200000000 -1 -1 \
	   clockid CLOCK_TAI

Signed-off-by: Po Liu <Po.Liu@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 16:08:19 -07:00
Cambda Zhu
f0628c524f net: Replace the limit of TCP_LINGER2 with TCP_FIN_TIMEOUT_MAX
This patch changes the behavior of TCP_LINGER2 about its limit. The
sysctl_tcp_fin_timeout used to be the limit of TCP_LINGER2 but now it's
only the default value. A new macro named TCP_FIN_TIMEOUT_MAX is added
as the limit of TCP_LINGER2, which is 2 minutes.

Since TCP_LINGER2 used sysctl_tcp_fin_timeout as the default value
and the limit in the past, the system administrator cannot set the
default value for most of sockets and let some sockets have a greater
timeout. It might be a mistake that let the sysctl to be the limit of
the TCP_LINGER2. Maybe we can add a new sysctl to set the max of
TCP_LINGER2, but FIN-WAIT-2 timeout is usually no need to be too long
and 2 minutes are legal considering TCP specs.

Changes in v3:
- Remove the new socket option and change the TCP_LINGER2 behavior so
  that the timeout can be set to value between sysctl_tcp_fin_timeout
  and 2 minutes.

Changes in v2:
- Add int overflow check for the new socket option.

Changes in v1:
- Add a new socket option to set timeout greater than
  sysctl_tcp_fin_timeout.

Signed-off-by: Cambda Zhu <cambda@linux.alibaba.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 15:12:14 -07:00
David Ahern
8f34e53b60 ipv6: Use global sernum for dst validation with nexthop objects
Nik reported a bug with pcpu dst cache when nexthop objects are
used illustrated by the following:
    $ ip netns add foo
    $ ip -netns foo li set lo up
    $ ip -netns foo addr add 2001:db8:11::1/128 dev lo
    $ ip netns exec foo sysctl net.ipv6.conf.all.forwarding=1
    $ ip li add veth1 type veth peer name veth2
    $ ip li set veth1 up
    $ ip addr add 2001:db8:10::1/64 dev veth1
    $ ip li set dev veth2 netns foo
    $ ip -netns foo li set veth2 up
    $ ip -netns foo addr add 2001:db8:10::2/64 dev veth2
    $ ip -6 nexthop add id 100 via 2001:db8:10::2 dev veth1
    $ ip -6 route add 2001:db8:11::1/128 nhid 100

    Create a pcpu entry on cpu 0:
    $ taskset -a -c 0 ip -6 route get 2001:db8:11::1

    Re-add the route entry:
    $ ip -6 ro del 2001:db8:11::1
    $ ip -6 route add 2001:db8:11::1/128 nhid 100

    Route get on cpu 0 returns the stale pcpu:
    $ taskset -a -c 0 ip -6 route get 2001:db8:11::1
    RTNETLINK answers: Network is unreachable

    While cpu 1 works:
    $ taskset -a -c 1 ip -6 route get 2001:db8:11::1
    2001:db8:11::1 from :: via 2001:db8:10::2 dev veth1 src 2001:db8:10::1 metric 1024 pref medium

Conversion of FIB entries to work with external nexthop objects
missed an important difference between IPv4 and IPv6 - how dst
entries are invalidated when the FIB changes. IPv4 has a per-network
namespace generation id (rt_genid) that is bumped on changes to the FIB.
Checking if a dst_entry is still valid means comparing rt_genid in the
rtable to the current value of rt_genid for the namespace.

IPv6 also has a per network namespace counter, fib6_sernum, but the
count is saved per fib6_node. With the per-node counter only dst_entries
based on fib entries under the node are invalidated when changes are
made to the routes - limiting the scope of invalidations. IPv6 uses a
reference in the rt6_info, 'from', to track the corresponding fib entry
used to create the dst_entry. When validating a dst_entry, the 'from'
is used to backtrack to the fib6_node and check the sernum of it to the
cookie passed to the dst_check operation.

With the inline format (nexthop definition inline with the fib6_info),
dst_entries cached in the fib6_nh have a 1:1 correlation between fib
entries, nexthop data and dst_entries. With external nexthops, IPv6
looks more like IPv4 which means multiple fib entries across disparate
fib6_nodes can all reference the same fib6_nh. That means validation
of dst_entries based on external nexthops needs to use the IPv4 format
- the per-network namespace counter.

Add sernum to rt6_info and set it when creating a pcpu dst entry. Update
rt6_get_cookie to return sernum if it is set and update dst_check for
IPv6 to look for sernum set and based the check on it if so. Finally,
rt6_get_pcpu_route needs to validate the cached entry before returning
a pcpu entry (similar to the rt_cache_valid calls in __mkroute_input and
__mkroute_output for IPv4).

This problem only affects routes using the new, external nexthops.

Thanks to the kbuild test robot for catching the IS_ENABLED needed
around rt_genid_ipv6 before I sent this out.

Fixes: 5b98324ebe ("ipv6: Allow routes to use nexthop objects")
Reported-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David Ahern <dsahern@kernel.org>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-05-01 12:46:30 -07:00
Maor Gottlieb
6b447e76ed bonding: Add array of all slaves
Keep all slaves in array so it could be used to get the xmit slave
assume all the slaves are active.
The logic to add slave to the array is like the usable slaves, except
that we also add slaves that currently can't transmit - not up or active.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01 12:15:38 -07:00
Maor Gottlieb
34b37e204d bonding/alb: Add helper functions to get the xmit slave
Add two helper functions to get the xmit slave of bond in alb or tlb
mode. Extract the logic of find the xmit slave from the xmit flow
to function. Xmit flow will xmit through this slave and in the
following patches the new .ndo will call to the helper function
to return the xmit slave.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01 12:15:37 -07:00
Maor Gottlieb
ed7d4f023b bonding: Rename slave_arr to usable_slaves
Rename slave_arr to usable_slaves, since we will have two arrays,
one for the usable slaves and the other to all slaves.

Signed-off-by: Maor Gottlieb <maorg@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-05-01 12:15:37 -07:00
Toke Høiland-Jørgensen
b723748750 tunnel: Propagate ECT(1) when decapsulating as recommended by RFC6040
RFC 6040 recommends propagating an ECT(1) mark from an outer tunnel header
to the inner header if that inner header is already marked as ECT(0). When
RFC 6040 decapsulation was implemented, this case of propagation was not
added. This simply appears to be an oversight, so let's fix that.

Fixes: eccc1bb8d4 ("tunnel: drop packet if ECN present with not-ECT")
Reported-by: Bob Briscoe <ietf@bobbriscoe.net>
Reported-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com>
Cc: Dave Taht <dave.taht@gmail.com>
Cc: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 20:32:15 -07:00
Johannes Berg
d07dcf9aad netlink: add infrastructure to expose policies to userspace
Add, and use in generic netlink, helpers to dump out a netlink
policy to userspace, including all the range validation data,
nested policies etc.

This lets userspace discover what the kernel understands.

For families/commands other than generic netlink, the helpers
need to be used directly in an appropriate command, or we can
add some infrastructure (a new netlink family) that those can
register their policies with for introspection. I'm not that
familiar with non-generic netlink, so that's left out for now.

The data exposed to userspace also includes min and max length
for binary/string data, I've done that instead of letting the
userspace tools figure out whether min/max is intended based
on the type so that we can extend this later in the kernel, we
might want to just use the range data for example.

Because of this, I opted to not directly expose the NLA_*
values, even if some of them are already exposed via BPF, as
with min/max length we don't need to have different types here
for NLA_BINARY/NLA_MIN_LEN/NLA_EXACT_LEN, we just make them
all NL_ATTR_TYPE_BINARY with min/max length optionally set.

Similarly, we don't really need NLA_MSECS, and perhaps can
remove it in the future - but not if we encode it into the
userspace API now. It gets mapped to NL_ATTR_TYPE_U64 here.

Note that the exposing here corresponds to the strict policy
interpretation, and NLA_UNSPEC items are omitted entirely.
To get those, change them to NLA_MIN_LEN which behaves in
exactly the same way, but is exposed.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 17:51:42 -07:00
Johannes Berg
2c28ae48f2 netlink: factor out policy range helpers
Add helpers to get the policy's signed/unsigned range
validation data.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 17:51:42 -07:00
Johannes Berg
c7721c05a6 netlink: remove NLA_EXACT_LEN_WARN
Use a validation type instead, so we can later expose
the NLA_* values to userspace for policy descriptions.

Some transformations were done with this spatch:

    @@
    identifier p;
    expression X, L, A;
    @@
    struct nla_policy p[X] = {
    [A] =
    -{ .type = NLA_EXACT_LEN_WARN, .len = L },
    +NLA_POLICY_EXACT_LEN_WARN(L),
    ...
    };

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 17:51:42 -07:00
Johannes Berg
da4063bdfc netlink: allow NLA_MSECS to have range validation
Since NLA_MSECS is really equivalent to NLA_U64, allow
it to have range validation as well.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 17:51:42 -07:00
Johannes Berg
d06a09b94c netlink: extend policy range validation
Using a pointer to a struct indicating the min/max values,
extend the ability to do range validation for arbitrary
values. Small values in the s16 range can be kept in the
policy directly.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 17:51:42 -07:00
Johannes Berg
47a1494b82 netlink: remove type-unsafe validation_data pointer
In the netlink policy, we currently have a void *validation_data
that's pointing to different things:
 * a u32 value for bitfield32,
 * the netlink policy for nested/nested array
 * the string for NLA_REJECT

Remove the pointer and place appropriate type-safe items in the
union instead.

While at it, completely dissolve the pointer for the bitfield32
case and just put the value there directly.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 17:51:41 -07:00
Eric Dumazet
a70437cc09 tcp: add hrtimer slack to sack compression
Add a sysctl to control hrtimer slack, default of 100 usec.

This gives the opportunity to reduce system overhead,
and help very short RTT flows.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 13:24:01 -07:00
Mauro Carvalho Chehab
66d495d0a5 docs: networking: convert radiotap-headers.txt to ReST
- add SPDX header;
- adjust title markup;
- mark code blocks and literals as such;
- adjust identation, whitespaces and blank lines where needed;
- add to networking/index.rst.

Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:56:37 -07:00
Paolo Abeni
cfde141ea3 mptcp: move option parsing into mptcp_incoming_options()
The mptcp_options_received structure carries several per
packet flags (mp_capable, mp_join, etc.). Such fields must
be cleared on each packet, even on dropped ones or packet
not carrying any MPTCP options, but the current mptcp
code clears them only on TCP option reset.

On several races/corner cases we end-up with stray bits in
incoming options, leading to WARN_ON splats. e.g.:

[  171.164906] Bad mapping: ssn=32714 map_seq=1 map_data_len=32713
[  171.165006] WARNING: CPU: 1 PID: 5026 at net/mptcp/subflow.c:533 warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
[  171.167632] Modules linked in: ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel geneve ip6_udp_tunnel udp_tunnel macsec macvtap tap ipvlan macvlan 8021q garp mrp xfrm_interface veth netdevsim nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun binfmt_misc intel_rapl_msr intel_rapl_common rfkill kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel joydev virtio_balloon pcspkr i2c_piix4 sunrpc ip_tables xfs libcrc32c crc32c_intel serio_raw virtio_console ata_generic virtio_blk virtio_net net_failover failover ata_piix libata
[  171.199464] CPU: 1 PID: 5026 Comm: repro Not tainted 5.7.0-rc1.mptcp_f227fdf5d388+ #95
[  171.200886] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
[  171.202546] RIP: 0010:warn_bad_map (linux-mptcp/net/mptcp/subflow.c:533 linux-mptcp/net/mptcp/subflow.c:531)
[  171.206537] Code: c1 ea 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 1d 8b 55 3c 44 89 e6 48 c7 c7 20 51 13 95 e8 37 8b 22 fe <0f> 0b 48 83 c4 08 5b 5d 41 5c c3 89 4c 24 04 e8 db d6 94 fe 8b 4c
[  171.220473] RSP: 0018:ffffc90000150560 EFLAGS: 00010282
[  171.221639] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
[  171.223108] RDX: 0000000000000000 RSI: 0000000000000008 RDI: fffff5200002a09e
[  171.224388] RBP: ffff8880aa6e3c00 R08: 0000000000000001 R09: fffffbfff2ec9955
[  171.225706] R10: ffffffff9764caa7 R11: fffffbfff2ec9954 R12: 0000000000007fca
[  171.227211] R13: ffff8881066f4a7f R14: ffff8880aa6e3c00 R15: 0000000000000020
[  171.228460] FS:  00007f8623719740(0000) GS:ffff88810be00000(0000) knlGS:0000000000000000
[  171.230065] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  171.231303] CR2: 00007ffdab190a50 CR3: 00000001038ea006 CR4: 0000000000160ee0
[  171.232586] Call Trace:
[  171.233109]  <IRQ>
[  171.233531] get_mapping_status (linux-mptcp/net/mptcp/subflow.c:691)
[  171.234371] mptcp_subflow_data_available (linux-mptcp/net/mptcp/subflow.c:736 linux-mptcp/net/mptcp/subflow.c:832)
[  171.238181] subflow_state_change (linux-mptcp/net/mptcp/subflow.c:1085 (discriminator 1))
[  171.239066] tcp_fin (linux-mptcp/net/ipv4/tcp_input.c:4217)
[  171.240123] tcp_data_queue (linux-mptcp/./include/linux/compiler.h:199 linux-mptcp/net/ipv4/tcp_input.c:4822)
[  171.245083] tcp_rcv_established (linux-mptcp/./include/linux/skbuff.h:1785 linux-mptcp/./include/net/tcp.h:1774 linux-mptcp/./include/net/tcp.h:1847 linux-mptcp/net/ipv4/tcp_input.c:5238 linux-mptcp/net/ipv4/tcp_input.c:5730)
[  171.254089] tcp_v4_rcv (linux-mptcp/./include/linux/spinlock.h:393 linux-mptcp/net/ipv4/tcp_ipv4.c:2009)
[  171.258969] ip_protocol_deliver_rcu (linux-mptcp/net/ipv4/ip_input.c:204 (discriminator 1))
[  171.260214] ip_local_deliver_finish (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/ipv4/ip_input.c:232)
[  171.261389] ip_local_deliver (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:252)
[  171.265884] ip_rcv (linux-mptcp/./include/linux/netfilter.h:307 linux-mptcp/./include/linux/netfilter.h:301 linux-mptcp/net/ipv4/ip_input.c:539)
[  171.273666] process_backlog (linux-mptcp/./include/linux/rcupdate.h:651 linux-mptcp/net/core/dev.c:6135)
[  171.275328] net_rx_action (linux-mptcp/net/core/dev.c:6572 linux-mptcp/net/core/dev.c:6640)
[  171.280472] __do_softirq (linux-mptcp/./arch/x86/include/asm/jump_label.h:25 linux-mptcp/./include/linux/jump_label.h:200 linux-mptcp/./include/trace/events/irq.h:142 linux-mptcp/kernel/softirq.c:293)
[  171.281379] do_softirq_own_stack (linux-mptcp/arch/x86/entry/entry_64.S:1083)
[  171.282358]  </IRQ>

We could address the issue clearing explicitly the relevant fields
in several places - tcp_parse_option, tcp_fast_parse_options,
possibly others.

Instead we move the MPTCP option parsing into the already existing
mptcp ingress hook, so that we need to clear the fields in a single
place.

This allows us dropping an MPTCP hook from the TCP code and
removing the quite large mptcp_options_received from the tcp_sock
struct. On the flip side, the MPTCP sockets will traverse the
option space twice (in tcp_parse_option() and in
mptcp_incoming_options(). That looks acceptable: we already
do that for syn and 3rd ack packets, plain TCP socket will
benefit from it, and even MPTCP sockets will experience better
code locality, reducing the jumps between TCP and MPTCP code.

v1 -> v2:
 - rebased on current '-net' tree

Fixes: 648ef4b886 ("mptcp: Implement MPTCP receive path")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:23:22 -07:00
Paolo Abeni
263e1201a2 mptcp: consolidate synack processing.
Currently the MPTCP code uses 2 hooks to process syn-ack
packets, mptcp_rcv_synsent() and the sk_rx_dst_set()
callback.

We can drop the first, moving the relevant code into the
latter, reducing the hooking into the TCP code. This is
also needed by the next patch.

v1 -> v2:
 - use local tcp sock ptr instead of casting the sk variable
   several times - DaveM

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-30 12:23:22 -07:00
David S. Miller
323e395f19 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for nf-next:

1) Add IPS_HW_OFFLOAD status bit, from Bodong Wang.

2) Remove 128-bit limit on the set element data area, rise it
   to 64 bytes.

3) Report EOPNOTSUPP for unsupported NAT types and flags.

4) Set up nft_nat flags from the control plane path.

5) Add helper functions to set up the nf_nat_range2 structure.

6) Add netmap support for nft_nat.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-29 14:14:44 -07:00
Rajkumar Manoharan
60689de46c mac80211: fix memory overlap due to variable length param
As of now HE operation element in bss_conf includes variable length
optional field followed by other HE variable. Though the optional
field never be used, actually it is referring to next member of the
bss_conf structure which is not correct. Fix it by declaring needed
HE operation fields within bss_conf itself.

Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Link: https://lore.kernel.org/r/1587768108-25248-2-git-send-email-rmanohar@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-29 16:21:45 +02:00
Roopa Prabhu
4f80116d3d net: ipv4: add sysctl for nexthop api compatibility mode
Current route nexthop API maintains user space compatibility
with old route API by default. Dumps and netlink notifications
support both new and old API format. In systems which have
moved to the new API, this compatibility mode cancels some
of the performance benefits provided by the new nexthop API.

This patch adds new sysctl nexthop_compat_mode which is on
by default but provides the ability to turn off compatibility
mode allowing systems to run entirely with the new routing
API. Old route API behaviour and support is not modified by this
sysctl.

Uses a single sysctl to cover both ipv4 and ipv6 following
other sysctls. Covers dumps and delete notifications as
suggested by David Ahern.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28 12:50:37 -07:00
Roopa Prabhu
11dd74b338 net: ipv6: new arg skip_notify to ip6_rt_del
Used in subsequent work to skip route delete
notifications on nexthop deletes.

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-28 12:50:37 -07:00
Alain Michaud
220915857e Bluetooth: Adding driver and quirk defs for multi-role LE
This change adds the relevant driver and quirk to allow drivers to
report the le_states as being trustworthy.

This has historically been disabled as controllers did not reliably
support this. In particular, this will be used to relax this condition
for controllers that have been well tested and reliable.

	/* Most controller will fail if we try to create new connections
	 * while we have an existing one in slave role.
	 */
	if (hdev->conn_hash.le_num_slave > 0)
		return NULL;

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-04-28 11:49:01 +02:00
Sabrina Dubroca
26333c37fc xfrm: add IPv6 support for espintcp
This extends espintcp to support IPv6, building on the existing code
and the new UDPv6 encapsulation support. Most of the code is either
reused directly (stream parser, ULP) or very similar to the IPv4
variant (net/ipv6/esp6.c changes).

The separation of config options for IPv4 and IPv6 espintcp requires a
bit of Kconfig gymnastics to enable the core code.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-04-28 11:28:36 +02:00
Sabrina Dubroca
0146dca70b xfrm: add support for UDPv6 encapsulation of ESP
This patch adds support for encapsulation of ESP over UDPv6. The code
is very similar to the IPv4 encapsulation implementation, and allows
to easily add espintcp on IPv6 as a follow-up.

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-04-28 11:28:36 +02:00
Pablo Neira Ayuso
fdb9c405e3 netfilter: nf_tables: allow up to 64 bytes in the set element data area
So far, the set elements could store up to 128-bits in the data area.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-27 22:20:09 +02:00
Horatiu Vultur
c284b54590 switchdev: mrp: Extend switchdev API to offload MRP
Extend switchdev API to add support for MRP. The HW is notified in
following cases:

SWITCHDEV_OBJ_ID_MRP: This is used when a MRP instance is added/removed
  from the MRP ring.

SWITCHDEV_OBJ_ID_RING_ROLE_MRP: This is used when the role of the node
  changes. The current supported roles are MRM and MRC.

SWITCHDEV_OBJ_ID_RING_TEST_MRP: This is used when to start/stop sending
  MRP_Test frames on the mrp ring ports. This is called only on nodes that have
  the role MRM. In case this fails then the SW will generate the frames.

SWITCHDEV_OBJ_ID_RING_STATE_STATE: This is used when the ring changes it states
  to open or closed. This is required to notify HW because the MRP_Test frame
  contains the field MRP_InState which contains this information.

SWITCHDEV_ATTR_ID_MRP_PORT_STATE: This is used when the port's state is
  changed. It can be in blocking/forwarding mode.

SWITCHDEV_ATTR_ID_MRP_PORT_ROLE: This is used when port's role changes. The
  roles of the port can be primary/secondary. This is required to notify HW
  because the MRP_Test frame contains the field MRP_PortRole that contains this
  information.

Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-27 11:40:25 -07:00
Florian Westphal
071c8ed6e8 tcp: mptcp: use mptcp receive buffer space to select rcv window
In MPTCP, the receive window is shared across all subflows, because it
refers to the mptcp-level sequence space.

MPTCP receivers already place incoming packets on the mptcp socket
receive queue and will charge it to the mptcp socket rcvbuf until
userspace consumes the data.

Update __tcp_select_window to use the occupancy of the parent/mptcp
socket instead of the subflow socket in case the tcp socket is part
of a logical mptcp connection.

This commit doesn't change choice of initial window for passive or active
connections.
While it would be possible to change those as well, this adds complexity
(especially when handling MP_JOIN requests).  Furthermore, the MPTCP RFC
specifically says that a MPTCP sender 'MUST NOT use the RCV.WND field
of a TCP segment at the connection level if it does not also carry a DSS
option with a Data ACK field.'

SYN/SYNACK packets do not carry a DSS option with a Data ACK field.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25 20:37:52 -07:00
David S. Miller
d483389678 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Simple overlapping changes to linux/vermagic.h

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-25 20:18:53 -07:00
YueHaibing
5d7163a117 net: ipv6: remove unused inline function ip6_set_txhash
commit 877d1f6291 ("net: Set sk_txhash from a random number")
left behind this, remove it.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-24 16:58:34 -07:00
YueHaibing
7f023ec91c net: sched: remove unused inline function qdisc_reset_all_tx
There's no callers in-tree anymore.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-24 16:57:31 -07:00
Johannes Berg
b6b5c42e3b mac80211: fix two missing documentation entries
Add documentation for two struct entries that was missing.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200424123945.6b23a26ab5e7.I664440ab5f33442df8103253bf5b9fe84be8d58c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:40:20 +02:00
Thomas Pedersen
3b23c184f7 mac80211: add freq_offset to RX status
RX status needs a KHz component, so add freq_offset. We
can reduce the bits for the frequency since 60 GHz isn't
supported.

Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200402011810.22947-5-thomas@adapt-ip.com
[fix commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:43 +02:00
Thomas Pedersen
934f4c7dd3 cfg80211: express channels with a KHz component
Some bands (S1G) define channels centered on a non-integer
MHz. Give ieee80211_channel and cfg80211_chan_def a
freq_offset component where the final frequency can be
expressed as:

MHZ_TO_KHZ(chan->center_freq) + chan->freq_offset;

Also provide some helper functions to do the frequency
conversion and test for equality.

Retain the existing interface to frequency and channel
conversion helpers, and expose new ones which handle
frequencies in units of KHz.

Some internal functions (net/wireless/chan.c) pass around
a frequency value. Convert these to units of KHz.

mesh, ibss, wext, etc. are currently ignored.

Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200402011810.22947-3-thomas@adapt-ip.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:43 +02:00
Thomas Pedersen
b572510100 ieee80211: share 802.11 unit conversion helpers
MHZ_TO_KHZ, and KHZ_TO_MHZ are useful to drivers and
elsewhere so export these in the common ieee80211 header.
Move the power helpers also because we might as well.

Signed-off-by: Thomas Pedersen <thomas@adapt-ip.com>
Link: https://lore.kernel.org/r/20200402011810.22947-2-thomas@adapt-ip.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:43 +02:00
Mordechay Goodstein
0c197f16f7 mac80211: agg-tx: add an option to defer ADDBA transmit
Driver tells mac80211 to sends ADDBA with SSN (starting sequence number)
from the head of the queue, while the transmission of all the frames in the
queue may take a while, which causes the peer to time out. In order to
fix this scenario, add an option to defer ADDBA transmit until queue
is drained.

Signed-off-by: Mordechay Goodstein <mordechay.goodstein@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200326150855.0f27423fec75.If67daab123a27c1cbddef000d6a3f212aa6309ef@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:43 +02:00
Ilan Peer
2a392596d8 cfg80211: Parse HE membership selector
This extends the support for drivers that rebuilds IEs in the
FW (same as with HT/VHT).

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200326150855.20feaabfb484.I886252639604c8e3e84b8ef97962f1b0e4beec81@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:42 +02:00
Shaul Triebitz
d46b4ab870 mac80211: add twt_protected flag to the bss_conf structure
Add a flag to the BSS conf whether the BSS and STA support protected TWT.

Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/iwlwifi.20200326150855.1dcb2d16fa74.I74d7c007dad2601d2e39f54612fe6554dd5ab386@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:42 +02:00
Jouni Malinen
873b1cf611 mac80211: Process multicast RX registration for Action frames
Convert a user space registration for processing multicast Action frames
(NL80211_CMD_REGISTER_FRAME with NL80211_ATTR_RECEIVE_MULTICAST) to a
new enum ieee80211_filter_flags bit FIF_MCAST_ACTION so that drivers can
update their RX filter parameters appropriately, if needed.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200421144815.19175-1-jouni@codeaurora.org
[rename variables to rx_mcast_action_reg indicating action frames only]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:41 +02:00
Johannes Berg
9dba48a6ec cfg80211: support multicast RX registration
For DPP, there's a need to receive multicast action frames,
but many drivers need a special filter configuration for this.

Support announcing from userspace in the management registration
that multicast RX is required, with an extended feature flag if
the driver handles this.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20200417124013.c46238801048.Ib041d437ce0bff28a0c6d5dc915f68f1d8591002@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:41 +02:00
Johannes Berg
6cd536fe62 cfg80211: change internal management frame registration API
Almost all drivers below cfg80211 get the API wrong (except for
cfg80211) and are unable to cope with multiple registrations for
the same frame type, which is valid due to the match filter.
This seems to indicate the API is wrong, and we should maintain
the full information in cfg80211 instead of the drivers.

Change the API to no longer inform the driver about individual
registrations and unregistrations, but rather every time about
the entire state of the entire wiphy and single wdev, whenever
it may have changed. This also simplifies the code in cfg80211
as it no longer has to track exactly what was unregistered and
can free things immediately.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Arend van Spriel <arend.vanspriel@broadcom.com>
Reviewed-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20200417124300.f47f3828afc8.I7f81ef59c2c5a340d7075fb3c6d0e08e8aeffe07@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:40 +02:00
Jouni Malinen
4d797fce78 cfg80211: Unprotected Beacon frame RX indication
Extend cfg80211_rx_unprot_mlme_mgmt() to cover indication of unprotected
Beacon frames in addition to the previously used Deauthentication and
Disassociation frames. The Beacon frame case is quite similar, but has
couple of exceptions: this is used both with fully unprotected and also
incorrectly protected frames and there is a rate limit on the events to
avoid unnecessary flooding netlink events in case something goes wrong.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200401142548.6990-1-jouni@codeaurora.org
[add missing kernel-doc]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 12:33:40 +02:00
Johannes Berg
6cb5f3ea46 mac80211: populate debugfs only after cfg80211 init
When fixing the initialization race, we neglected to account for
the fact that debugfs is initialized in wiphy_register(), and
some debugfs things went missing (or rather were rerooted to the
global debugfs root).

Fix this by adding debugfs entries only after wiphy_register().
This requires some changes in the rate control code since it
currently adds debugfs at alloc time, which can no longer be
done after the reordering.

Reported-by: Jouni Malinen <j@w1.fi>
Reported-by: kernel test robot <rong.a.chen@intel.com>
Reported-by: Hauke Mehrtens <hauke@hauke-m.de>
Reported-by: Felix Fietkau <nbd@nbd.name>
Cc: stable@vger.kernel.org
Fixes: 52e04b4ce5 ("mac80211: fix race in ieee80211_register_hw()")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Sumit Garg <sumit.garg@linaro.org>
Link: https://lore.kernel.org/r/20200423111344.0e00d3346f12.Iadc76a03a55093d94391fc672e996a458702875d@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-24 11:30:13 +02:00
Fernando Gont
b75326c201 ipv6: Honor all IPv6 PIO Valid Lifetime values
RFC4862 5.5.3 e) prevents received Router Advertisements from reducing
the Valid Lifetime of configured addresses to less than two hours, thus
preventing hosts from reacting to the information provided by a router
that has positive knowledge that a prefix has become invalid.

This patch makes hosts honor all Valid Lifetime values, as per
draft-gont-6man-slaac-renum-06, Section 4.2. This is meant to help
mitigate the problem discussed in draft-ietf-v6ops-slaac-renum.

Note: Attacks aiming at disabling an advertised prefix via a Valid
Lifetime of 0 are not really more harmful than other attacks
that can be performed via forged RA messages, such as those
aiming at completely disabling a next-hop router via an RA that
advertises a Router Lifetime of 0, or performing a Denial of
Service (DoS) attack by advertising illegitimate prefixes via
forged PIOs.  In scenarios where RA-based attacks are of concern,
proper mitigations such as RA-Guard [RFC6105] [RFC7113] should
be implemented.

Signed-off-by: Fernando Gont <fgont@si6networks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-23 12:29:21 -07:00
Ioana Ciornei
788f87ac60 xdp: export the DEV_MAP_BULK_SIZE macro
Export the DEV_MAP_BULK_SIZE macro to the header file so that drivers
can directly use it as the maximum number of xdp_frames received in the
.ndo_xdp_xmit() callback.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 20:11:29 -07:00
Maciej Żenczykowski
9175d3f388 ipv6: ndisc: RFC-ietf-6man-ra-pref64-09 is now published as RFC8781
See:
  https://www.rfc-editor.org/authors/rfc8781.txt

Cc: Erik Kline <ek@google.com>
Cc: Jen Linkova <furry@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Michael Haro <mharo@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Fixes: c24a77edc9 ("ipv6: ndisc: add support for 'PREF64' dns64 prefix identifier")
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-22 12:15:07 -07:00
Sabrina Dubroca
9f0cadc32d xfrm: espintcp: save and call old ->sk_destruct
When ESP encapsulation is enabled on a TCP socket, I'm replacing the
existing ->sk_destruct callback with espintcp_destruct. We still need to
call the old callback to perform the other cleanups when the socket is
destroyed. Save the old callback, and call it from espintcp_destruct.

Fixes: e27cca96cd ("xfrm: add espintcp (RFC 8229)")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-04-20 07:34:16 +02:00
Eric Dumazet
9bacd256f1 tcp: cache line align MAX_TCP_HEADER
TCP stack is dumb in how it cooks its output packets.

Depending on MAX_HEADER value, we might chose a bad ending point
for the headers.

If we align the end of TCP headers to cache line boundary, we
make sure to always use the smallest number of cache lines,
which always help.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-18 15:47:28 -07:00
David S. Miller
513a24ffb3 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2020-04-17

Here's the first bluetooth-next pull request for the 5.8 kernel:

 - Added debugfs option to control MITM flag usage during pairing
 - Added new BT_MODE socket option
 - Added support for Qualcom QCA6390 device
 - Added support for Realtek RTL8761B device
 - Added support for mSBC audio codec over USB endpoints
 - Added framework for Microsoft HCI vendor extensions
 - Added new Read Security Information management command
 - Fixes/cleanup to link layer privacy related code
 - Various other smaller cleanups & fixes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-17 10:48:46 -07:00
Marcel Holtmann
2eb71a3a68 Bluetooth: Clear HCI_LL_RPA_RESOLUTION flag on reset
When the controller is being reset or power cycled, then the flag
HCI_LL_RPA_RESOLUTION which indicates if controller based address
resolution is active needs to be also reset.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-15 16:51:07 +03:00
Marcel Holtmann
ff3b8df2bd Bluetooth: Enable LE Enhanced Connection Complete event.
In case LL Privacy is supported by the controller, it is also a good
idea to use the LE Enhanced Connection Complete event for getting all
information about the new connection and its addresses.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-15 16:51:05 +03:00
Marcel Holtmann
55beec1071 Bluetooth: Sort list of LE features constants
The list of LE features constants has gotten a bit confused. It lost the
order and gained duplicated. Clean this up.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-15 16:50:59 +03:00
Lothar Rubusch
a710d21451 cfg80211: fix kernel-doc notation
Update missing kernel-doc annotations and fix of related warnings
at 'make htmldocs'.

Signed-off-by: Lothar Rubusch <l.rubusch@gmail.com>
Link: https://lore.kernel.org/r/20200408231013.28370-1-l.rubusch@gmail.com
[fix indentation, attribute references]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-04-14 12:40:02 +02:00
Tim Stallard
03e2a984b6 net: ipv6: do not consider routes via gateways for anycast address check
The behaviour for what is considered an anycast address changed in
commit 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after
encountering pmtu exception"). This now considers the first
address in a subnet where there is a route via a gateway
to be an anycast address.

This breaks path MTU discovery and traceroutes when a host in a
remote network uses the address at the start of a prefix
(eg 2600:: advertised as 2600::/48 in the DFZ) as ICMP errors
will not be sent to anycast addresses.

This patch excludes any routes with a gateway, or via point to
point links, like the behaviour previously from
rt6_is_gw_or_nonexthop in net/ipv6/route.c.

This can be tested with:
ip link add v1 type veth peer name v2
ip netns add test
ip netns exec test ip link set lo up
ip link set v2 netns test
ip link set v1 up
ip netns exec test ip link set v2 up
ip addr add 2001:db8::1/64 dev v1 nodad
ip addr add 2001:db8:100:: dev lo nodad
ip netns exec test ip addr add 2001:db8::2/64 dev v2 nodad
ip netns exec test ip route add unreachable 2001:db8:1::1
ip netns exec test ip route add 2001:db8:100::/64 via 2001:db8::1
ip netns exec test sysctl net.ipv6.conf.all.forwarding=1
ip route add 2001:db8:1::1 via 2001:db8::2
ping -I 2001:db8::1 2001:db8:1::1 -c1
ping -I 2001:db8:100:: 2001:db8:1::1 -c1
ip addr delete 2001:db8:100:: dev lo
ip netns delete test

Currently the first ping will get back a destination unreachable ICMP
error, but the second will never get a response, with "icmp6_send:
acast source" logged. After this patch, both get destination
unreachable ICMP replies.

Fixes: 45e4fd2668 ("ipv6: Only create RTF_CACHE routes after encountering pmtu exception")
Signed-off-by: Tim Stallard <code@timstallard.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-07 18:26:52 -07:00
Lothar Rubusch
045065f06f net: sock.h: fix skb_steal_sock() kernel-doc
Fix warnings related to kernel-doc notation, and wording in
function description.

Signed-off-by: Lothar Rubusch <l.rubusch@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Tested-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-07 18:09:30 -07:00
David S. Miller
c2c1128902 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net, they are:

1) Fix spurious overlap condition in the rbtree tree, from Stefano Brivio.

2) Fix possible uninitialized pointer dereference in nft_lookup.

3) IDLETIMER v1 target matches the Android layout, from
   Maciej Zenczykowski.

4) Dangling pointer in nf_tables_set_alloc_name, from Eric Dumazet.

5) Fix RCU warning splat in ipset find_set_type(), from Amol Grover.

6) Report EOPNOTSUPP on unsupported set flags and object types in sets.

7) Add NFT_SET_CONCAT flag to provide consistent error reporting
   when users defines set with ranges in concatenations in old kernels.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-04-07 18:08:06 -07:00
Archie Pusaka
c2aa30db74 Bluetooth: debugfs option to unset MITM flag
The BT qualification test SM/MAS/PKE/BV-01-C needs us to turn off
the MITM flag when pairing, and at the same time also set the io
capability to something other than no input no output.

Currently the MITM flag is only unset when the io capability is set
to no input no output, therefore the test cannot be executed.

This patch introduces a debugfs option to force MITM flag to be
turned off.

Signed-off-by: Archie Pusaka <apusaka@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-04-07 18:32:21 +02:00
Linus Torvalds
e14679b62d 9p pull request for inclusion in 5.7
- Fix read with O_NONBLOCK to allow incomplete read and return immediately
 - Rest is just cleanup (indent, unused field in struct, extra semicolon)
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEE/IPbcYBuWt0zoYhOq06b7GqY5nAFAl6LCyEACgkQq06b7GqY
 5nAAQg//ThE/rojrB74S8yaD3sJlfsT+BoMzGDeos9FGdsHpO8t1ISUiheAcUn/Y
 a6igfJxTefxEO5SWqvsGEousRl0sRTyLo8RuQ2CC7+zKN8Ou+eG+4s2QXfDxBD+p
 xJGJrsWD2SkQ98/hkxDY6d31BM9iQGvpvVp/c9qLwtvTlnvEMXll0JiACcnmiPRt
 z92bx8QkaPlt399qv7LltfHftxCHOMcieQlYAF3wDjLauiEdDWysiMn0NIC2izwY
 hPlsVCiV504yhvQFcXVnmbWVJe88THlBVVjuVAi9IwpU8D2+1UK1kM/yiyVkYml+
 U4JC7ZBQoqw2obSLPk5mZ6E7nGWfWnIBpt/B9dO7akP8pS9bv+uCjwg65tFEop5/
 z9nuyjoQT/dEoThveeoEoxpIbyd690n1vbhcTdGDFDX5nfglrxj0S6bNVW0k6e41
 9aPwWzAfU9Ip9GHwE6Ewf/y88mMI7rCP+o/pC8XkUUGiV+foFt3oO5wCcvGbxwgJ
 nnZ5oY+Ren/QXOmq/cN3RUWJvTRbc8TDBm6Fa0iGkk7d1OUBuRTWuBUXvxHMpO3A
 xHapDRddPcrZ9QQSJZGvMSYwG0ZUZ6MRL2qTU4X/3sIi0giApFkRMpmmo0HjvMAf
 PIFl2MG2Ok4A8yWQN98AMXy8vgs6ql6+Wb3W0b5dNFsAGEk7f9U=
 =wAe5
 -----END PGP SIGNATURE-----

Merge tag '9p-for-5.7' of git://github.com/martinetd/linux

Pull 9p updates from Dominique Martinet:
 "Not much new, but a few patches for this cycle:

   - Fix read with O_NONBLOCK to allow incomplete read and return
     immediately

   - Rest is just cleanup (indent, unused field in struct, extra
     semicolon)"

* tag '9p-for-5.7' of git://github.com/martinetd/linux:
  net/9p: remove unused p9_req_t aux field
  9p: read only once on O_NONBLOCK
  9pnet: allow making incomplete read requests
  9p: Remove unneeded semicolon
  9p: Fix Kconfig indentation
2020-04-06 08:46:59 -07:00
Pablo Neira Ayuso
a26c1e49c8 netfilter: nf_tables: do not update stateful expressions if lookup is inverted
Initialize set lookup matching element to NULL. Otherwise, the
NFT_LOOKUP_F_INV flag reverses the matching logic and it leads to
deference an uninitialized pointer to the matching element. Make sure
element data area and stateful expression are accessed if there is a
matching set element.

This patch undoes 24791b9aa1 ("netfilter: nft_set_bitmap: initialize set
element extension in lookups") which is not required anymore.

Fixes: 339706bc21 ("netfilter: nft_lookup: update element stateful expression")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-04-05 23:26:36 +02:00
Marcel Holtmann
d2a3f5f463 Bluetooth: Add HCI device identifier for VIRTIO devices
This patch assigns the next free HCI device identifier to Bluetooth
devices based on VIRTIO devices.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-05 14:53:16 +03:00
Marcel Holtmann
bc292258c5 Bluetooth: Add support for reading security information
To allow userspace to make correcty security policy decision, the kernel
needs to export a few details of the supported security features and
encryption key size information. This command exports this information
and also allows future extensions if needed.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-05 14:53:13 +03:00
Marcel Holtmann
a479036041 Bluetooth: Add support for Read Local Simple Pairing Options
With the Read Local Simple Pairing Options command it is possible to
retrieve the support for max encryption key size supported by the
controller and also if the controller correctly verifies the ECDH public
key during pairing.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Reviewed-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-05 14:53:11 +03:00
Miao-chen Chou
145373cb1b Bluetooth: Add framework for Microsoft vendor extension
Micrsoft defined a set for HCI vendor extensions. Check the following
link for details:

https://docs.microsoft.com/en-us/windows-hardware/drivers/bluetooth/microsoft-defined-bluetooth-hci-commands-and-events

This provides the basic framework to enable the extension and read its
supported features. Drivers still have to declare support for this
extension before it can be utilized by the host stack.

Signed-off-by: Miao-chen Chou <mcchou@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-05 14:53:05 +03:00
Sathish Narsimman
1f8330ea16 Bluetooth: add support to notify using SCO air mode
notifying using HCI_NOTIFY_CONN_ADD for SCO connection is generic in
case of mSBC audio. To differntiate SCO air mode introducing
HCI_NOTIFY_ENABLE_SCO_CVSD and HCI_NOTIFY_ENABLE_SCO_TRANSP.

Signed-off-by: Sathish Narsimman <sathish.narasimman@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-04-05 14:52:38 +03:00
Luiz Augusto von Dentz
3ee7b7cd83 Bluetooth: Add BT_MODE socket option
This adds BT_MODE socket option which can be used to set L2CAP modes,
including modes only supported over LE which were not supported using
the L2CAP_OPTIONS.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-04-02 08:25:19 +02:00
Linus Torvalds
29d9f30d4c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Fix the iwlwifi regression, from Johannes Berg.

   2) Support BSS coloring and 802.11 encapsulation offloading in
      hardware, from John Crispin.

   3) Fix some potential Spectre issues in qtnfmac, from Sergey
      Matyukevich.

   4) Add TTL decrement action to openvswitch, from Matteo Croce.

   5) Allow paralleization through flow_action setup by not taking the
      RTNL mutex, from Vlad Buslov.

   6) A lot of zero-length array to flexible-array conversions, from
      Gustavo A. R. Silva.

   7) Align XDP statistics names across several drivers for consistency,
      from Lorenzo Bianconi.

   8) Add various pieces of infrastructure for offloading conntrack, and
      make use of it in mlx5 driver, from Paul Blakey.

   9) Allow using listening sockets in BPF sockmap, from Jakub Sitnicki.

  10) Lots of parallelization improvements during configuration changes
      in mlxsw driver, from Ido Schimmel.

  11) Add support to devlink for generic packet traps, which report
      packets dropped during ACL processing. And use them in mlxsw
      driver. From Jiri Pirko.

  12) Support bcmgenet on ACPI, from Jeremy Linton.

  13) Make BPF compatible with RT, from Thomas Gleixnet, Alexei
      Starovoitov, and your's truly.

  14) Support XDP meta-data in virtio_net, from Yuya Kusakabe.

  15) Fix sysfs permissions when network devices change namespaces, from
      Christian Brauner.

  16) Add a flags element to ethtool_ops so that drivers can more simply
      indicate which coalescing parameters they actually support, and
      therefore the generic layer can validate the user's ethtool
      request. Use this in all drivers, from Jakub Kicinski.

  17) Offload FIFO qdisc in mlxsw, from Petr Machata.

  18) Support UDP sockets in sockmap, from Lorenz Bauer.

  19) Fix stretch ACK bugs in several TCP congestion control modules,
      from Pengcheng Yang.

  20) Support virtual functiosn in octeontx2 driver, from Tomasz
      Duszynski.

  21) Add region operations for devlink and use it in ice driver to dump
      NVM contents, from Jacob Keller.

  22) Add support for hw offload of MACSEC, from Antoine Tenart.

  23) Add support for BPF programs that can be attached to LSM hooks,
      from KP Singh.

  24) Support for multiple paths, path managers, and counters in MPTCP.
      From Peter Krystad, Paolo Abeni, Florian Westphal, Davide Caratti,
      and others.

  25) More progress on adding the netlink interface to ethtool, from
      Michal Kubecek"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2121 commits)
  net: ipv6: rpl_iptunnel: Fix potential memory leak in rpl_do_srh_inline
  cxgb4/chcr: nic-tls stats in ethtool
  net: dsa: fix oops while probing Marvell DSA switches
  net/bpfilter: remove superfluous testing message
  net: macb: Fix handling of fixed-link node
  net: dsa: ksz: Select KSZ protocol tag
  netdevsim: dev: Fix memory leak in nsim_dev_take_snapshot_write
  net: stmmac: add EHL 2.5Gbps PCI info and PCI ID
  net: stmmac: add EHL PSE0 & PSE1 1Gbps PCI info and PCI ID
  net: stmmac: create dwmac-intel.c to contain all Intel platform
  net: dsa: bcm_sf2: Support specifying VLAN tag egress rule
  net: dsa: bcm_sf2: Add support for matching VLAN TCI
  net: dsa: bcm_sf2: Move writing of CFP_DATA(5) into slicing functions
  net: dsa: bcm_sf2: Check earlier for FLOW_EXT and FLOW_MAC_EXT
  net: dsa: bcm_sf2: Disable learning for ASP port
  net: dsa: b53: Deny enslaving port 7 for 7278 into a bridge
  net: dsa: b53: Prevent tagged VLAN on port 7 for 7278
  net: dsa: b53: Restore VLAN entries upon (re)configuration
  net: dsa: bcm_sf2: Fix overflow checks
  hv_netvsc: Remove unnecessary round_up for recv_completion_cnt
  ...
2020-03-31 17:29:33 -07:00
Linus Torvalds
5b67fbfc32 Kbuild updates for v5.7
[Build system]
 
  - add CONFIG_UNUSED_KSYMS_WHITELIST, which will be useful to define
    a fixed set of export symbols for Generic Kernel Image (GKI)
 
  - allow to run 'make dt_binding_check' without .config
 
  - use full schema for checking DT examples in *.yaml files
 
  - make modpost fail for missing MODULE_IMPORT_NS(), which makes more
    sense because we know the produced modules are never loadable
 
  - Remove unused 'AS' variable
 
 [Kconfig]
 
  - sanitize DEFCONFIG_LIST, and remove ARCH_DEFCONFIG from Kconfig files
 
  - relax the 'imply' behavior so that symbols implied by y can become m
 
  - make 'imply' obey 'depends on' in order to make 'imply' really weak
 
 [Misc]
 
  - add documentation on building the kernel with Clang/LLVM
 
  - revive __HAVE_ARCH_STRLEN for 32bit sparc to use optimized strlen()
 
  - fix warning from deb-pkg builds when CONFIG_DEBUG_INFO=n
 
  - various script and Makefile cleanups
 -----BEGIN PGP SIGNATURE-----
 
 iQJJBAABCgAzFiEEbmPs18K1szRHjPqEPYsBB53g2wYFAl6DbP8VHG1hc2FoaXJv
 eUBrZXJuZWwub3JnAAoJED2LAQed4NsGAfkQALZqMCqtX9cAJej04+lnBCzwVPep
 6s8/s6vW6PF92sHv+SJtHvKSnDekcZT2xT8dkPDaVmuOye8xhENs5dFZ4tSKO5D0
 F8YkkM17mu/cylNZ2UCy/8weh6/TjsD7pa+mFqWo/++30JiXm12v3mVFR568KPXI
 kFau/3ALvY1NIr2wUAI2SOd6A4v/Epzpk0ltnFg3f5iWVFKlE03MGueAF+YZzq7v
 UrU73HdUxF/SBW2Jz3UtV9XY8P38uQmmtoDE8SZikG4PjW03q9w6pnhntDBl/H2b
 dZFg40eG7SHXN4L+OOI32ae9jePHvKpsnjeaeNoT/DZpwpuuxXu7C2EmUy+wCAnM
 Rw4+kiAVNppRMRH1GTdp1XjLY6PwPqizzZGmufwX+W3MI8oZdlLSUJLbrO73P/aF
 QR3MgkJkjvgmRVPP9fr8SNcZ39tDGI4KqLdWvjVVSC/s86aDnw/34puEfw0lj4vs
 gCi923iJQ7Y/QWX63TYZhy96pnedlwE2s6aR1InVER3+XMH9K1nW34CDaKQsp1CB
 6zyrd40+K5ETOKo3OAjq4FttlhRkEpX9nIsffCzOz6tybysHTSrCzYhfjpIAzzYj
 Et5HpXbegHShIqN44yqBumt6YkTZac6Aub9FzInW2LPzZgiofDaNesDQmnQmIZOa
 JlUyBrjXRfwkvCH0
 =wT8A
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:
 "Build system:

   - add CONFIG_UNUSED_KSYMS_WHITELIST, which will be useful to define a
     fixed set of export symbols for Generic Kernel Image (GKI)

   - allow to run 'make dt_binding_check' without .config

   - use full schema for checking DT examples in *.yaml files

   - make modpost fail for missing MODULE_IMPORT_NS(), which makes more
     sense because we know the produced modules are never loadable

   - Remove unused 'AS' variable

  Kconfig:

   - sanitize DEFCONFIG_LIST, and remove ARCH_DEFCONFIG from Kconfig
     files

   - relax the 'imply' behavior so that symbols implied by 'y' can
     become 'm'

   - make 'imply' obey 'depends on' in order to make 'imply' really weak

  Misc:

   - add documentation on building the kernel with Clang/LLVM

   - revive __HAVE_ARCH_STRLEN for 32bit sparc to use optimized strlen()

   - fix warning from deb-pkg builds when CONFIG_DEBUG_INFO=n

   - various script and Makefile cleanups"

* tag 'kbuild-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (34 commits)
  Makefile: Update kselftest help information
  kbuild: deb-pkg: fix warning when CONFIG_DEBUG_INFO is unset
  kbuild: add outputmakefile to no-dot-config-targets
  kbuild: remove AS variable
  net: wan: wanxl: refactor the firmware rebuild rule
  net: wan: wanxl: use $(M68KCC) instead of $(M68KAS) for rebuilding firmware
  net: wan: wanxl: use allow to pass CROSS_COMPILE_M68k for rebuilding firmware
  kbuild: add comment about grouped target
  kbuild: add -Wall to KBUILD_HOSTCXXFLAGS
  kconfig: remove unused variable in qconf.cc
  sparc: revive __HAVE_ARCH_STRLEN for 32bit sparc
  kbuild: refactor Makefile.dtbinst more
  kbuild: compute the dtbs_install destination more simply
  Makefile: disallow data races on gcc-10 as well
  kconfig: make 'imply' obey the direct dependency
  kconfig: allow symbols implied by y to become m
  net: drop_monitor: use IS_REACHABLE() to guard net_dm_hw_report()
  modpost: return error if module is missing ns imports and MODULE_ALLOW_MISSING_NAMESPACE_IMPORTS=n
  modpost: rework and consolidate logging interface
  kbuild: allow to run dt_binding_check without kernel configuration
  ...
2020-03-31 16:03:39 -07:00
David S. Miller
ed52f2c608 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 19:52:37 -07:00
David S. Miller
d9679cd985 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next:

1) Add support to specify a stateful expression in set definitions,
   this allows users to specify e.g. counters per set elements.

2) Flowtable software counter support.

3) Flowtable hardware offload counter support, from wenxu.

3) Parallelize flowtable hardware offload requests, from Paul Blakey.
   This includes a patch to add one work entry per offload command.

4) Several patches to rework nf_queue refcount handling, from Florian
   Westphal.

4) A few fixes for the flowtable tunnel offload: Fix crash if tunneling
   information is missing and set up indirect flow block as TC_SETUP_FT,
   patch from wenxu.

5) Stricter netlink attribute sanity check on filters, from Romain Bellan
   and Florent Fourcot.

5) Annotations to make sparse happy, from Jules Irenge.

6) Improve icmp errors in debugging information, from Haishuang Yan.

7) Fix warning in IPVS icmp error debugging, from Haishuang Yan.

8) Fix endianess issue in tcp extension header, from Sergey Marinkevich.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 19:40:46 -07:00
Ido Schimmel
c064875a63 devlink: Allow setting of packet trap group parameters
The previous patch allowed device drivers to publish their default
binding between packet trap policers and packet trap groups. However,
some users might not be content with this binding and would like to
change it.

In case user space passed a packet trap policer identifier when setting
a packet trap group, invoke the appropriate device driver callback and
pass the new policer identifier.

v2:
* Check for presence of 'DEVLINK_ATTR_TRAP_POLICER_ID' in
  devlink_trap_group_set() and bail if not present
* Add extack error message in case trap group was partially modified

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 17:54:58 -07:00
Ido Schimmel
f9f54392d2 devlink: Add packet trap group parameters support
Packet trap groups are used to aggregate logically related packet traps.
Currently, these groups allow user space to batch operations such as
setting the trap action of all member traps.

In order to prevent the CPU from being overwhelmed by too many trapped
packets, it is desirable to bind a packet trap policer to these groups.
For example, to limit all the packets that encountered an exception
during routing to 10Kpps.

Allow device drivers to bind default packet trap policers to packet trap
groups when the latter are registered with devlink.

The next patch will enable user space to change this default binding.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 17:54:58 -07:00
Ido Schimmel
1e8c661989 devlink: Add packet trap policers support
Devices capable of offloading the kernel's datapath and perform
functions such as bridging and routing must also be able to send (trap)
specific packets to the kernel (i.e., the CPU) for processing.

For example, a device acting as a multicast-aware bridge must be able to
trap IGMP membership reports to the kernel for processing by the bridge
module.

In most cases, the underlying device is capable of handling packet rates
that are several orders of magnitude higher compared to those that can
be handled by the CPU.

Therefore, in order to prevent the underlying device from overwhelming
the CPU, devices usually include packet trap policers that are able to
police the trapped packets to rates that can be handled by the CPU.

This patch allows capable device drivers to register their supported
packet trap policers with devlink. User space can then tune the
parameters of these policer (currently, rate and burst size) and read
from the device the number of packets that were dropped by the policer,
if supported.

Subsequent patches in the series will allow device drivers to create
default binding between these policers and packet trap groups and allow
user space to change the binding.

v2:
* Add 'strict_start_type' in devlink policy
* Have device drivers provide max/min rate/burst size for each policer.
  Use them to check validity of user provided parameters

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 17:54:58 -07:00
Joe Stringer
7ae215d23c bpf: Don't refcount LISTEN sockets in sk_assign()
Avoid taking a reference on listen sockets by checking the socket type
in the sk_assign and in the corresponding skb_steal_sock() code in the
the transport layer, and by ensuring that the prefetch free (sock_pfree)
function uses the same logic to check whether the socket is refcounted.

Suggested-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200329225342.16317-4-joe@wand.net.nz
2020-03-30 13:45:05 -07:00
Joe Stringer
71489e21d7 net: Track socket refcounts in skb_steal_sock()
Refactor the UDP/TCP handlers slightly to allow skb_steal_sock() to make
the determination of whether the socket is reference counted in the case
where it is prefetched by earlier logic such as early_demux.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200329225342.16317-3-joe@wand.net.nz
2020-03-30 13:45:04 -07:00
Joe Stringer
cf7fbe660f bpf: Add socket assign support
Add support for TPROXY via a new bpf helper, bpf_sk_assign().

This helper requires the BPF program to discover the socket via a call
to bpf_sk*_lookup_*(), then pass this socket to the new helper. The
helper takes its own reference to the socket in addition to any existing
reference that may or may not currently be obtained for the duration of
BPF processing. For the destination socket to receive the traffic, the
traffic must be routed towards that socket via local route. The
simplest example route is below, but in practice you may want to route
traffic more narrowly (eg by CIDR):

  $ ip route add local default dev lo

This patch avoids trying to introduce an extra bit into the skb->sk, as
that would require more invasive changes to all code interacting with
the socket to ensure that the bit is handled correctly, such as all
error-handling cases along the path from the helper in BPF through to
the orphan path in the input. Instead, we opt to use the destructor
variable to switch on the prefetch of the socket.

Signed-off-by: Joe Stringer <joe@wand.net.nz>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200329225342.16317-2-joe@wand.net.nz
2020-03-30 13:45:04 -07:00
Linus Torvalds
e59cd88028 for-5.7/io_uring-2020-03-29
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAl6BJEMQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgpie7D/9gN4zhykYDfcgamfxMtTbpla2PdTnWoJxP
 fjy/Nx2FySakmccaiCGQSQ1rzD1L67UQkJgEH6hPTomJvA4FaOmJ+ZSaExMy55LH
 ZT+nD3zQ9SCuA0DEpfxbsCP1tbnoXSMQNt8Tyh0x8PAoxp5bI0eRczOju1QWLWTS
 tjBEMZNipN6krrV9RPWT0S5Z31/yGr/sXprCSHFV9Ypzwrx58Tj2i6F9gR7FVbLs
 nV2/O8taEn0sMQIz8TVHKol/TBalluGrC4M/bOeS3faP3BPN4TT24Gtc0LAKEibk
 F49/SX7FzwhOdl43Bdkbe2bbL86p+zOLSf0IMBwMm0DJl4aiOljRUYTSYRolgGgm
 Ebw9QhemTwbxxeD2nEriA4EAeYvTx69RDlN2eVilwwfJ48Xz9fVm3GNYG7LISeON
 k3/TyZOBQH2SZ2Hc3oF2Mq9j1UPHXZHUUsUNlNcN+aM9SFHcWkRi6xZWemTJHJZ4
 zFss5RZHo0+RLBa8rrx8xaO8iWrc73+FuRhr9eSsmyPIj+OZ4ezEFRRRHwtk2fgv
 dZvD413AyCI1c+3LlBusESMsrtXyY8p9O9buNTzHy3ZUtHe0ERmYV2m/a83A5pXo
 Kia/5aJbPIC61bAkCCkiVo+W9OASJ6o5+3CXl5sM9lGTbDXjcofzewmd+RHPestx
 xVbzeR9UIw==
 =bYLJ
 -----END PGP SIGNATURE-----

Merge tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block

Pull io_uring updates from Jens Axboe:
 "Here are the io_uring changes for this merge window. Light on new
  features this time around (just splice + buffer selection), lots of
  cleanups, fixes, and improvements to existing support. In particular,
  this contains:

   - Cleanup fixed file update handling for stack fallback (Hillf)

   - Re-work of how pollable async IO is handled, we no longer require
     thread offload to handle that. Instead we rely using poll to drive
     this, with task_work execution.

   - In conjunction with the above, allow expendable buffer selection,
     so that poll+recv (for example) no longer has to be a split
     operation.

   - Make sure we honor RLIMIT_FSIZE for buffered writes

   - Add support for splice (Pavel)

   - Linked work inheritance fixes and optimizations (Pavel)

   - Async work fixes and cleanups (Pavel)

   - Improve io-wq locking (Pavel)

   - Hashed link write improvements (Pavel)

   - SETUP_IOPOLL|SETUP_SQPOLL improvements (Xiaoguang)"

* tag 'for-5.7/io_uring-2020-03-29' of git://git.kernel.dk/linux-block: (54 commits)
  io_uring: cleanup io_alloc_async_ctx()
  io_uring: fix missing 'return' in comment
  io-wq: handle hashed writes in chains
  io-uring: drop 'free_pfile' in struct io_file_put
  io-uring: drop completion when removing file
  io_uring: Fix ->data corruption on re-enqueue
  io-wq: close cancel gap for hashed linked work
  io_uring: make spdxcheck.py happy
  io_uring: honor original task RLIMIT_FSIZE
  io-wq: hash dependent work
  io-wq: split hashing and enqueueing
  io-wq: don't resched if there is no work
  io-wq: remove duplicated cancel code
  io_uring: fix truncated async read/readv and write/writev retry
  io_uring: dual license io_uring.h uapi header
  io_uring: io_uring_enter(2) don't poll while SETUP_IOPOLL|SETUP_SQPOLL enabled
  io_uring: Fix unused function warnings
  io_uring: add end-of-bits marker and build time verify it
  io_uring: provide means of removing buffers
  io_uring: add IOSQE_BUFFER_SELECT support for IORING_OP_RECVMSG
  ...
2020-03-30 12:18:49 -07:00
David S. Miller
033c6f3b78 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2020-03-29

Here are a few more Bluetooth patches for the 5.7 kernel:

 - Fix assumption of encryption key size when reading fails
 - Add support for DEFER_SETUP with L2CAP Enhanced Credit Based Mode
 - Fix issue with auto-connected devices
 - Fix suspend handling when entering the state fails
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 11:49:14 -07:00
Vladimir Oltean
342971766c net: dsa: add port policers
The approach taken to pass the port policer methods on to drivers is
pragmatic. It is similar to the port mirroring implementation (in that
the DSA core does all of the filter block interaction and only passes
simple operations for the driver to implement) and dissimilar to how
flow-based policers are going to be implemented (where the driver has
full control over the flow_cls_offload data structure).

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 11:44:00 -07:00
Eran Ben Elisha
ba7d16c779 devlink: Implicitly set auto recover flag when registering health reporter
When health reporter is registered to devlink, devlink will implicitly set
auto recover if and only if the reporter has a recover method. No reason
to explicitly get the auto recover flag from the driver.

Remove this flag from all drivers that called
devlink_health_reporter_create.

All existing health reporters set auto recovery to true if they have a
recover method.

Yet, administrator can unset auto recover via netlink command as prior to
this patch.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 11:17:34 -07:00
Jiri Pirko
93a129eb8c net: sched: expose HW stats types per action used by drivers
It may be up to the driver (in case ANY HW stats is passed) to select
which type of HW stats he is going to use. Add an infrastructure to
expose this information to user.

$ tc filter add dev enp3s0np1 ingress proto ip handle 1 pref 1 flower dst_ip 192.168.1.1 action drop
$ tc -s filter show dev enp3s0np1 ingress
filter protocol ip pref 1 flower chain 0
filter protocol ip pref 1 flower chain 0 handle 0x1
  eth_type ipv4
  dst_ip 192.168.1.1
  in_hw in_hw_count 2
        action order 1: gact action drop
         random type none pass val 0
         index 1 ref 1 bind 1 installed 10 sec used 10 sec
        Action statistics:
        Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0)
        backlog 0b 0p requeues 0
        used_hw_stats immediate     <<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 11:06:49 -07:00
Jiri Pirko
8953b0770f net: introduce nla_put_bitfield32() helper and use it
Introduce a helper to pass value and selector to. The helper packs them
into struct and puts them into netlink message.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 11:06:49 -07:00
David S. Miller
acc086bfb9 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2020-03-28

1) Use kmem_cache_zalloc() instead of kmem_cache_alloc()
   in xfrm_state_alloc(). From Huang Zijiang.

2) esp_output_fill_trailer() is the same in IPv4 and IPv6,
   so share this function to avoide code duplcation.
   From Raed Salem.

3) Add offload support for esp beet mode.
   From Xin Long.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-30 10:59:20 -07:00
Alexander Aring
a7a29f9c36 net: ipv6: add rpl sr tunnel
This patch adds functionality to configure routes for RPL source routing
functionality. There is no IPIP functionality yet implemented which can
be added later when the cases when to use IPv6 encapuslation comes more
clear.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:30:57 -07:00
Alexander Aring
faee676944 net: add net available in build_state
The build_state callback of lwtunnel doesn't contain the net namespace
structure yet. This patch will add it so we can check on specific
address configuration at creation time of rpl source routes.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:30:57 -07:00
Alexander Aring
8610c7c6e3 net: ipv6: add support for rpl sr exthdr
This patch adds rpl source routing receive handling. Everything works
only if sysconf "rpl_seg_enabled" and source routing is enabled. Mostly
the same behaviour as IPv6 segmentation routing. To handle compression
and uncompression a rpl.c file is created which contains the necessary
functionality. The receive handling will also care about IPv6
encapsulated so far it's specified as possible nexthdr in RFC 6554.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:30:57 -07:00
Alexander Aring
f37c605936 addrconf: add functionality to check on rpl requirements
This patch adds a functionality to addrconf to check on a specific RPL
address configuration. According to RFC 6554:

To detect loops in the SRH, a router MUST determine if the SRH
includes multiple addresses assigned to any interface on that
router. If such addresses appear more than once and are separated by
at least one address not assigned to that router.

Signed-off-by: Alexander Aring <alex.aring@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:30:57 -07:00
Florian Westphal
fc518953bc mptcp: add and use MIB counter infrastructure
Exported via same /proc file as the Linux TCP MIB counters, so "netstat -s"
or "nstat" will show them automatically.

The MPTCP MIB counters are allocated in a distinct pcpu area in order to
avoid bloating/wasting TCP pcpu memory.

Counters are allocated once the first MPTCP socket is created in a
network namespace and free'd on exit.

If no sockets have been allocated, all-zero mptcp counters are shown.

The MIB counter list is taken from the multipath-tcp.org kernel, but
only a few counters have been picked up so far.  The counter list can
be increased at any time later on.

v2 -> v3:
 - remove 'inline' in foo.c files (David S. Miller)

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:49 -07:00
Peter Krystad
ec3edaa7ca mptcp: Add handling of outgoing MP_JOIN requests
Subflow creation may be initiated by the path manager when
the primary connection is fully established and a remote
address has been received via ADD_ADDR.

Create an in-kernel sock and use kernel_connect() to
initiate connection.

Passive sockets can't acquire the mptcp socket lock at
subflow creation time, so an additional list protected by
a new spinlock is used to track the MPJ subflows.

Such list is spliced into conn_list tail every time the msk
socket lock is acquired, so that it will not interfere
with data flow on the original connection.

Data flow and connection failover not addressed by this commit.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
Peter Krystad
f296234c98 mptcp: Add handling of incoming MP_JOIN requests
Process the MP_JOIN option in a SYN packet with the same flow
as MP_CAPABLE but when the third ACK is received add the
subflow to the MPTCP socket subflow list instead of adding it to
the TCP socket accept queue.

The subflow is added at the end of the subflow list so it will not
interfere with the existing subflows operation and no data is
expected to be transmitted on it.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
Peter Krystad
3df523ab58 mptcp: Add ADD_ADDR handling
Add handling for sending and receiving the ADD_ADDR, ADD_ADDR6,
and RM_ADDR suboptions.

Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-29 22:14:48 -07:00
wenxu
9312eabab4 netfilter: conntrack: add nf_ct_acct_add()
Add nf_ct_acct_add function to update the conntrack counter
with packets and bytes.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30 02:05:39 +02:00
Pablo Neira Ayuso
d56aab2625 netfilter: nf_tables: skip set types that do not support for expressions
The bitmap set does not support for expressions, skip it from the
estimation step.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-30 02:05:39 +02:00
Daniel Borkmann
5a95cbb80e bpf, net: Fix build issue when net ns not configured
Fix a redefinition of 'net_gen_cookie' error that was overlooked
when net ns is not configured.

Fixes: f318903c0b ("bpf: Add netns cookie and enable it for bpf cgroup hooks")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2020-03-30 01:16:50 +02:00
Florian Westphal
119e52e664 netfilter: nf_queue: place bridge physports into queue_entry struct
The refcount is done via entry->skb, which does work fine.
Major problem: When putting the refcount of the bridge ports, we
must always put the references while the skb is still around.

However, we will need to put the references after okfn() to avoid
a possible 1 -> 0 -> 1 refcount transition, so we cannot use the
skb pointer anymore.

Place the physports in the queue entry structure instead to allow
for refcounting changes in the next patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-29 16:28:29 +02:00
Florian Westphal
dd3cc111f2 netfilter: nf_queue: make nf_queue_entry_release_refs static
This is a preparation patch, no logical changes.
Move free_entry into core and rename it to something more sensible.

Will ease followup patches which will complicate the refcount handling.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-29 16:28:29 +02:00
Daniel Borkmann
5a52ae4e32 bpf: Allow to retrieve cgroup v1 classid from v2 hooks
Today, Kubernetes is still operating on cgroups v1, however, it is
possible to retrieve the task's classid based on 'current' out of
connect(), sendmsg(), recvmsg() and bind-related hooks for orchestrators
which attach to the root cgroup v2 hook in a mixed env like in case
of Cilium, for example, in order to then correlate certain pod traffic
and use it as part of the key for BPF map lookups.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/555e1c69db7376c0947007b4951c260e1074efc3.1585323121.git.daniel@iogearbox.net
2020-03-27 19:40:38 -07:00
Daniel Borkmann
f318903c0b bpf: Add netns cookie and enable it for bpf cgroup hooks
In Cilium we're mainly using BPF cgroup hooks today in order to implement
kube-proxy free Kubernetes service translation for ClusterIP, NodePort (*),
ExternalIP, and LoadBalancer as well as HostPort mapping [0] for all traffic
between Cilium managed nodes. While this works in its current shape and avoids
packet-level NAT for inter Cilium managed node traffic, there is one major
limitation we're facing today, that is, lack of netns awareness.

In Kubernetes, the concept of Pods (which hold one or multiple containers)
has been built around network namespaces, so while we can use the global scope
of attaching to root BPF cgroup hooks also to our advantage (e.g. for exposing
NodePort ports on loopback addresses), we also have the need to differentiate
between initial network namespaces and non-initial one. For example, ExternalIP
services mandate that non-local service IPs are not to be translated from the
host (initial) network namespace as one example. Right now, we have an ugly
work-around in place where non-local service IPs for ExternalIP services are
not xlated from connect() and friends BPF hooks but instead via less efficient
packet-level NAT on the veth tc ingress hook for Pod traffic.

On top of determining whether we're in initial or non-initial network namespace
we also have a need for a socket-cookie like mechanism for network namespaces
scope. Socket cookies have the nice property that they can be combined as part
of the key structure e.g. for BPF LRU maps without having to worry that the
cookie could be recycled. We are planning to use this for our sessionAffinity
implementation for services. Therefore, add a new bpf_get_netns_cookie() helper
which would resolve both use cases at once: bpf_get_netns_cookie(NULL) would
provide the cookie for the initial network namespace while passing the context
instead of NULL would provide the cookie from the application's network namespace.
We're using a hole, so no size increase; the assignment happens only once.
Therefore this allows for a comparison on initial namespace as well as regular
cookie usage as we have today with socket cookies. We could later on enable
this helper for other program types as well as we would see need.

  (*) Both externalTrafficPolicy={Local|Cluster} types
  [0] https://github.com/cilium/cilium/blob/master/bpf/bpf_sock.c

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/c47d2346982693a9cf9da0e12690453aded4c788.1585323121.git.daniel@iogearbox.net
2020-03-27 19:40:38 -07:00
Vladimir Oltean
bff33f7e2a net: dsa: implement auto-normalization of MTU for bridge hardware datapath
Many switches don't have an explicit knob for configuring the MTU
(maximum transmission unit per interface).  Instead, they do the
length-based packet admission checks on the ingress interface, for
reasons that are easy to understand (why would you accept a packet in
the queuing subsystem if you know you're going to drop it anyway).

So it is actually the MRU that these switches permit configuring.

In Linux there only exists the IFLA_MTU netlink attribute and the
associated dev_set_mtu function. The comments like to play blind and say
that it's changing the "maximum transfer unit", which is to say that
there isn't any directionality in the meaning of the MTU word. So that
is the interpretation that this patch is giving to things: MTU == MRU.

When 2 interfaces having different MTUs are bridged, the bridge driver
MTU auto-adjustment logic kicks in: what br_mtu_auto_adjust() does is it
adjusts the MTU of the bridge net device itself (and not that of the
slave net devices) to the minimum value of all slave interfaces, in
order for forwarded packets to not exceed the MTU regardless of the
interface they are received and send on.

The idea behind this behavior, and why the slave MTUs are not adjusted,
is that normal termination from Linux over the L2 forwarding domain
should happen over the bridge net device, which _is_ properly limited by
the minimum MTU. And termination over individual slave devices is
possible even if those are bridged. But that is not "forwarding", so
there's no reason to do normalization there, since only a single
interface sees that packet.

The problem with those switches that can only control the MRU is with
the offloaded data path, where a packet received on an interface with
MRU 9000 would still be forwarded to an interface with MRU 1500. And the
br_mtu_auto_adjust() function does not really help, since the MTU
configured on the bridge net device is ignored.

In order to enforce the de-facto MTU == MRU rule for these switches, we
need to do MTU normalization, which means: in order for no packet larger
than the MTU configured on this port to be sent, then we need to limit
the MRU on all ports that this packet could possibly come from. AKA
since we are configuring the MRU via MTU, it means that all ports within
a bridge forwarding domain should have the same MTU.

And that is exactly what this patch is trying to do.

>From an implementation perspective, we try to follow the intent of the
user, otherwise there is a risk that we might livelock them (they try to
change the MTU on an already-bridged interface, but we just keep
changing it back in an attempt to keep the MTU normalized). So the MTU
that the bridge is normalized to is either:

 - The most recently changed one:

   ip link set dev swp0 master br0
   ip link set dev swp1 master br0
   ip link set dev swp0 mtu 1400

   This sequence will make swp1 inherit MTU 1400 from swp0.

 - The one of the most recently added interface to the bridge:

   ip link set dev swp0 master br0
   ip link set dev swp1 mtu 1400
   ip link set dev swp1 master br0

   The above sequence will make swp0 inherit MTU 1400 as well.

Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 16:07:25 -07:00
Vladimir Oltean
bfcb813203 net: dsa: configure the MTU for switch ports
It is useful be able to configure port policers on a switch to accept
frames of various sizes:

- Increase the MTU for better throughput from the default of 1500 if it
  is known that there is no 10/100 Mbps device in the network.
- Decrease the MTU to limit the latency of high-priority frames under
  congestion, or work around various network segments that add extra
  headers to packets which can't be fragmented.

For DSA slave ports, this is mostly a pass-through callback, called
through the regular ndo ops and at probe time (to ensure consistency
across all supported switches).

The CPU port is called with an MTU equal to the largest configured MTU
of the slave ports. The assumption is that the user might want to
sustain a bidirectional conversation with a partner over any switch
port.

The DSA master is configured the same as the CPU port, plus the tagger
overhead. Since the MTU is by definition L2 payload (sans Ethernet
header), it is up to each individual driver to figure out if it needs to
do anything special for its frame tags on the CPU port (it shouldn't
except in special cases). So the MTU does not contain the tagger
overhead on the CPU port.
However the MTU of the DSA master, minus the tagger overhead, is used as
a proxy for the MTU of the CPU port, which does not have a net device.
This is to avoid uselessly calling the .change_mtu function on the CPU
port when nothing should change.

So it is safe to assume that the DSA master and the CPU port MTUs are
apart by exactly the tagger's overhead in bytes.

Some changes were made around dsa_master_set_mtu(), function which was
now removed, for 2 reasons:
  - dev_set_mtu() already calls dev_validate_mtu(), so it's redundant to
    do the same thing in DSA
  - __dev_set_mtu() returns 0 if ops->ndo_change_mtu is an absent method
That is to say, there's no need for this function in DSA, we can safely
call dev_set_mtu() directly, take the rtnl lock when necessary, and just
propagate whatever errors get reported (since the user probably wants to
be informed).

Some inspiration (mainly in the MTU DSA notifier) was taken from a
vaguely similar patch from Murali and Florian, who are credited as
co-developers down below.

Co-developed-by: Murali Krishna Policharla <murali.policharla@broadcom.com>
Signed-off-by: Murali Krishna Policharla <murali.policharla@broadcom.com>
Co-developed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 16:07:24 -07:00
Vasundhara Volam
2d9eade8f2 devlink: Add macro for "fw.mgmt.api" to info_get cb.
Add definition and documentation for the new generic info
"fw.mgmt.api". This macro specifies the version of the software
interfaces between driver and firmware.

Cc: Jakub Kicinski <kuba@kernel.org>
Cc: Jacob Keller <jacob.e.keller@intel.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-27 15:34:42 -07:00
Paul Blakey
422c032afc netfilter: flowtable: Use rw sem as flow block lock
Currently flow offload threads are synchronized by the flow block mutex.
Use rw lock instead to increase flow insertion (read) concurrency.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27 18:42:20 +01:00
wenxu
133a2fe594 netfilter: flowtable: Fix incorrect tc_setup_type type
The indirect block setup should use TC_SETUP_FT as the type instead of
TC_SETUP_BLOCK. Adjust existing users of the indirect flow block
infrastructure.

Fixes: b5140a36da ("netfilter: flowtable: add indr block setup support")
Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27 18:41:52 +01:00
Pablo Neira Ayuso
53c2b2899a netfilter: flowtable: add counter support
Add a new flag to turn on flowtable counters which are stored in the
conntrack entry.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27 18:32:37 +01:00
Pablo Neira Ayuso
cfbd1125fc netfilter: nf_tables: add enum nft_flowtable_flags to uapi
Expose the NFT_FLOWTABLE_HW_OFFLOAD flag through uapi.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27 18:32:36 +01:00
Pablo Neira Ayuso
8ac2bd3577 netfilter: conntrack: export nf_ct_acct_update()
This function allows you to update the conntrack counters.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-27 18:32:36 +01:00
Dominique Martinet
43657496e4 net/9p: remove unused p9_req_t aux field
The p9_req_t field 'aux' has not been used in a very long time,
remove leftover field declaration

Link: http://lkml.kernel.org/r/1580941152-12973-1-git-send-email-asmadeus@codewreck.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2020-03-27 09:29:57 +00:00
Sergey Alirzaev
388f6966b0 9pnet: allow making incomplete read requests
A user doesn't necessarily want to wait for all the requested data to
be available, since the waiting time for each request is unbounded.

The new method permits sending one read request at a time and getting
the response ASAP, allowing to use 9pnet with synthetic file systems
representing arbitrary data streams.

Link: http://lkml.kernel.org/r/20200205204053.12751-1-l29ah@cock.li
Signed-off-by: Sergey Alirzaev <l29ah@cock.li>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2020-03-27 09:29:56 +00:00
Dmitry Bogdanov
b62c362450 net: macsec: add support for getting offloaded stats
When HW offloading is enabled, offloaded stats should be used, because
s/w stats are wrong and out of sync with the HW in this case.

Signed-off-by: Dmitry Bogdanov <dbogdanov@marvell.com>
Signed-off-by: Mark Starovoytov <mstarovoitov@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 20:17:36 -07:00
Antoine Tenart
8fa9137180 net: macsec: allow to reference a netdev from a MACsec context
This patch allows to reference a net_device from a MACsec context. This
is needed to allow implementing MACsec operations in net device drivers.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: Mark Starovoytov <mstarovoitov@marvell.com>
Signed-off-by: Igor Russkikh <irusskikh@marvell.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 20:17:36 -07:00
Maciej Żenczykowski
c24a77edc9 ipv6: ndisc: add support for 'PREF64' dns64 prefix identifier
This is trivial since we already have support for the entirely
identical (from the kernel's point of view) RDNSS, DNSSL, etc. that
also contain opaque data that needs to be passed down to userspace
for further processing.

As specified in draft-ietf-6man-ra-pref64-09 (while it is still a draft,
it is purely waiting on the RFC Editor for cleanups and publishing):
  PREF64 option contains lifetime and a (up to) 96-bit IPv6 prefix.

The 8-bit identifier of the option type as assigned by the IANA is 38.

Since we lack DNS64/NAT64/CLAT support in kernel at the moment,
thus this option should also be passed on to userland.

See:
  https://tools.ietf.org/html/draft-ietf-6man-ra-pref64-09
  https://www.iana.org/assignments/icmpv6-parameters/icmpv6-parameters.xhtml#icmpv6-parameters-5

Cc: Erik Kline <ek@google.com>
Cc: Jen Linkova <furry@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Michael Haro <mharo@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Acked-By: Lorenzo Colitti <lorenzo@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 20:05:58 -07:00
Guillaume Nault
e4a58ef3ce net: sched: refine extack messages in tcf_change_indev
Add an error message when device wasn't found.
While there, also set the bad attribute's offset in extack.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:52:31 -07:00
Jacob Keller
b9a17abfde devlink: implement DEVLINK_CMD_REGION_NEW
Implement support for the DEVLINK_CMD_REGION_NEW command for creating
snapshots. This new command parallels the existing
DEVLINK_CMD_REGION_DEL.

In order for DEVLINK_CMD_REGION_NEW to work for a region, the new
".snapshot" operation must be implemented in the region's ops structure.

The desired snapshot id must be provided. This helps avoid confusion on
the purpose of DEVLINK_CMD_REGION_NEW, and keeps the API simpler.

The requested id will be inserted into the xarray tracking the number of
snapshots using each id. If this id is already used by another snapshot
on any region, an error will be returned.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:39:26 -07:00
Jacob Keller
12102436ac devlink: track snapshot id usage count using an xarray
Each snapshot created for a devlink region must have an id. These ids
are supposed to be unique per "event" that caused the snapshot to be
created. Drivers call devlink_region_snapshot_id_get to obtain a new id
to use for a new event trigger. The id values are tracked per devlink,
so that the same id number can be used if a triggering event creates
multiple snapshots on different regions.

There is no mechanism for snapshot ids to ever be reused. Introduce an
xarray to store the count of how many snapshots are using a given id,
replacing the snapshot_id field previously used for picking the next id.

The devlink_region_snapshot_id_get() function will use xa_alloc to
insert an initial value of 1 value at an available slot between 0 and
U32_MAX.

The new __devlink_snapshot_id_increment() and
__devlink_snapshot_id_decrement() functions will be used to track how
many snapshots currently use an id.

Drivers must now call devlink_snapshot_id_put() in order to release
their reference of the snapshot id after adding region snapshots.

By tracking the total number of snapshots using a given id, it is
possible for the decrement() function to erase the id from the xarray
when it is not in use.

With this method, a snapshot id can become reused again once all
snapshots that referred to it have been deleted via
DEVLINK_CMD_REGION_DEL, and the driver has finished adding snapshots.

This work also paves the way to introduce a mechanism for userspace to
request a snapshot.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:39:26 -07:00
Jacob Keller
7ef19d3b1d devlink: report error once U32_MAX snapshot ids have been used
The devlink_snapshot_id_get() function returns a snapshot id. The
snapshot id is a u32, so there is no way to indicate an error code.

A future change is going to possibly add additional cases where this
function could fail. Refactor the function to return the snapshot id in
an argument, so that it can return zero or an error value.

This ensures that snapshot ids cannot be confused with error values, and
aids in the future refactor of snapshot id allocation management.

Because there is no current way to release previously used snapshot ids,
add a simple check ensuring that an error is reported in case the
snapshot_id would over flow.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:39:26 -07:00
Jacob Keller
a0a09f6bb2 devlink: convert snapshot destructor callback to region op
It does not makes sense that two snapshots for a given region would use
different destructors. Simplify snapshot creation by adding
a .destructor op for regions.

This operation will replace the data_destructor for the snapshot
creation, and makes snapshot creation easier.

Noticed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:39:26 -07:00
Jacob Keller
e893768179 devlink: prepare to support region operations
Modify the devlink region code in preparation for adding new operations
on regions.

Create a devlink_region_ops structure, and move the name pointer from
within the devlink_region structure into the ops structure (similar to
the devlink_health_reporter_ops).

This prepares the regions to enable support of additional operations in
the future such as requesting snapshots, or accessing the region
directly without a snapshot.

In order to re-use the constant strings in the mlx4 driver their
declaration must be changed to 'const char * const' to ensure the
compiler realizes that both the data and the pointer cannot change.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 19:39:26 -07:00
Petr Machata
1f40be6a34 net: flow_offload.h: Fix a comment at flow_action_entry.mangle
This field references FLOW_ACTION_PACKET_EDIT. Such action does not exist
though. Instead the field is used for FLOW_ACTION_MANGLE and _ADD.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-26 11:55:40 -07:00
David S. Miller
9fb16955fb Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Overlapping header include additions in macsec.c

A bug fix in 'net' overlapping with the removal of 'version'
string in ena_netdev.c

Overlapping test additions in selftests Makefile

Overlapping PCI ID table adjustments in iwlwifi driver.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-25 18:58:11 -07:00
Luiz Augusto von Dentz
da49b602f7 Bluetooth: L2CAP: Use DEFER_SETUP to group ECRED connections
This uses the DEFER_SETUP flag to group channels with
L2CAP_CREDIT_BASED_CONNECTION_REQ.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-25 22:16:08 +01:00
Pablo Neira Ayuso
2c64605b59 net: Fix CONFIG_NET_CLS_ACT=n and CONFIG_NFT_FWD_NETDEV={y, m} build
net/netfilter/nft_fwd_netdev.c: In function ‘nft_fwd_netdev_eval’:
    net/netfilter/nft_fwd_netdev.c:32:10: error: ‘struct sk_buff’ has no member named ‘tc_redirected’
      pkt->skb->tc_redirected = 1;
              ^~
    net/netfilter/nft_fwd_netdev.c:33:10: error: ‘struct sk_buff’ has no member named ‘tc_from_ingress’
      pkt->skb->tc_from_ingress = 1;
              ^~

To avoid a direct dependency with tc actions from netfilter, wrap the
redirect bits around CONFIG_NET_REDIRECT and move helpers to
include/linux/skbuff.h. Turn on this toggle from the ifb driver, the
only existing client of these bits in the tree.

This patch adds skb_set_redirected() that sets on the redirected bit
on the skbuff, it specifies if the packet was redirect from ingress
and resets the timestamp (timestamp reset was originally missing in the
netfilter bugfix).

Fixes: bcfabee1af ("netfilter: nft_fwd_netdev: allow to redirect to ifb via ingress")
Reported-by: noreply@ellerman.id.au
Reported-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-25 12:24:33 -07:00
Luiz Augusto von Dentz
b48596d1dc Bluetooth: L2CAP: Add get_peer_pid callback
This adds a callback to read the socket pid.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-24 09:48:00 +01:00
David Laight
af13b3c338 Remove DST_HOST
Previous changes to the IP routing code have removed all the
tests for the DS_HOST route flag.
Remove the flags and all the code that sets it.

Signed-off-by: David Laight <david.laight@aculab.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 21:57:44 -07:00
Ido Schimmel
107f167894 devlink: Only pass packet trap group identifier in trap structure
Packet trap groups are now explicitly registered by drivers and not
implicitly registered when the packet traps are registered. Therefore,
there is no need to encode entire group structure the trap is associated
with inside the trap structure.

Instead, only pass the group identifier. Refer to it as initial group
identifier, as future patches will allow user space to move traps
between groups.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 21:40:40 -07:00
Ido Schimmel
95ad9555b5 devlink: Add API to register packet trap groups
Currently, packet trap groups are implicitly registered by drivers upon
packet trap registration. When the traps are registered, each is
associated with a group and the group is created by devlink, if it does
not exist already.

This makes it difficult for drivers to pass additional attributes for
the groups.

Therefore, as a preparation for future patches that require passing
additional group attributes, add an API to explicitly register /
unregister these groups.

Next patches will convert existing drivers to use this API.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 21:40:40 -07:00
David S. Miller
adbea1a5f5 Merge branch '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/next-queue
Jeff Kirsher says:

====================
100GbE Intel Wired LAN Driver Updates 2020-03-21

Implement basic support for the devlink interface in the ice driver.
Additionally pave some necessary changes for adding a devlink region that
exposes the NVM contents.

This series first contains 5 patches for enabling and implementing full NVM
read access via the ETHTOOL_GEEPROM interface. This includes some cleanup of
endian-types, a new function for reading from the NVM and Shadow RAM as a flat
addressable space, a function to calculate the available flash size during
load, and a change to how some of the NVM version fields are stored in the
ice_nvm_info structure.

Following this is 3 patches for implementing devlink support. First, one patch
which implements the basic framework and introduces the ice_devlink.c file.
Second, a patch to implement basic .info_get support. Finally, a patch which
reads the device PBA identifier and reports it as the `board.id` value in the
.info_get response.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 21:21:33 -07:00
Jakub Kicinski
0dfb2d82af net: sched: rename more stats_types
Commit 53eca1f347 ("net: rename flow_action_hw_stats_types* ->
flow_action_hw_stats*") renamed just the flow action types and
helpers. For consistency rename variables, enums, struct members
and UAPI too (note that this UAPI was not in any official release,
yet).

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-23 20:54:23 -07:00
Jacob Keller
c90977a3c2 devlink: promote "fw.bundle_id" to a generic info version
The nfp driver uses ``fw.bundle_id`` to represent a unique identifier of the
entire firmware bundle.

A future change is going to introduce a similar notion in the ice
driver, so promote ``fw.bundle_id`` into a generic version now.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Tested-by: Andrew Bowers <andrewx.bowers@intel.com>
Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com>
2020-03-21 00:57:16 -07:00
David S. Miller
0d7043f355 Another set of changes:
* HE ranging (fine timing measurement) API support
  * hwsim gets virtio support, for use with wmediumd,
    to be able to simulate with multiple machines
  * eapol-over-nl80211 improvements to exclude preauth
  * IBSS reset support, to recover connections from
    userspace
  * and various others.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl50yM8ACgkQB8qZga/f
 l8Qe0w//Rt5vsqlkiuZtg1SHQcMHq5Y6jJ92Xfg6Yw8H3H1g40r6UWDeSsF7Dl9H
 YuC4VvfNH/FhnNz01FDVxjgjnUyjpAolxM6sYj3hW3ZIBahSBeovAdjTP59y1Fhi
 zLMP/zYLsqtlKO5qr0fK7tUmENiF/IxyALRcxqsmQ1ul5003pIpFRVPoQVDTjHZn
 JCzDTboQLmtyYiDYjejaeauo4Awjj4lDiyk3Fu+kXSv8weY4dl+w8/AM5zsiVNDs
 4HRDpJUBIwfwWuygMNwl0EQVhnpYaNQFQ+IEe9B29qiAEb95MPHhSudU1qaT+4d7
 CykikhH39/n4Wv1PBg4PrCfWInxAUiPz6K027lDOY5VtJFFbLg8GqQkn3Vy/WnHm
 CJuVaLdJ73k7AypvWdZUgSe9SaloymVd987IIMixmz0t0cqIE3kPYot//kXFjy6m
 rqvWJ3a7SIopPdkD4xXDwzASUmctCUedQ0QVnWYRQCUW5rnkZ96SHjblUA9IykYS
 Ia44BwB8ggZsax1KrqyGV9hgj8mh9srLmEjtRPhxVD3ncE1ewiQaBRwv56J0EKNu
 yRnV10dNMCtUyT3j9dRpHa8zuNpS/A6ShXe4aGgoXAz+ShjNOBj2GxXtolsDJzrd
 zIqZH8gcWaeV+8yWhaGx8d82gvY8PhRvJl+AMax7Ds2avOZ/pa0=
 =t3QF
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-net-next-2020-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Another set of changes:
 * HE ranging (fine timing measurement) API support
 * hwsim gets virtio support, for use with wmediumd,
   to be able to simulate with multiple machines
 * eapol-over-nl80211 improvements to exclude preauth
 * IBSS reset support, to recover connections from
   userspace
 * and various others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-20 08:57:38 -07:00
YueHaibing
c0fd336ea4 bpf, tcp: Make tcp_bpf_recvmsg static
After commit f747632b60 ("bpf: sockmap: Move generic sockmap
hooks from BPF TCP"), tcp_bpf_recvmsg() is not used out of
tcp_bpf.c, so make it static and remove it from tcp.h. Also move
it to BPF_STREAM_PARSER #ifdef to fix unused function warnings.

Fixes: f747632b60 ("bpf: sockmap: Move generic sockmap hooks from BPF TCP")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200320023426.60684-3-yuehaibing@huawei.com
2020-03-20 15:56:55 +01:00
Veerendranath Jakkam
7fc82af856 cfg80211: Configure PMK lifetime and reauth threshold for PMKSA entries
Drivers that trigger roaming need to know the lifetime of the configured
PMKSA for deciding whether to trigger the full or PMKSA cache based
authentication. The configured PMKSA is invalid after the PMK lifetime
has expired and must not be used after that and the STA needs to
disassociate if the PMK expires. Hence the STA is expected to refresh
the PMK with a full authentication before this happens (e.g., when
reassociating to a new BSS the next time or by performing EAPOL
reauthentication depending on the AKM) to avoid unnecessary
disconnection.

The PMK reauthentication threshold is the percentage of the PMK lifetime
value and indicates to the driver to trigger a full authentication roam
(without PMKSA caching) after the reauthentication threshold time, but
before the PMK timer has expired. Authentication methods like SAE need
to be able to generate a new PMKSA entry without having to force a
disconnection after this threshold timeout. If no roaming occurs between
the reauthentication threshold time and PMK lifetime expiration,
disassociation is still forced.

The new attributes for providing these values correspond to the dot11
MIB variables dot11RSNAConfigPMKLifetime and
dot11RSNAConfigPMKReauthThreshold.

This type of functionality is already available in cases where user
space component is in control of roaming. This commit extends that same
capability into cases where parts or all of this functionality is
offloaded to the driver.

Signed-off-by: Veerendranath Jakkam <vjakkam@codeaurora.org>
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200312235903.18462-1-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:20 +01:00
Shaul Triebitz
7e8d6f12bb nl80211: pass HE operation element to the driver
Pass the AP's HE operation element to the driver.

Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-18-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:19 +01:00
Avraham Stern
efb5520d0e nl80211/cfg80211: add support for non EDCA based ranging measurement
Add support for requesting that the ranging measurement will use
the trigger-based / non trigger-based flow instead of the EDCA based
flow.

Signed-off-by: Avraham Stern <avraham.stern@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200131111300.891737-2-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:19 +01:00
Qiujun Huang
07e9733886 mac80211: update documentation about tx power
The structure member added at some point, but the kernel-doc was not
updated.

Signed-off-by: Qiujun Huang <hqjagain@gmail.com>
Link: https://lore.kernel.org/r/20200312144424.3023-1-hqjagain@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:19 +01:00
Markus Theil
7f3f96cedd mac80211: handle no-preauth flag for control port
This patch adds support for disabling pre-auth rx over the nl80211 control
port for mac80211.

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20200312091055.54257-3-markus.theil@tu-ilmenau.de
[fix indentation slightly, squash feature enablement]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:19 +01:00
Johannes Berg
1f7e9f46c2 cfg80211: fix documentation format
Kernel-doc complains if the line isn't prefixed with an
asterisk, fix that.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200320144110.2786ad5fb234.I369d103d11c71e39e3a3f97ed68a528c5b875f1e@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-03-20 14:42:12 +01:00
David S. Miller
43861da75e Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2020-03-19

Here's the main bluetooth-next pull request for the 5.7 kernel.

 - Added wideband speech support to mgmt and the ability for HCI drivers
   to declare support for it.
 - Added initial support for L2CAP Enhanced Credit Based Mode
 - Fixed suspend handling for several use cases
 - Fixed Extended Advertising related issues
 - Added support for Realtek 8822CE device
 - Added DT bindings for QTI chip WCN3991
 - Cleanups to replace zero-length arrays with flexible-array members
 - Several other smaller cleanups & fixes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 21:33:27 -07:00
Petr Machata
2ce124109c net: tc_skbedit: Make the skbedit priority offloadable
The skbedit action "priority" is used for adjusting SKB priority. Allow
drivers to offload the action by introducing two new skbedit getters and a
new flow action, and initializing appropriately in tc_setup_flow_action().

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 21:09:19 -07:00
Petr Machata
fe93f0b225 net: tc_skbedit: Factor a helper out of is_tcf_skbedit_{mark, ptype}()
The two functions is_tcf_skbedit_mark() and is_tcf_skbedit_ptype() have a
very similar structure. A follow-up patch will add one more such function.
Instead of more cut'n'pasting, extract a helper function that checks
whether a TC action is an skbedit with the required flag. Convert the two
existing functions into thin wrappers around the helper.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 21:09:19 -07:00
Ido Schimmel
3ebaf6da07 net: sched: Do not assume RTNL is held in tunnel key action helpers
The cited commit removed RTNL from tc_setup_flow_action(), but the
function calls two tunnel key action helpers that use rtnl_dereference()
to fetch the action's parameters. This leads to "suspicious RCU usage"
warnings [1][2].

Change the helpers to use rcu_dereference_protected() while requiring
the action's lock to be held. This is safe because the two helpers are
only called from tc_setup_flow_action() which acquires the lock.

[1]
[  156.950855] =============================
[  156.955463] WARNING: suspicious RCU usage
[  156.960085] 5.6.0-rc5-custom-47426-gdfe43878d573 #2409 Not tainted
[  156.967116] -----------------------------
[  156.971728] include/net/tc_act/tc_tunnel_key.h:31 suspicious rcu_dereference_protected() usage!
[  156.981583]
[  156.981583] other info that might help us debug this:
[  156.981583]
[  156.990675]
[  156.990675] rcu_scheduler_active = 2, debug_locks = 1
[  156.998205] 1 lock held by tc/877:
[  157.002187]  #0: ffff8881cbf7bea0 (&(&p->tcfa_lock)->rlock){+...}, at: tc_setup_flow_action+0xbe/0x4f78
[  157.012866]
[  157.012866] stack backtrace:
[  157.017886] CPU: 2 PID: 877 Comm: tc Not tainted 5.6.0-rc5-custom-47426-gdfe43878d573 #2409
[  157.027253] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
[  157.037389] Call Trace:
[  157.040170]  dump_stack+0xfd/0x178
[  157.044034]  lockdep_rcu_suspicious+0x14a/0x153
[  157.049157]  tc_setup_flow_action+0x89f/0x4f78
[  157.054227]  fl_hw_replace_filter+0x375/0x640
[  157.064348]  fl_change+0x28ec/0x4f6b
[  157.088843]  tc_new_tfilter+0x15e2/0x2260
[  157.176801]  rtnetlink_rcv_msg+0x8d6/0xb60
[  157.190915]  netlink_rcv_skb+0x177/0x460
[  157.208884]  rtnetlink_rcv+0x21/0x30
[  157.212925]  netlink_unicast+0x5d0/0x7f0
[  157.227728]  netlink_sendmsg+0x981/0xe90
[  157.245416]  ____sys_sendmsg+0x76d/0x8f0
[  157.255348]  ___sys_sendmsg+0x10f/0x190
[  157.320308]  __sys_sendmsg+0x115/0x1f0
[  157.342553]  __x64_sys_sendmsg+0x7d/0xc0
[  157.346987]  do_syscall_64+0xc1/0x600
[  157.351142]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

[2]
[  157.432346] =============================
[  157.436937] WARNING: suspicious RCU usage
[  157.441537] 5.6.0-rc5-custom-47426-gdfe43878d573 #2409 Not tainted
[  157.448559] -----------------------------
[  157.453204] include/net/tc_act/tc_tunnel_key.h:43 suspicious rcu_dereference_protected() usage!
[  157.463042]
[  157.463042] other info that might help us debug this:
[  157.463042]
[  157.472112]
[  157.472112] rcu_scheduler_active = 2, debug_locks = 1
[  157.479529] 1 lock held by tc/877:
[  157.483442]  #0: ffff8881cbf7bea0 (&(&p->tcfa_lock)->rlock){+...}, at: tc_setup_flow_action+0xbe/0x4f78
[  157.494119]
[  157.494119] stack backtrace:
[  157.499114] CPU: 2 PID: 877 Comm: tc Not tainted 5.6.0-rc5-custom-47426-gdfe43878d573 #2409
[  157.508485] Hardware name: Mellanox Technologies Ltd. MSN2100-CB2FO/SA001017, BIOS 5.6.5 06/07/2016
[  157.518628] Call Trace:
[  157.521416]  dump_stack+0xfd/0x178
[  157.525293]  lockdep_rcu_suspicious+0x14a/0x153
[  157.530425]  tc_setup_flow_action+0x993/0x4f78
[  157.535505]  fl_hw_replace_filter+0x375/0x640
[  157.545650]  fl_change+0x28ec/0x4f6b
[  157.570204]  tc_new_tfilter+0x15e2/0x2260
[  157.658199]  rtnetlink_rcv_msg+0x8d6/0xb60
[  157.672315]  netlink_rcv_skb+0x177/0x460
[  157.690278]  rtnetlink_rcv+0x21/0x30
[  157.694320]  netlink_unicast+0x5d0/0x7f0
[  157.709129]  netlink_sendmsg+0x981/0xe90
[  157.726813]  ____sys_sendmsg+0x76d/0x8f0
[  157.736725]  ___sys_sendmsg+0x10f/0x190
[  157.801721]  __sys_sendmsg+0x115/0x1f0
[  157.823967]  __x64_sys_sendmsg+0x7d/0xc0
[  157.828403]  do_syscall_64+0xc1/0x600
[  157.832558]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: b15e7a6e8d ("net: sched: don't take rtnl lock during flow_action setup")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-19 20:25:52 -07:00
Pablo Neira Ayuso
65038428b2 netfilter: nf_tables: allow to specify stateful expression in set definition
This patch allows users to specify the stateful expression for the
elements in this set via NFTA_SET_EXPR. This new feature allows you to
turn on counters for all of the elements in this set.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19 11:37:31 +01:00
Pablo Neira Ayuso
c604cc691c netfilter: nf_tables: move nft_expr_clone() to nf_tables_api.c
Move the nft_expr_clone() helper function to the core.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-19 11:37:31 +01:00
David S. Miller
a58741ef1e Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Use nf_flow_offload_tuple() to fetch flow stats, from Paul Blakey.

2) Add new xt_IDLETIMER hard mode, from Manoj Basapathi.
   Follow up patch to clean up this new mode, from Dan Carpenter.

3) Add support for geneve tunnel options, from Xin Long.

4) Make sets built-in and remove modular infrastructure for sets,
   from Florian Westphal.

5) Remove unused TEMPLATE_NULLS_VAL, from Li RongQing.

6) Statify nft_pipapo_get, from Chen Wandun.

7) Use C99 flexible-array member, from Gustavo A. R. Silva.

8) More descriptive variable names for bitwise, from Jeremy Sowden.

9) Four patches to add tunnel device hardware offload to the flowtable
   infrastructure, from wenxu.

10) pipapo set supports for 8-bit grouping, from Stefano Brivio.

11) pipapo can switch between nibble and byte grouping, also from
    Stefano.

12) Add AVX2 vectorized version of pipapo, from Stefano Brivio.

13) Update pipapo to be use it for single ranges, from Stefano.

14) Add stateful expression support to elements via control plane,
    eg. counter per element.

15) Re-visit sysctls in unprivileged namespaces, from Florian Westphal.

15) Add new egress hook, from Lukas Wunner.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17 23:51:31 -07:00
Eric Dumazet
efe074c2cc net_sched: add qdisc_watchdog_schedule_range_ns()
Some packet schedulers might want to add a slack
when programming hrtimers. This can reduce number
of interrupts and increase batch sizes and thus
give good xmit_more savings.

This commit adds qdisc_watchdog_schedule_range_ns()
helper, with an extra delta_ns parameter.

Legacy qdisc_watchdog_schedule_n() becomes an inline
passing a zero slack.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17 21:16:34 -07:00
Jakub Kicinski
53eca1f347 net: rename flow_action_hw_stats_types* -> flow_action_hw_stats*
flow_action_hw_stats_types_check() helper takes one of the
FLOW_ACTION_HW_STATS_*_BIT values as input. If we align
the arguments to the opening bracket of the helper there
is no way to call this helper and stay under 80 characters.

Remove the "types" part from the new flow_action helpers
and enum values.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-17 21:12:39 -07:00
Era Mayflower
48ef50fa86 macsec: Netlink support of XPN cipher suites (IEEE 802.1AEbw)
Netlink support of extended packet number cipher suites,
allows adding and updating XPN macsec interfaces.

Added support in:
    * Creating interfaces with GCM-AES-XPN-128 and GCM-AES-XPN-256 suites.
    * Setting and getting 64bit packet numbers with of SAs.
    * Setting (only on SA creation) and getting ssci of SAs.
    * Setting salt when installing a SAK.

Added 2 cipher suite identifiers according to 802.1AE-2018 table 14-1:
    * MACSEC_CIPHER_ID_GCM_AES_XPN_128
    * MACSEC_CIPHER_ID_GCM_AES_XPN_256

In addition, added 2 new netlink attribute types:
    * MACSEC_SA_ATTR_SSCI
    * MACSEC_SA_ATTR_SALT

Depends on: macsec: Support XPN frame handling - IEEE 802.1AEbw.

Signed-off-by: Era Mayflower <mayflowerera@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16 01:42:31 -07:00
Era Mayflower
a21ecf0e03 macsec: Support XPN frame handling - IEEE 802.1AEbw
Support extended packet number cipher suites (802.1AEbw) frames handling.
This does not include the needed netlink patches.

    * Added xpn boolean field to `struct macsec_secy`.
    * Added ssci field to `struct_macsec_tx_sa` (802.1AE figure 10-5).
    * Added ssci field to `struct_macsec_rx_sa` (802.1AE figure 10-5).
    * Added salt field to `struct macsec_key` (802.1AE 10.7 NOTE 1).
    * Created pn_t type for easy access to lower and upper halves.
    * Created salt_t type for easy access to the "ssci" and "pn" parts.
    * Created `macsec_fill_iv_xpn` function to create IV in XPN mode.
    * Support in PN recovery and preliminary replay check in XPN mode.

In addition, according to IEEE 802.1AEbw figure 10-5, the PN of incoming
frame can be 0 when XPN cipher suite is used, so fixed the function
`macsec_validate_skb` to fail on PN=0 only if XPN is off.

Signed-off-by: Era Mayflower <mayflowerera@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-16 01:42:31 -07:00
Pablo Neira Ayuso
76adfafeca netfilter: nf_tables: add nft_set_elem_update_expr() helper function
This helper function runs the eval path of the stateful expression
of an existing set element.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:49 +01:00
Pablo Neira Ayuso
795a6d6b42 netfilter: nf_tables: statify nft_expr_init()
Not exposed anymore to modules, statify this function.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:48 +01:00
Pablo Neira Ayuso
a7fc936804 netfilter: nf_tables: add nft_set_elem_expr_alloc()
Add helper function to create stateful expression.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:47 +01:00
Stefano Brivio
7400b06396 nft_set_pipapo: Introduce AVX2-based lookup implementation
If the AVX2 set is available, we can exploit the repetitive
characteristic of this algorithm to provide a fast, vectorised
version by using 256-bit wide AVX2 operations for bucket loads and
bitwise intersections.

In most cases, this implementation consistently outperforms rbtree
set instances despite the fact they are configured to use a given,
single, ranged data type out of the ones used for performance
measurements by the nft_concat_range.sh kselftest.

That script, injecting packets directly on the ingoing device path
with pktgen, reports, averaged over five runs on a single AMD Epyc
7402 thread (3.35GHz, 768 KiB L1D$, 12 MiB L2$), the figures below.
CONFIG_RETPOLINE was not set here.

Note that this is not a fair comparison over hash and rbtree set
types: non-ranged entries (used to have a reference for hash types)
would be matched faster than this, and matching on a single field
only (which is the case for rbtree) is also significantly faster.

However, it's not possible at the moment to choose this set type
for non-ranged entries, and the current implementation also needs
a few minor adjustments in order to match on less than two fields.

 ---------------.-----------------------------------.------------.
 AMD Epyc 7402  |          baselines, Mpps          | this patch |
  1 thread      |___________________________________|____________|
  3.35GHz       |        |        |        |        |            |
  768KiB L1D$   | netdev |  hash  | rbtree |        |            |
 ---------------|  hook  |   no   | single |        |   pipapo   |
 type   entries |  drop  | ranges | field  | pipapo |    AVX2    |
 ---------------|--------|--------|--------|--------|------------|
 net,port       |        |        |        |        |            |
          1000  |   19.0 |   10.4 |    3.8 |    4.0 | 7.5   +87% |
 ---------------|--------|--------|--------|--------|------------|
 port,net       |        |        |        |        |            |
           100  |   18.8 |   10.3 |    5.8 |    6.3 | 8.1   +29% |
 ---------------|--------|--------|--------|--------|------------|
 net6,port      |        |        |        |        |            |
          1000  |   16.4 |    7.6 |    1.8 |    2.1 | 4.8  +128% |
 ---------------|--------|--------|--------|--------|------------|
 port,proto     |        |        |        |        |            |
         30000  |   19.6 |   11.6 |    3.9 |    0.5 | 2.6  +420% |
 ---------------|--------|--------|--------|--------|------------|
 net6,port,mac  |        |        |        |        |            |
            10  |   16.5 |    5.4 |    4.3 |    3.4 | 4.7   +38% |
 ---------------|--------|--------|--------|--------|------------|
 net6,port,mac, |        |        |        |        |            |
 proto    1000  |   16.5 |    5.7 |    1.9 |    1.4 | 3.6   +26% |
 ---------------|--------|--------|--------|--------|------------|
 net,mac        |        |        |        |        |            |
          1000  |   19.0 |    8.4 |    3.9 |    2.5 | 6.4  +156% |
 ---------------'--------'--------'--------'--------'------------'

A similar strategy could be easily reused to implement specialised
versions for other SIMD sets, and I plan to post at least a NEON
version at a later time.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:27:45 +01:00
wenxu
cfab6dbd0e netfilter: flowtable: add tunnel match offload support
This patch support both ipv4 and ipv6 tunnel_id, tunnel_src and
tunnel_dst match for flowtable offload

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:26:17 +01:00
Gustavo A. R. Silva
6daf141401 netfilter: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

Lastly, fix checkpatch.pl warning
WARNING: __aligned(size) is preferred over __attribute__((aligned(size)))
in net/bridge/netfilter/ebtables.c

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:20:16 +01:00
Florian Westphal
24d19826fc netfilter: nf_tables: make all set structs const
They do not need to be writeable anymore.

v2: remove left-over __read_mostly annotation in set_pipapo.c (Stefano)

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:20:16 +01:00
Florian Westphal
e32a4dc651 netfilter: nf_tables: make sets built-in
Placing nftables set support in an extra module is pointless:

1. nf_tables needs dynamic registeration interface for sake of one module
2. nft heavily relies on sets, e.g. even simple rule like
   "nft ... tcp dport { 80, 443 }" will not work with _SETS=n.

IOW, either nftables isn't used or both nf_tables and nf_tables_set
modules are needed anyway.

With extra module:
 307K net/netfilter/nf_tables.ko
  79K net/netfilter/nf_tables_set.ko

   text  data  bss     dec filename
 146416  3072  545  150033 nf_tables.ko
  35496  1817    0   37313 nf_tables_set.ko

This patch:
 373K net/netfilter/nf_tables.ko

 178563  4049  545  183157 nf_tables.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-03-15 15:20:16 +01:00
Petr Machata
0a7fad2376 net: sched: RED: Introduce an ECN nodrop mode
When the RED Qdisc is currently configured to enable ECN, the RED algorithm
is used to decide whether a certain SKB should be marked. If that SKB is
not ECN-capable, it is early-dropped.

It is also possible to keep all traffic in the queue, and just mark the
ECN-capable subset of it, as appropriate under the RED algorithm. Some
switches support this mode, and some installations make use of it.

To that end, add a new RED flag, TC_RED_NODROP. When the Qdisc is
configured with this flag, non-ECT traffic is enqueued instead of being
early-dropped.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14 21:03:46 -07:00
Petr Machata
14bc175d9c net: sched: Allow extending set of supported RED flags
The qdiscs RED, GRED, SFQ and CHOKE use different subsets of the same pool
of global RED flags. These are passed in tc_red_qopt.flags. However none of
these qdiscs validate the flag field, and just copy it over wholesale to
internal structures, and later dump it back. (An exception is GRED, which
does validate for VQs -- however not for the main setup.)

A broken userspace can therefore configure a qdisc with arbitrary
unsupported flags, and later expect to see the flags on qdisc dump. The
current ABI therefore allows storage of several bits of custom data to
qdisc instances of the types mentioned above. How many bits, depends on
which flags are meaningful for the qdisc in question. E.g. SFQ recognizes
flags ECN and HARDDROP, and the rest is not interpreted.

If SFQ ever needs to support ADAPTATIVE, it needs another way of doing it,
and at the same time it needs to retain the possibility to store 6 bits of
uninterpreted data. Likewise RED, which adds a new flag later in this
patchset.

To that end, this patch adds a new function, red_get_flags(), to split the
passed flags of RED-like qdiscs to flags and user bits, and
red_validate_flags() to validate the resulting configuration. It further
adds a new attribute, TCA_RED_FLAGS, to pass arbitrary flags.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-14 21:03:46 -07:00
David S. Miller
44ef976ab3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-03-13

The following pull-request contains BPF updates for your *net-next* tree.

We've added 86 non-merge commits during the last 12 day(s) which contain
a total of 107 files changed, 5771 insertions(+), 1700 deletions(-).

The main changes are:

1) Add modify_return attach type which allows to attach to a function via
   BPF trampoline and is run after the fentry and before the fexit programs
   and can pass a return code to the original caller, from KP Singh.

2) Generalize BPF's kallsyms handling and add BPF trampoline and dispatcher
   objects to be visible in /proc/kallsyms so they can be annotated in
   stack traces, from Jiri Olsa.

3) Extend BPF sockmap to allow for UDP next to existing TCP support in order
   in order to enable this for BPF based socket dispatch, from Lorenz Bauer.

4) Introduce a new bpftool 'prog profile' command which attaches to existing
   BPF programs via fentry and fexit hooks and reads out hardware counters
   during that period, from Song Liu. Example usage:

   bpftool prog profile id 337 duration 3 cycles instructions llc_misses

        4228 run_cnt
     3403698 cycles                                              (84.08%)
     3525294 instructions   #  1.04 insn per cycle               (84.05%)
          13 llc_misses     #  3.69 LLC misses per million isns  (83.50%)

5) Batch of improvements to libbpf, bpftool and BPF selftests. Also addition
   of a new bpf_link abstraction to keep in particular BPF tracing programs
   attached even when the applicaion owning them exits, from Andrii Nakryiko.

6) New bpf_get_current_pid_tgid() helper for tracing to perform PID filtering
   and which returns the PID as seen by the init namespace, from Carlos Neira.

7) Refactor of RISC-V JIT code to move out common pieces and addition of a
   new RV32G BPF JIT compiler, from Luke Nelson.

8) Add gso_size context member to __sk_buff in order to be able to know whether
   a given skb is GSO or not, from Willem de Bruijn.

9) Add a new bpf_xdp_output() helper which reuses XDP's existing perf RB output
   implementation but can be called from tracepoint programs, from Eelco Chaudron.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-13 20:52:03 -07:00
David Howells
7d7587db0d afs: Fix client call Rx-phase signal handling
Fix the handling of signals in client rxrpc calls made by the afs
filesystem.  Ignore signals completely, leaving call abandonment or
connection loss to be detected by timeouts inside AF_RXRPC.

Allowing a filesystem call to be interrupted after the entire request has
been transmitted and an abort sent means that the server may or may not
have done the action - and we don't know.  It may even be worse than that
for older servers.

Fixes: bc5e3a546d ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-13 23:04:35 +00:00
David Howells
e138aa7d32 rxrpc: Fix call interruptibility handling
Fix the interruptibility of kernel-initiated client calls so that they're
either only interruptible when they're waiting for a call slot to come
available or they're not interruptible at all.  Either way, they're not
interruptible during transmission.

This should help prevent StoreData calls from being interrupted when
writeback is in progress.  It doesn't, however, handle interruption during
the receive phase.

Userspace-initiated calls are still interruptable.  After the signal has
been handled, sendmsg() will return the amount of data copied out of the
buffer and userspace can perform another sendmsg() call to continue
transmission.

Fixes: bc5e3a546d ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Signed-off-by: David Howells <dhowells@redhat.com>
2020-03-13 23:04:30 +00:00
David S. Miller
48f5d5cb80 Merge tag 'ieee802154-for-davem-2020-03-13' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next
Stefan Schmidt says:

====================
pull-request: ieee802154-next 2020-03-13

An update from ieee802154 for *net-next*

Two small patches with updates targeting the whole tree.
Sergin does update SPI drivers to the new transfer delay handling
and Gustavo did one of his zero-length array replacement patches.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-13 11:27:22 -07:00
David S. Miller
1d34357931 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor overlapping changes, nothing serious.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 22:34:48 -07:00
Masahiro Yamada
1cd9b3abf5 net: drop_monitor: use IS_REACHABLE() to guard net_dm_hw_report()
In net/Kconfig, NET_DEVLINK implies NET_DROP_MONITOR.

The original behavior of the 'imply' keyword prevents NET_DROP_MONITOR
from being 'm' when NET_DEVLINK=y.

With the planned Kconfig change that relaxes the 'imply', the
combination of NET_DEVLINK=y and NET_DROP_MONITOR=m would be allowed.

Use IS_REACHABLE() to avoid the vmlinux link error for this case.

Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
2020-03-13 10:04:45 +09:00
Paul Blakey
ee1c45e875 flow_offload: Add flow_match_ct to get rule ct match
Add relevant getter for ct info dissector.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:00:39 -07:00
Paul Blakey
edd5861e59 net/sched: act_ct: Enable hardware offload of flow table entires
Pass the zone's flow table instance on the flow action to the drivers.
Thus, allowing drivers to register FT add/del/stats callbacks.

Finally, enable hardware offload on the flow table instance.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:00:39 -07:00
Paul Blakey
8b3646d6e0 net/sched: act_ct: Support refreshing the flow table entries
If driver deleted an FT entry, a FT failed to offload, or registered to the
flow table after flows were already added, we still get packets in
software.

For those packets, while restoring the ct state from the flow table
entry, refresh it's hardware offload.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:00:39 -07:00
Paul Blakey
30b0cf90c6 net/sched: act_ct: Support restoring conntrack info on skbs
Provide an API to restore the ct state pointer.

This may be used by drivers to restore the ct state if they
miss in tc chain after they already did the hardware connection
tracking action (ct_metadata action).

For example, consider the following rule on chain 0 that is in_hw,
however chain 1 is not_in_hw:

$ tc filter add dev ... chain 0 ... \
  flower ... action ct pipe action goto chain 1

Packets of a flow offloaded (via nf flow table offload) by the driver
hit this rule in hardware, will be marked with the ct metadata action
(mark, label, zone) that does the equivalent of the software ct action,
and when the packet jumps to hardware chain 1, there would be a miss.

CT was already processed in hardware. Therefore, the driver's miss
handling should restore the ct state on the skb, using the provided API,
and continue the packet processing in chain 1.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:00:38 -07:00
Paul Blakey
9c26ba9b1f net/sched: act_ct: Instantiate flow table entry actions
NF flow table API associate 5-tuple rule with an action list by calling
the flow table type action() CB to fill the rule's actions.

In action CB of act_ct, populate the ct offload entry actions with a new
ct_metadata action. Initialize the ct_metadata with the ct mark, label and
zone information. If ct nat was performed, then also append the relevant
packet mangle actions (e.g. ipv4/ipv6/tcp/udp header rewrites).

Drivers that offload the ft entries may match on the 5-tuple and perform
the action list.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:00:38 -07:00
Paul Blakey
978703f425 netfilter: flowtable: Add API for registering to flow table events
Let drivers to add their cb allowing them to receive flow offload events
of type TC_SETUP_CLSFLOWER (REPLACE/DEL/STATS) for flows managed by the
flow table.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 15:00:38 -07:00
David S. Miller
bf3347c4d1 Merge branch 'ct-offload' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux 2020-03-12 12:34:23 -07:00
Kuniyuki Iwashima
4b01a96742 tcp: bind(0) remove the SO_REUSEADDR restriction when ephemeral ports are exhausted.
Commit aacd9289af ("tcp: bind() use stronger
condition for bind_conflict") introduced a restriction to forbid to bind
SO_REUSEADDR enabled sockets to the same (addr, port) tuple in order to
assign ports dispersedly so that we can connect to the same remote host.

The change results in accelerating port depletion so that we fail to bind
sockets to the same local port even if we want to connect to the different
remote hosts.

You can reproduce this issue by following instructions below.

  1. # sysctl -w net.ipv4.ip_local_port_range="32768 32768"
  2. set SO_REUSEADDR to two sockets.
  3. bind two sockets to (localhost, 0) and the latter fails.

Therefore, when ephemeral ports are exhausted, bind(0) should fallback to
the legacy behaviour to enable the SO_REUSEADDR option and make it possible
to connect to different remote (addr, port) tuples.

This patch allows us to bind SO_REUSEADDR enabled sockets to the same
(addr, port) only when net.ipv4.ip_autobind_reuse is set 1 and all
ephemeral ports are exhausted. This also allows connect() and listen() to
share ports in the following way and may break some applications. So the
ip_autobind_reuse is 0 by default and disables the feature.

  1. setsockopt(sk1, SO_REUSEADDR)
  2. setsockopt(sk2, SO_REUSEADDR)
  3. bind(sk1, saddr, 0)
  4. bind(sk2, saddr, 0)
  5. connect(sk1, daddr)
  6. listen(sk2)

If it is set 1, we can fully utilize the 4-tuples, but we should use
IP_BIND_ADDRESS_NO_PORT for bind()+connect() as possible.

The notable thing is that if all sockets bound to the same port have
both SO_REUSEADDR and SO_REUSEPORT enabled, we can bind sockets to an
ephemeral port and also do listen().

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 12:08:09 -07:00
Julian Wiedmann
7c4046b1c5 Revert "net: sched: make newly activated qdiscs visible"
This reverts commit 4cda75275f
from net-next.

Brown bag time.

Michal noticed that this change doesn't work at all when
netif_set_real_num_tx_queues() gets called prior to an initial
dev_activate(), as for instance igb does.

Doing so dies with:

[   40.579142] BUG: kernel NULL pointer dereference, address: 0000000000000400
[   40.586922] #PF: supervisor read access in kernel mode
[   40.592668] #PF: error_code(0x0000) - not-present page
[   40.598405] PGD 0 P4D 0
[   40.601234] Oops: 0000 [#1] PREEMPT SMP PTI
[   40.605909] CPU: 18 PID: 1681 Comm: wickedd Tainted: G            E     5.6.0-rc3-ethnl.50-default #1
[   40.616205] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS RMLSDP.86I.R3.27.D685.1305151734 05/15/2013
[   40.627377] RIP: 0010:qdisc_hash_add.part.22+0x2e/0x90
[   40.633115] Code: 00 55 53 89 f5 48 89 fb e8 2f 9b fb ff 85 c0 74 44 48 8b 43 40 48 8b 08 69 43 38 47 86 c8 61 c1 e8 1c 48 83 e8 80 48 8d 14 c1 <48> 8b 04 c1 48 8d 4b 28 48 89 53 30 48 89 43 28 48 85 c0 48 89 0a
[   40.654080] RSP: 0018:ffffb879864934d8 EFLAGS: 00010203
[   40.659914] RAX: 0000000000000080 RBX: ffffffffb8328d80 RCX: 0000000000000000
[   40.667882] RDX: 0000000000000400 RSI: 0000000000000000 RDI: ffffffffb831faa0
[   40.675849] RBP: 0000000000000000 R08: ffffa0752c8b9088 R09: ffffa0752c8b9208
[   40.683816] R10: 0000000000000006 R11: 0000000000000000 R12: ffffa0752d734000
[   40.691783] R13: 0000000000000008 R14: 0000000000000000 R15: ffffa07113c18000
[   40.699750] FS:  00007f94548e5880(0000) GS:ffffa0752e980000(0000) knlGS:0000000000000000
[   40.708782] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   40.715189] CR2: 0000000000000400 CR3: 000000082b6ae006 CR4: 00000000001606e0
[   40.723156] Call Trace:
[   40.725888]  dev_qdisc_set_real_num_tx_queues+0x61/0x90
[   40.731725]  netif_set_real_num_tx_queues+0x94/0x1d0
[   40.737286]  __igb_open+0x19a/0x5d0 [igb]
[   40.741767]  __dev_open+0xbb/0x150
[   40.745567]  __dev_change_flags+0x157/0x1a0
[   40.750240]  dev_change_flags+0x23/0x60

[...]

Fixes: 4cda75275f ("net: sched: make newly activated qdiscs visible")
Reported-by: Michal Kubecek <mkubecek@suse.cz>
CC: Michal Kubecek <mkubecek@suse.cz>
CC: Eric Dumazet <edumazet@google.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: Cong Wang <xiyou.wangcong@gmail.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-12 11:19:24 -07:00
Julian Wiedmann
4cda75275f net: sched: make newly activated qdiscs visible
In their .attach callback, mq[prio] only add the qdiscs of the currently
active TX queues to the device's qdisc hash list.
If a user later increases the number of active TX queues, their qdiscs
are not visible via eg. 'tc qdisc show'.

Add a hook to netif_set_real_num_tx_queues() that walks all active
TX queues and adds those which are missing to the hash list.

CC: Eric Dumazet <edumazet@google.com>
CC: Jamal Hadi Salim <jhs@mojatatu.com>
CC: Cong Wang <xiyou.wangcong@gmail.com>
CC: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-11 23:17:28 -07:00
Abhishek Pandit-Subedi
4867bd007d Bluetooth: Pause discovery and advertising during suspend
To prevent spurious wake ups, we disable any discovery or advertising
when we enter suspend and restore it when we exit suspend. While paused,
we disable any management requests to modify discovery or advertising.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-11 18:03:49 +01:00
Abhishek Pandit-Subedi
dd522a7429 Bluetooth: Handle LE devices during suspend
To handle LE devices, we must first disable passive scanning and
disconnect all connected devices. Once that is complete, we update the
whitelist and re-enable scanning

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-11 18:02:46 +01:00
Abhishek Pandit-Subedi
4f40afc6c7 Bluetooth: Handle BR/EDR devices during suspend
To handle BR/EDR devices, we first disable page scan and disconnect all
connected devices. Once that is complete, we add event filters (for
devices that can wake the system) and re-enable page scan.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-11 18:01:47 +01:00
Abhishek Pandit-Subedi
9952d90ea2 Bluetooth: Handle PM_SUSPEND_PREPARE and PM_POST_SUSPEND
Register for PM_SUSPEND_PREPARE and PM_POST_SUSPEND to make sure the
Bluetooth controller is prepared correctly for suspend/resume. Implement
the registration, scheduling and task handling portions only in this
patch.

Signed-off-by: Abhishek Pandit-Subedi <abhishekpandit@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-11 18:00:48 +01:00
Jiri Pirko
a16fa28984 flow_offload: restrict driver to pass one allowed bit to flow_action_hw_stats_types_check()
The intention of this helper was to allow driver to specify one type
that it supports, so not only "any" value would pass. So make the API
more strict and allow driver to pass only 1 bit that is going
to be checked.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10 16:04:19 -07:00
Jiri Pirko
42d5fe5f9c flow_offload: turn hw_stats_type into dedicated enum
Put the values into enum and add an enum to define the bits.

Suggested-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10 16:04:19 -07:00
Jiri Pirko
a393daa899 flow_offload: fix allowed types check
Change the check to see if the passed allowed type bit is enabled.

Fixes: 319a1d1947 ("flow_offload: check for basic action hw stats type")
Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10 16:04:19 -07:00
Jiri Pirko
62751b6808 flow_offload: use flow_action_for_each in flow_action_mixed_hw_stats_types_check()
Instead of manually iterating over entries, use flow_action_for_each
helper. Move the helper and wrap it to fit to 80 cols on the way.

Signed-off-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-10 15:43:25 -07:00
Jens Axboe
0a384abfae net: abstract out normal and compat msghdr import
This splits it into two parts, one that imports the message, and one
that imports the iovec. This allows a caller to only do the first part,
and import the iovec manually afterwards.

No functional changes in this patch.

Acked-by: David Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2020-03-10 09:12:49 -06:00
Lorenz Bauer
edc6741cc6 bpf: Add sockmap hooks for UDP sockets
Add basic psock hooks for UDP sockets. This allows adding and
removing sockets, as well as automatic removal on unhash and close.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-8-lmb@cloudflare.com
2020-03-09 22:34:58 +01:00
Lorenz Bauer
f747632b60 bpf: sockmap: Move generic sockmap hooks from BPF TCP
The init, close and unhash handlers from TCP sockmap are generic,
and can be reused by UDP sockmap. Move the helpers into the sockmap code
base and expose them. This requires tcp_bpf_get_proto and tcp_bpf_clone to
be conditional on BPF_STREAM_PARSER.

The moved functions are unmodified, except that sk_psock_unlink is
renamed to sock_map_unlink to better match its behaviour.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-6-lmb@cloudflare.com
2020-03-09 22:34:58 +01:00
Lorenz Bauer
5da0040442 bpf: tcp: Guard declarations with CONFIG_NET_SOCK_MSG
tcp_bpf.c is only included in the build if CONFIG_NET_SOCK_MSG is
selected. The declaration should therefore be guarded as such.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-5-lmb@cloudflare.com
2020-03-09 22:34:58 +01:00
Lorenz Bauer
d19da360ee bpf: tcp: Move assertions into tcp_bpf_get_proto
We need to ensure that sk->sk_prot uses certain callbacks, so that
code that directly calls e.g. tcp_sendmsg in certain corner cases
works. To avoid spurious asserts, we must to do this only if
sk_psock_update_proto has not yet been called. The same invariants
apply for tcp_bpf_check_v6_needs_rebuild, so move the call as well.

Doing so allows us to merge tcp_bpf_init and tcp_bpf_reinit.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-4-lmb@cloudflare.com
2020-03-09 22:34:58 +01:00
Lorenz Bauer
7b70973d7e bpf: sockmap: Only check ULP for TCP sockets
The sock map code checks that a socket does not have an active upper
layer protocol before inserting it into the map. This requires casting
via inet_csk, which isn't valid for UDP sockets.

Guard checks for ULP by checking inet_sk(sk)->is_icsk first.

Signed-off-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200309111243.6982-2-lmb@cloudflare.com
2020-03-09 22:34:58 +01:00
Jiri Pirko
44f8658017 sched: act: allow user to specify type of HW stats for a filter
Currently, user who is adding an action expects HW to report stats,
however it does not have exact expectations about the stats types.
That is aligned with TCA_ACT_HW_STATS_TYPE_ANY.

Allow user to specify the type of HW stats for an action and require it.

Pass the information down to flow_offload layer.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-08 21:07:48 -07:00
Jiri Pirko
d7cb1e3ba1 flow_offload: introduce "disabled" HW stats type and allow it in mlxsw
Introduce new type for disabled HW stats and allow the value in
mlxsw offload.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-08 21:07:48 -07:00
Jiri Pirko
4885547951 flow_offload: introduce "delayed" HW stats type and allow it in mlx5
Introduce new type for delayed HW stats and allow the value in
mlx5 offload.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-08 21:07:48 -07:00
Jiri Pirko
d60d7ed4c8 flow_offload: introduce "immediate" HW stats type and allow it in mlxsw
Introduce new type for immediate HW stats and allow the value in
mlxsw offload.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-08 21:07:48 -07:00
Jiri Pirko
319a1d1947 flow_offload: check for basic action hw stats type
Introduce flow_action_basic_hw_stats_types_check() helper and use it
in drivers. That sanitizes the drivers which do not have support
for action HW stats types.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-08 21:07:48 -07:00
Jiri Pirko
2514921e72 flow_offload: Introduce offload of HW stats type
Initially, pass "ANY" (struct is zeroed) to the drivers as that is the
current implicit value coming down to flow_offload. Add a bool
indicating that entries have mixed HW stats type.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-08 21:07:47 -07:00
Luiz Augusto von Dentz
4be5ca67d5 Bluetooth: L2CAP: Add module option to enable ECRED mode
This should make it safe to have the code upstream without affecting
stable systems since there are a few details not sort out with ECRED
mode e.g: how to initiate multiple connections at once.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-08 10:05:40 +01:00
Luiz Augusto von Dentz
15f02b9105 Bluetooth: L2CAP: Add initial code for Enhanced Credit Based Mode
This adds the initial code for Enhanced Credit Based Mode which
introduces a new socket mode called L2CAP_MODE_EXT_FLOWCTL, which for
the most part work the same as L2CAP_MODE_LE_FLOWCTL but uses different
PDUs to setup the connections and also works over BR/EDR.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-08 10:05:40 +01:00
Luiz Augusto von Dentz
145720963b Bluetooth: L2CAP: Add definitions for Enhanced Credit Based Mode
This introduces the definitions for the new L2CAP mode called Enhanced
Credit Based Mode.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-08 10:05:39 +01:00
Alain Michaud
00bce3fb06 Bluetooth: Enable erroneous data reporting if WBS is supported
This change introduces a wide band speech setting which allows higher
level clients to query the local controller support for wide band speech
as well as set the setting state when the radio is powered off.
Internally, this setting controls if erroneous data reporting is enabled
on the controller.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-03-08 08:50:42 +01:00
Petr Machata
aaca940807 net: sched: Make FIFO Qdisc offloadable
Invoke ndo_setup_tc() as appropriate to signal init / replacement,
destroying and dumping of pFIFO / bFIFO Qdisc.

A lot of the FIFO logic is used for pFIFO_head_drop as well, but that's a
semantically very different Qdisc that isn't really in the same boat as
pFIFO / bFIFO. Split some of the functions to keep the Qdisc intact.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-05 14:03:31 -08:00
Leslie Monis
5c5840e4b9 pie: realign comment
Realign a comment after the change introduced by the
previous patch.

Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-04 13:25:55 -08:00
Leslie Monis
105e808c1d pie: remove pie_vars->accu_prob_overflows
The variable pie_vars->accu_prob is used as an accumulator for
probability values. Since probabilty values are scaled using the
MAX_PROB macro denoting (2^64 - 1), pie_vars->accu_prob is
likely to overflow as it is of type u64.

The variable pie_vars->accu_prob_overflows counts the number of
times the variable pie_vars->accu_prob overflows.

The MAX_PROB macro needs to be equal to at least (2^39 - 1) in
order to do precise calculations without any underflow. Thus
MAX_PROB can be reduced to (2^56 - 1) without affecting the
precision in calculations drastically. Doing so will eliminate
the need for the variable pie_vars->accu_prob_overflows as the
variable pie_vars->accu_prob will never overflow.

Removing the variable pie_vars->accu_prob_overflows also reduces
the size of the structure pie_vars to exactly 64 bytes.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-04 13:25:55 -08:00
Leslie Monis
90baeb9dd2 pie: use term backlog instead of qlen
Remove ambiguity by using the term backlog instead of qlen when
representing the queue length in bytes.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-04 13:25:55 -08:00
Vladimir Oltean
ed11bb1f96 net: dsa: Add bypass operations for the flower classifier-action filter
Due to the immense variety of classification keys and actions available
for tc-flower, as well as due to potentially very different DSA switch
capabilities, it doesn't make a lot of sense for the DSA mid layer to
even attempt to interpret these. So just pass them on to the underlying
switch driver.

DSA implements just the standard boilerplate for binding and unbinding
flow blocks to ports, since nobody wants to deal with that.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 18:57:49 -08:00
Paul Blakey
c34b961a24 net/sched: act_ct: Create nf flow table per zone
Use the NF flow tables infrastructure for CT offload.

Create a nf flow table per zone.

Next patches will add FT entries to this table, and do
the software offload.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 15:09:12 -08:00
Jakub Kicinski
4c16d64ea0 fib: add missing attribute validation for tun_id
Add missing netlink policy entry for FRA_TUN_ID.

Fixes: e7030878fc ("fib: Add fib rule match on tunnel id")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-03 13:28:48 -08:00
Gustavo A. R. Silva
48b77df665 net: inet_sock: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-02 11:16:28 -08:00
Gustavo A. R. Silva
6e68f499e9 net: ip6_fib: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-02 11:16:28 -08:00
Gustavo A. R. Silva
a53110609c net: ip_fib: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-02 11:16:28 -08:00
Gustavo A. R. Silva
2e83abdcb3 net: mip6: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-03-02 11:16:27 -08:00
Gustavo A. R. Silva
08ca27d027 neighbour: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:52:20 -08:00
Gustavo A. R. Silva
8661b6e7c4 net: flow_offload: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:52:20 -08:00
Gustavo A. R. Silva
a79b41ec98 net: dn_fib: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:52:20 -08:00
Gustavo A. R. Silva
53e76f4824 ndisc: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:52:20 -08:00
Gustavo A. R. Silva
c61a2a76e5 net: ipv6: mld: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:52:20 -08:00
Gustavo A. R. Silva
e8316026d5 net: lwtunnel: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:52:20 -08:00
Gustavo A. R. Silva
207644f513 net: ip6_route: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:52:20 -08:00
Gustavo A. R. Silva
97a888c2ff net: nexthop: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:52:20 -08:00
Gustavo A. R. Silva
2e7aaaa19c net: sctp: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:52:19 -08:00
Gustavo A. R. Silva
2603c29e6c net: sock_reuseport: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:52:19 -08:00
Gustavo A. R. Silva
b90feaff2a net: sched: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 21:27:02 -08:00
David S. Miller
9f0ca0c1a5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-02-28

The following pull-request contains BPF updates for your *net-next* tree.

We've added 41 non-merge commits during the last 7 day(s) which contain
a total of 49 files changed, 1383 insertions(+), 499 deletions(-).

The main changes are:

1) BPF and Real-Time nicely co-exist.

2) bpftool feature improvements.

3) retrieve bpf_sk_storage via INET_DIAG.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-29 15:53:35 -08:00
Gustavo A. R. Silva
53cb2cfaa6 cfg802154: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2020-02-29 14:39:08 +01:00
Paolo Abeni
7782040b95 unix: uses an atomic type for scm files accounting
So the scm_stat_{add,del} helper can be invoked with no
additional lock held.

This clean-up the code a bit and will make the next
patch easier.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28 12:12:53 -08:00
Gustavo A. R. Silva
e955376277 af_unix: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28 12:08:37 -08:00
Gustavo A. R. Silva
749db09304 bonding: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28 12:08:37 -08:00
Gustavo A. R. Silva
680a93166e net: mpls: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-28 12:08:37 -08:00
Gustavo A. R. Silva
8788a1ee07 6lowpan: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-28 14:51:30 +01:00
Alain Michaud
4b127bd5f2 Bluetooth: Support querying for WBS support through MGMT
This patch provides a mechanism for MGMT interface client to query the
capability of the controller to support WBS.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-28 08:53:05 +01:00
Alain Michaud
e37c57a888 Bluetooth: Fixing a few comment typos in the quirk definitions.
This change simply fixes a few typos in the quirk definitions.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-28 08:45:09 +01:00
Sathish Narsimman
a9e45698b3 Bluetooth: Remove adv set for directed advertising
Extended advertising Data is set during bluetooth initialization
by default which causes InvalidHCICommandParameters when setting
Extended advertising parameters.

As per Core Spec 5.2 Vol 2, PART E, Sec 7.8.53, for
advertising_event_property LE_LEGACY_ADV_DIRECT_IND does not
supports advertising data when the advertising set already
contains some, the controller shall return erroc code
'InvalidHCICommandParameters(0x12).

So it is required to remove adv set for handle 0x00. since we use
instance 0 for directed adv.

Signed-off-by: Sathish Narsimman <sathish.narasimman@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-28 08:38:05 +01:00
Gustavo A. R. Silva
a7e454542b Bluetooth: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-28 08:30:02 +01:00
Martin KaFai Lau
1ed4d92458 bpf: INET_DIAG support in bpf_sk_storage
This patch adds INET_DIAG support to bpf_sk_storage.

1. Although this series adds bpf_sk_storage diag capability to inet sk,
   bpf_sk_storage is in general applicable to all fullsock.  Hence, the
   bpf_sk_storage logic will operate on SK_DIAG_* nlattr.  The caller
   will pass in its specific nesting nlattr (e.g. INET_DIAG_*) as
   the argument.

2. The request will be like:
	INET_DIAG_REQ_SK_BPF_STORAGES (nla_nest) (defined in latter patch)
		SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
		SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
		......

   Considering there could have multiple bpf_sk_storages in a sk,
   instead of reusing INET_DIAG_INFO ("ss -i"),  the user can select
   some specific bpf_sk_storage to dump by specifying an array of
   SK_DIAG_BPF_STORAGE_REQ_MAP_FD.

   If no SK_DIAG_BPF_STORAGE_REQ_MAP_FD is specified (i.e. an empty
   INET_DIAG_REQ_SK_BPF_STORAGES), it will dump all bpf_sk_storages
   of a sk.

3. The reply will be like:
	INET_DIAG_BPF_SK_STORAGES (nla_nest) (defined in latter patch)
		SK_DIAG_BPF_STORAGE (nla_nest)
			SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
			SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
		SK_DIAG_BPF_STORAGE (nla_nest)
			SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
			SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
		......

4. Unlike other INET_DIAG info of a sk which is pretty static, the size
   required to dump the bpf_sk_storage(s) of a sk is dynamic as the
   system adding more bpf_sk_storage_map.  It is hard to set a static
   min_dump_alloc size.

   Hence, this series learns it at the runtime and adjust the
   cb->min_dump_alloc as it iterates all sk(s) of a system.  The
   "unsigned int *res_diag_size" in bpf_sk_storage_diag_put()
   is for this purpose.

   The next patch will update the cb->min_dump_alloc as it
   iterates the sk(s).

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Song Liu <songliubraving@fb.com>
Link: https://lore.kernel.org/bpf/20200225230421.1975729-1-kafai@fb.com
2020-02-27 18:50:19 -08:00
Gustavo A. R. Silva
da60fbe7ef NFC: Replace zero-length array with flexible-array member
The current codebase makes use of the zero-length array language
extension to the C90 standard, but the preferred mechanism to declare
variable-length types such as these ones is a flexible array member[1][2],
introduced in C99:

struct foo {
        int stuff;
        struct boo array[];
};

By making use of the mechanism above, we will get a compiler warning
in case the flexible array does not occur last in the structure, which
will help us prevent some kind of undefined behavior bugs from being
inadvertently introduced[3] to the codebase from now on.

Also, notice that, dynamic memory allocations won't be affected by
this change:

"Flexible array members have incomplete type, and so the sizeof operator
may not be applied. As a quirk of the original implementation of
zero-length arrays, sizeof evaluates to zero."[1]

This issue was found with the help of Coccinelle.

[1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
[2] https://github.com/KSPP/linux/issues/21
[3] commit 7649773293 ("cxgb3/l2t: Fix undefined behaviour")

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 12:06:20 -08:00
Russell King
5b502a7b29 net: dsa: propagate resolved link config via mac_link_up()
Propagate the resolved link configuration down via DSA's
phylink_mac_link_up() operation to allow split PCS/MAC to work.

Tested-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-27 12:02:14 -08:00
Jiri Pirko
5a2e106c74 devlink: extend devlink_trap_report() to accept cookie and pass
Add cookie argument to devlink_trap_report() allowing driver to pass in
the user cookie. Pass on the cookie down to drop monitor code.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-25 11:05:54 -08:00
Jiri Pirko
742b8cceaa drop_monitor: extend by passing cookie from driver
If driver passed along the cookie, push it through Netlink.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-25 11:05:54 -08:00
Jiri Pirko
85b0589ede devlink: add trap metadata type for cookie
Allow driver to indicate cookie metadata for registered traps.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-25 11:05:54 -08:00
Jiri Pirko
2008495d81 flow_offload: pass action cookie through offload structures
Extend struct flow_action_entry in order to hold TC action cookie
specified by user inserting the action.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-25 11:05:54 -08:00
David S. Miller
3b3e808cd8 A new set of changes:
* lots of small documentation fixes, from Jérôme Pouiller
  * beacon protection (BIGTK) support from Jouni Malinen
  * some initial code for TID configuration, from Tamizh chelvam
  * I reverted some new API before it's actually used, because
    it's wrong to mix controlled port and preauth
  * a few other cleanups/fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl5UFlsACgkQB8qZga/f
 l8SwNQ//Y/dELGODumDxY03l/Bj/+XL7HYl3Yn+J8mt2mYPC3zjTjvcRJBApiAog
 1Oyd75fd+EqjxDUT+ngN8uJLQ/yCVsdwfpZhwV7VanAOuLI9aYkIXnai/Qs96rj3
 yDIlrBVsmfsaxf/2e9UmsLmeUSm7C5s1EzaAIwoRGvcUH0pbH9WYAXF1QV+8fmXa
 yoXuHV5Bv+wOW2xWqWJsFpoV109AW24pwJm0vlILcpFP/jno2GsRvwEpnC/GJhEA
 4+jfEj0KlFkOewp0/HcqrUJp4yDEBhnhTTYgDL3hSWgKRVorKqY4/QmpKQCmpVQk
 Qrb6k+TrnLmKBQKdqfd+PKAEC9U/9Wjg0KLPyc9btBGFNSUG3QoDigzxxIvSlW6w
 2vyanDW6780FTIi8sA7sq1cBLosIyoFG44YYwbMidVtxhBk1LMRvetNAOnAJeycp
 Abbp/A2EdvzM+ZMNMRwWlsgig6WkGY7jy/zpcmQUdALM+yT2o7D5ZwxE5pa2ggds
 jf0eER1vVCEpOL70swNSZbAnDNCzBzTN64GX9gQIjdVT+nMUYQXwuOgEmho1FshD
 bgZz4PcaOCCSTice9GYCC3C9OqXBsE2DyBwytYYzahyDiQH13Iz6wEq/+tIIjzCN
 KKRdD12TXaPobLwv5zI3kogJ1I4P1fa6RZYpkngLpMbQrQ7qiDI=
 =x0rf
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-net-next-2020-02-24' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
A new set of changes:
 * lots of small documentation fixes, from Jérôme Pouiller
 * beacon protection (BIGTK) support from Jouni Malinen
 * some initial code for TID configuration, from Tamizh chelvam
 * I reverted some new API before it's actually used, because
   it's wrong to mix controlled port and preauth
 * a few other cleanups/fixes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 15:41:54 -08:00
Martin Varghese
4b5f67232d net: Special handling for IP & MPLS.
Special handling is needed in bareudp module for IP & MPLS as they
support more than one ethertypes.

MPLS has 2 ethertypes. 0x8847 for MPLS unicast and 0x8848 for MPLS multicast.
While decapsulating MPLS packet from UDP packet the tunnel destination IP
address is checked to determine the ethertype. The ethertype of the packet
will be set to 0x8848 if the  tunnel destination IP address is a multicast
IP address. The ethertype of the packet will be set to 0x8847 if the
tunnel destination IP address is a unicast IP address.

IP has 2 ethertypes.0x0800 for IPV4 and 0x86dd for IPv6. The version
field of the IP header tunnelled will be checked to determine the ethertype.

This special handling to tunnel additional ethertypes will be disabled
by default and can be enabled using a flag called multiproto. This flag can
be used only with ethertypes 0x8847 and 0x0800.

Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:31:42 -08:00
Martin Varghese
571912c69f net: UDP tunnel encapsulation module for tunnelling different protocols like MPLS, IP, NSH etc.
The Bareudp tunnel module provides a generic L3 encapsulation
tunnelling module for tunnelling different protocols like MPLS,
IP,NSH etc inside a UDP tunnel.

Signed-off-by: Martin Varghese <martin.varghese@nokia.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 13:31:42 -08:00
Jiri Pirko
ecd942a0ef devlink: add ACL generic packet traps
Add packet traps that can report packets that were dropped during ACL
processing.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-24 11:55:06 -08:00
Tamizh chelvam
370f51d5ed mac80211: Add api to support configuring TID specific configuration
Implement drv_set_tid_config api to allow TID specific
configuration and drv_reset_tid_config api to reset peer
specific TID configuration. This per-TID onfiguration
will be applied for all the connected stations when MAC is NULL.

Signed-off-by: Tamizh chelvam <tamizhr@codeaurora.org>
Link: https://lore.kernel.org/r/1579506687-18296-7-git-send-email-tamizhr@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 14:07:01 +01:00
Tamizh chelvam
04f7d142f5 nl80211: Add support to configure TID specific RTSCTS configuration
This patch adds support to configure per TID RTSCTS control
configuration to enable/disable through the
NL80211_TID_CONFIG_ATTR_RTSCTS_CTRL attribute.

Signed-off-by: Tamizh chelvam <tamizhr@codeaurora.org>
Link: https://lore.kernel.org/r/1579506687-18296-5-git-send-email-tamizhr@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 13:56:57 +01:00
Tamizh chelvam
ade274b23e nl80211: Add support to configure TID specific AMPDU configuration
This patch adds support to configure per TID AMPDU control
configuration to enable/disable aggregation through the
NL80211_TID_CONFIG_ATTR_AMPDU_CTRL attribute.

Signed-off-by: Tamizh chelvam <tamizhr@codeaurora.org>
Link: https://lore.kernel.org/r/1579506687-18296-4-git-send-email-tamizhr@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 13:56:49 +01:00
Tamizh chelvam
6a21d16c4d nl80211: Add support to configure TID specific retry configuration
This patch adds support to configure per TID retry configuration
through the NL80211_TID_CONFIG_ATTR_RETRY_SHORT and
NL80211_TID_CONFIG_ATTR_RETRY_LONG attributes. This TID specific
retry configuration will have more precedence than phy level
configuration.

Signed-off-by: Tamizh chelvam <tamizhr@codeaurora.org>
Link: https://lore.kernel.org/r/1579506687-18296-3-git-send-email-tamizhr@codeaurora.org
[rebase completely on top of my previous API changes]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 13:48:54 +01:00
Johannes Berg
3710a8a628 nl80211: modify TID-config API
Make some changes to the TID-config API:
 * use u16 in nl80211 (only, and restrict to using 8 bits for now),
   to avoid issues in the future if we ever want to use higher TIDs.
 * reject empty TIDs mask (via netlink policy)
 * change feature advertising to not use extended feature flags but
   have own mechanism for this, which simplifies the code
 * fix all variable names from 'tid' to 'tids' since it's a mask
 * change to cfg80211_ name prefixes, not ieee80211_
 * fix some minor docs/spelling things.

Change-Id: Ia234d464b3f914cdeab82f540e018855be580dce
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 12:01:10 +01:00
Tamizh chelvam
77f576deaa nl80211: Add NL command to support TID speicific configurations
Add the new NL80211_CMD_SET_TID_CONFIG command to support
data TID specific configuration. Per TID configuration is
passed in the nested NL80211_ATTR_TID_CONFIG attribute.

This patch adds support to configure per TID noack policy
through the NL80211_TID_CONFIG_ATTR_NOACK attribute.

Signed-off-by: Tamizh chelvam <tamizhr@codeaurora.org>
Link: https://lore.kernel.org/r/1579506687-18296-2-git-send-email-tamizhr@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 11:15:25 +01:00
Jouni Malinen
56be393fa8 cfg80211: Support key configuration for Beacon protection (BIGTK)
IEEE P802.11-REVmd/D3.0 adds support for protecting Beacon frames using
a new set of keys (BIGTK; key index 6..7) similarly to the way
group-addressed Robust Management frames are protected (IGTK; key index
4..5). Extend cfg80211 and nl80211 to allow the new BIGTK to be
configured. Add an extended feature flag to indicate driver support for
the new key index values to avoid array overflows in driver
implementations and also to indicate to user space when this
functionality is available.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20200222132548.20835-2-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:35:48 +01:00
Jérôme Pouiller
cd9b52bf75 cfg80211: fix indentation errors
Signed-off-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
Link: https://lore.kernel.org/r/20200221115604.594035-10-Jerome.Pouiller@silabs.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:27:29 +01:00
Jérôme Pouiller
15bc6dfbe4 cfg80211: merge documentations of field "dev"
The field "dev" was documented on two places. This patch merges the
comments.

Signed-off-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
Link: https://lore.kernel.org/r/20200221115604.594035-8-Jerome.Pouiller@silabs.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:27:22 +01:00
Jérôme Pouiller
edf77192f8 cfg80211: merge documentations of field "debugfsdir"
The field "privid" is documented twice. Comments were more or less the
same. The patch merge them.

Signed-off-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
Link: https://lore.kernel.org/r/20200221115604.594035-7-Jerome.Pouiller@silabs.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:27:16 +01:00
Jérôme Pouiller
6046fdc963 cfg80211: drop duplicated documentation of field "reg_notifier"
The field "reg_notifier" was already documented above the definition of
struct wiphy. The comment inside the definition of the struct did not
bring more information.

Signed-off-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
Link: https://lore.kernel.org/r/20200221115604.594035-6-Jerome.Pouiller@silabs.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:27:12 +01:00
Jérôme Pouiller
78fb559397 cfg80211: drop duplicated documentation of field "perm_addr"
The field "perm_addr" was already documented above the definition of
struct wiphy. Comments were almost identical.

Signed-off-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
Link: https://lore.kernel.org/r/20200221115604.594035-5-Jerome.Pouiller@silabs.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:27:05 +01:00
Jérôme Pouiller
af18d341fb cfg80211: drop duplicated documentation of field "_net"
The field "_net" was already documented above the definition of struct
wiphy. Both comments were identical.

Signed-off-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
Link: https://lore.kernel.org/r/20200221115604.594035-4-Jerome.Pouiller@silabs.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:26:58 +01:00
Jérôme Pouiller
4308d955a5 cfg80211: drop duplicated documentation of field "registered"
Field "registered" was documented three times: twice in the
documentation block of struct wiphy and once inside the struct
definition. This patch keep only one comment.

Signed-off-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
Link: https://lore.kernel.org/r/20200221115604.594035-3-Jerome.Pouiller@silabs.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:26:51 +01:00
Jérôme Pouiller
cfb99437e2 cfg80211: drop duplicated documentation of field "privid"
The field "privid" was already documented above the definition of struct
wiphy. Comments were not identical, but they said more or less the same
thing.

Signed-off-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
Link: https://lore.kernel.org/r/20200221115604.594035-2-Jerome.Pouiller@silabs.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:26:37 +01:00
Jérôme Pouiller
febc7ec6cb cfg80211: drop duplicated documentation of field "probe_resp_offload"
The field "probe_resp_offload" was already documented above the
definition of struct wiphy. Both comments were identical.

Signed-off-by: Jérôme Pouiller <jerome.pouiller@silabs.com>
Link: https://lore.kernel.org/r/20200221115604.594035-1-Jerome.Pouiller@silabs.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:26:16 +01:00
Johannes Berg
8d74a623cc Revert "nl80211: add src and dst addr attributes for control port tx/rx"
This reverts commit 8c3ed7aa2b.

As Jouni points out, there's really no need for this, since the
RSN pre-authentication frames are normal data frames, not port
control frames (locally).

We can still revert this now since it hasn't actually gone beyond
-next.

Fixes: 8c3ed7aa2b ("nl80211: add src and dst addr attributes for control port tx/rx")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200224101910.b746e263287a.I9eb15d6895515179d50964dec3550c9dc784bb93@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-24 10:22:02 +01:00
David S. Miller
b105e8e281 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2020-02-21

The following pull-request contains BPF updates for your *net-next* tree.

We've added 25 non-merge commits during the last 4 day(s) which contain
a total of 33 files changed, 2433 insertions(+), 161 deletions(-).

The main changes are:

1) Allow for adding TCP listen sockets into sock_map/hash so they can be used
   with reuseport BPF programs, from Jakub Sitnicki.

2) Add a new bpf_program__set_attach_target() helper for adding libbpf support
   to specify the tracepoint/function dynamically, from Eelco Chaudron.

3) Add bpf_read_branch_records() BPF helper which helps use cases like profile
   guided optimizations, from Daniel Xu.

4) Enable bpf_perf_event_read_value() in all tracing programs, from Song Liu.

5) Relax BTF mandatory check if only used for libbpf itself e.g. to process
   BTF defined maps, from Andrii Nakryiko.

6) Move BPF selftests -mcpu compilation attribute from 'probe' to 'v3' as it has
   been observed that former fails in envs with low memlock, from Yonghong Song.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-21 15:22:45 -08:00
David S. Miller
e65ee2fb54 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Conflict resolution of ice_virtchnl_pf.c based upon work by
Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-21 13:39:34 -08:00
Jakub Sitnicki
035ff358f2 net: Generate reuseport group ID on group creation
Commit 736b46027e ("net: Add ID (if needed) to sock_reuseport and expose
reuseport_lock") has introduced lazy generation of reuseport group IDs that
survive group resize.

By comparing the identifier we check if BPF reuseport program is not trying
to select a socket from a BPF map that belongs to a different reuseport
group than the one the packet is for.

Because SOCKARRAY used to be the only BPF map type that can be used with
reuseport BPF, it was possible to delay the generation of reuseport group
ID until a socket from the group was inserted into BPF map for the first
time.

Now that SOCK{MAP,HASH} can be used with reuseport BPF we have two options,
either generate the reuseport ID on map update, like SOCKARRAY does, or
allocate an ID from the start when reuseport group gets created.

This patch takes the latter approach to keep sockmap free of calls into
reuseport code. This streamlines the reuseport_id access as its lifetime
now matches the longevity of reuseport object.

The cost of this simplification, however, is that we allocate reuseport IDs
for all SO_REUSEPORT users. Even those that don't use SOCKARRAY in their
setups. With the way identifiers are currently generated, we can have at
most S32_MAX reuseport groups, which hopefully is sufficient. If we ever
get close to the limit, we can switch an u64 counter like sk_cookie.

Another change is that we now always call into SOCKARRAY logic to unlink
the socket from the map when unhashing or closing the socket. Previously we
did it only when at least one socket from the group was in a BPF map.

It is worth noting that this doesn't conflict with sockmap tear-down in
case a socket is in a SOCK{MAP,HASH} and belongs to a reuseport
group. sockmap tear-down happens first:

  prot->unhash
  `- tcp_bpf_unhash
     |- tcp_bpf_remove
     |  `- while (sk_psock_link_pop(psock))
     |     `- sk_psock_unlink
     |        `- sock_map_delete_from_link
     |           `- __sock_map_delete
     |              `- sock_map_unref
     |                 `- sk_psock_put
     |                    `- sk_psock_drop
     |                       `- rcu_assign_sk_user_data(sk, NULL)
     `- inet_unhash
        `- reuseport_detach_sock
           `- bpf_sk_reuseport_detach
              `- WRITE_ONCE(sk->sk_user_data, NULL)

Suggested-by: Martin Lau <kafai@fb.com>
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200218171023.844439-10-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
e80251555f tcp_bpf: Don't let child socket inherit parent protocol ops on copy
Prepare for cloning listening sockets that have their protocol callbacks
overridden by sk_msg. Child sockets must not inherit parent callbacks that
access state stored in sk_user_data owned by the parent.

Restore the child socket protocol callbacks before it gets hashed and any
of the callbacks can get invoked.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-4-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Jakub Sitnicki
f1ff5ce2cd net, sk_msg: Clear sk_user_data pointer on clone if tagged
sk_user_data can hold a pointer to an object that is not intended to be
shared between the parent socket and the child that gets a pointer copy on
clone. This is the case when sk_user_data points at reference-counted
object, like struct sk_psock.

One way to resolve it is to tag the pointer with a no-copy flag by
repurposing its lowest bit. Based on the bit-flag value we clear the child
sk_user_data pointer after cloning the parent socket.

The no-copy flag is stored in the pointer itself as opposed to externally,
say in socket flags, to guarantee that the pointer and the flag are copied
from parent to child socket in an atomic fashion. Parent socket state is
subject to change while copying, we don't hold any locks at that time.

This approach relies on an assumption that sk_user_data holds a pointer to
an object aligned at least 2 bytes. A manual audit of existing users of
rcu_dereference_sk_user_data helper confirms our assumption.

Also, an RCU-protected sk_user_data is not likely to hold a pointer to a
char value or a pathological case of "struct { char c; }". To be safe, warn
when the flag-bit is set when setting sk_user_data to catch any future
misuses.

It is worth considering why clearing sk_user_data unconditionally is not an
option. There exist users, DRBD, NVMe, and Xen drivers being among them,
that rely on the pointer being copied when cloning the listening socket.

Potentially we could distinguish these users by checking if the listening
socket has been created in kernel-space via sock_create_kern, and hence has
sk_kern_sock flag set. However, this is not the case for NVMe and Xen
drivers, which create sockets without marking them as belonging to the
kernel.

Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Link: https://lore.kernel.org/bpf/20200218171023.844439-3-jakub@cloudflare.com
2020-02-21 22:29:45 +01:00
Emmanuel Grumbach
7bb106eb68 cfg80211: remove support for adjacent channel compensation
The only driver that used that was iwlwifi and it removed
support for this. Remove the feature here as well.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Link: https://lore.kernel.org/r/20200214232336.a530de38e511.I393bc395f6037c8cca6421ed550e3072dc248aed@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-21 20:00:39 +01:00
Ilias Apalodimas
458de8a97f net: page_pool: API cleanup and comments
Functions starting with __ usually indicate those which are exported,
but should not be called directly. Update some of those declared in the
API and make it more readable.

page_pool_unmap_page() and page_pool_release_page() were doing
exactly the same thing calling __page_pool_clean_page().  Let's
rename __page_pool_clean_page() to page_pool_release_page() and
export it in order to show up on perf logs and get rid of
page_pool_unmap_page().

Finally rename __page_pool_put_page() to page_pool_put_page() since we
can now directly call it from drivers and rename the existing
page_pool_put_page() to page_pool_put_full_page() since they do the same
thing but the latter is trying to sync the full DMA area.

This patch also updates netsec, mvneta and stmmac drivers which use
those functions.

Suggested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-20 10:09:25 -08:00
Paul Blakey
7d17c544cd net: sched: Pass ingress block to tcf_classify_ingress
On ingress and cls_act qdiscs init, save the block on ingress
mini_Qdisc and and pass it on to ingress classification, so it
can be used for the looking up a specified chain index.

Co-developed-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-02-19 17:49:48 -08:00
Paul Blakey
9410c9409d net: sched: Introduce ingress classification function
TC multi chain configuration can cause offloaded tc chains to miss in
hardware after jumping to some chain. In such cases the software should
continue from the chain that missed in hardware, as the hardware may
have manipulated the packet and updated some counters.

Currently a single tcf classification function serves both ingress and
egress. However, multi chain miss processing (get tc skb extension on
hw miss, set tc skb extension on tc miss) should happen only on
ingress.

Refactor the code to use ingress classification function, and move setting
the tc skb extension from general classification to it, as a prestep
for supporting the hw miss scenario.

Co-developed-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-02-19 17:49:48 -08:00
Luiz Augusto von Dentz
1c22d3cda8 Bluetooth: RFCOMM: Use MTU auto tune logic
This reuse the L2CAP MTU auto logic to select the MTU used for RFCOMM
channels, this should increase the maximum from 1013 to 1021 when 3-DH5
is supported.

Since it does not set an L2CAP MTU we no longer need a debugfs so that
is removed.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-19 20:31:45 +01:00
Heiner Kallweit
76eeb12b71 net: core: add helper tcp_v6_gso_csum_prep
Several network drivers for chips that support TSO6 share the same code
for preparing the TCP header, so let's factor it out to a helper.
A difference is that some drivers reset the payload_len whilst others
don't do this. This value is overwritten by TSO anyway, therefore
the new helper resets it in general.

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Reviewed-by: Alexander Duyck <alexander.h.duyck@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-19 11:20:59 -08:00
Raed Salem
dda520c4d4 ESP: Export esp_output_fill_trailer function
The esp fill trailer method is identical for both
IPv6 and IPv4.

Share the implementation for esp6 and esp to avoid
code duplication in addition it could be also used
at various drivers code.

Signed-off-by: Raed Salem <raeds@mellanox.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2020-02-19 13:52:32 +01:00
Aya Levin
573ed90aa5 devlink: Force enclosing array on binary fmsg data
Add a new API for start/end binary array brackets [] to force array
around binary data as required from JSON. With this restriction, re-open
API to set binary fmsg data.

Signed-off-by: Aya Levin <ayal@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2020-02-18 19:17:28 -08:00
Jason Baron
8a9093c798 net: sched: correct flower port blocking
tc flower rules that are based on src or dst port blocking are sometimes
ineffective due to uninitialized stack data. __skb_flow_dissect() extracts
ports from the skb for tc flower to match against. However, the port
dissection is not done when when the FLOW_DIS_IS_FRAGMENT bit is set in
key_control->flags. All callers of __skb_flow_dissect(), zero-out the
key_control field except for fl_classify() as used by the flower
classifier. Thus, the FLOW_DIS_IS_FRAGMENT may be set on entry to
__skb_flow_dissect(), since key_control is allocated on the stack
and may not be initialized.

Since key_basic and key_control are present for all flow keys, let's
make sure they are initialized.

Fixes: 62230715fd ("flow_dissector: do not dissect l4 ports for fragments")
Co-developed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 21:33:28 -08:00
Vlad Buslov
b15e7a6e8d net: sched: don't take rtnl lock during flow_action setup
Refactor tc_setup_flow_action() function not to use rtnl lock and remove
'rtnl_held' argument that is no longer needed.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:17:02 -08:00
Vlad Buslov
107f2d5091 net: sched: refactor ct action helpers to require tcf_lock
In order to remove rtnl lock dependency from flow_action representation
translator, change rtnl_dereference() to rcu_dereference_protected() in ct
action helpers that provide external access to zone and action values. This
is safe to do because the functions are not called from anywhere else
outside flow_action infrastructure which was modified to obtain tcf_lock
when accessing action data in one of previous patches in the series.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:17:02 -08:00
Vlad Buslov
487e9589cc net: sched: refactor police action helpers to require tcf_lock
In order to remove rtnl lock dependency from flow_action representation
translator, change rcu_dereference_bh_rtnl() to rcu_dereference_protected()
in police action helpers that provide external access to rate and burst
values. This is safe to do because the functions are not called from
anywhere else outside flow_action infrastructure which was modified to
obtain tcf_lock when accessing action data in one of previous patches in
the series.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:17:02 -08:00
Vlad Buslov
7a47281439 net: sched: lock action when translating it to flow_action infra
In order to remove dependency on rtnl lock, take action's tcfa_lock when
constructing its representation as flow_action_entry structure.

Refactor tcf_sample_get_group() to assume that caller holds tcf_lock and
don't take it manually. This callback is only called from flow_action infra
representation translator which now calls it with tcf_lock held, so this
refactoring is necessary to prevent deadlock.

Allocate memory with GFP_ATOMIC flag for ip_tunnel_info copy because
tcf_tunnel_info_copy() is only called from flow_action representation infra
code with tcf_lock spinlock taken.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-17 14:17:02 -08:00
Randy Dunlap
66256e0b15 net/sock.h: fix all kernel-doc warnings
Fix all kernel-doc warnings for <net/sock.h>.
Fixes these warnings:

../include/net/sock.h:232: warning: Function parameter or member 'skc_addrpair' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_portpair' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_ipv6only' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_net_refcnt' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_v6_daddr' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_v6_rcv_saddr' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_cookie' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_listener' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_tw_dr' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_rcv_wnd' not described in 'sock_common'
../include/net/sock.h:232: warning: Function parameter or member 'skc_tw_rcv_nxt' not described in 'sock_common'

../include/net/sock.h:498: warning: Function parameter or member 'sk_rx_skb_cache' not described in 'sock'
../include/net/sock.h:498: warning: Function parameter or member 'sk_wq_raw' not described in 'sock'
../include/net/sock.h:498: warning: Function parameter or member 'tcp_rtx_queue' not described in 'sock'
../include/net/sock.h:498: warning: Function parameter or member 'sk_tx_skb_cache' not described in 'sock'
../include/net/sock.h:498: warning: Function parameter or member 'sk_route_forced_caps' not described in 'sock'
../include/net/sock.h:498: warning: Function parameter or member 'sk_txtime_report_errors' not described in 'sock'
../include/net/sock.h:498: warning: Function parameter or member 'sk_validate_xmit_skb' not described in 'sock'
../include/net/sock.h:498: warning: Function parameter or member 'sk_bpf_storage' not described in 'sock'

../include/net/sock.h:2024: warning: No description found for return value of 'sk_wmem_alloc_get'
../include/net/sock.h:2035: warning: No description found for return value of 'sk_rmem_alloc_get'
../include/net/sock.h:2046: warning: No description found for return value of 'sk_has_allocations'
../include/net/sock.h:2082: warning: No description found for return value of 'skwq_has_sleeper'
../include/net/sock.h:2244: warning: No description found for return value of 'sk_page_frag'
../include/net/sock.h:2444: warning: Function parameter or member 'tcp_rx_skb_cache_key' not described in 'DECLARE_STATIC_KEY_FALSE'
../include/net/sock.h:2444: warning: Excess function parameter 'sk' description in 'DECLARE_STATIC_KEY_FALSE'
../include/net/sock.h:2444: warning: Excess function parameter 'skb' description in 'DECLARE_STATIC_KEY_FALSE'

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:43:06 -08:00
David S. Miller
ddb535a6a0 A few big new things:
* 802.11 frame encapsulation offload support
  * more HE (802.11ax) support, including some for 6 GHz band
  * powersave in hwsim, for better testing
 
 Of course as usual there are various cleanups and small fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl5GggsACgkQB8qZga/f
 l8Qgdg//R42bSv94JYPcwZ5phgTgraCRZODWjBJq08n2T5m0EmEufgX79d9uEdgT
 u9npvn+ich5/VZhmuSbGrW9TT6/FPLAZyghV1fj79o971Qd7ky2Mp8G1fcTEbtDn
 IG2e9vauY9XDSb2O3wNj8dA8rAN/kLNmhsPqWxn2CgLPqjdbf+W15dvo4rnaL2gs
 ffGyE47dHuAFwCruyT8UPbw3iu4+tQhruN9eVg+UkU8rJGvEMqfrLK20zl1weIV9
 a7IuXdxacdsHO8Y+tl6GtvgOURQPpvf55+leLOUhcmHPJ3f/eAal6wmWRxDxs/qB
 IWSe8BC81cZZ5pYWk1A+0sXfJMlYjNsN0xw5SQRSrbgyb5saz8aLUIlHsOBM4iPH
 SwzCMN5A1GOPOUFsugzPwbiki9g6dh0/EC2NyXE4A26CAd967dVXTvTY5SMNgiB+
 bZaaUDaPQUm1jgDT5bLRhTipTHbekDkYzG/e+wNO+HKyStoEYM485MwY4MQCYzEh
 HKDmkAbFuCwEUeXXw1y8GybUknApCRru9FtY+oiN/+y/aESfB7HJfmDFFU/KYgPu
 HOuqJoNAxdMdycDCb24/cLjUiehzfM6sujwBxZOD5WHhAcXrBo5dGd6ibfurIrjj
 XI36/mwTiMtyyb0/5xM1AKvoic2j+a5YU3MB7KSc9TlaPa5j2NA=
 =CgmJ
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-net-next-2020-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
A few big new things:
 * 802.11 frame encapsulation offload support
 * more HE (802.11ax) support, including some for 6 GHz band
 * powersave in hwsim, for better testing

Of course as usual there are various cleanups and small fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-16 19:00:22 -08:00
Luiz Augusto von Dentz
eab2404ba7 Bluetooth: Add BT_PHY socket option
This adds BT_PHY socket option (read-only) which can be used to read
the PHYs in use by the underline connection.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-02-14 19:15:30 +01:00
David S. Miller
b32cb6fcf1 Just a few fixes:
* avoid running out of tracking space for frames that need
    to be reported to userspace by using more bits
  * fix beacon handling suppression by adding some relevant
    elements to the CRC calculation
  * fix quiet mode in action frames
  * fix crash in ethtool for virt_wifi and similar
  * add a missing policy entry
  * fix 160 & 80+80 bandwidth to take local capabilities into
    account
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl5GZQ8ACgkQB8qZga/f
 l8TQYQ//fcrDPndtwJWE5IaXJq7OmnM2NsYED6r1gVK7CWeVhqfwES6x9LVbhW/B
 QPwqFfL3i2II8XFVG/+BCUcEhp5FZQKGEDXaYcWXqQCuc6mU7ZKBuaEJmznMoez1
 6b96vdkVa78QnNK0QjSjAMua9KaBCIS2bPP/0E5o/8T9YsRKhuC4ItpLHYAGLsfM
 1QzU8ZBr/Ug4XIx28lDYOs8MHF5jm8dUgSpB4WgSQUKGTmgvyO/SefGtaJQ36R9B
 tAVR3Mmstm00ufsXT3OhiqXgN6bjUENvM+osVjD6E1eqjtPlrUYTCO8b/ZC1wean
 4G077eE186uao+9xQ8cLk7sL6ZJnlRwJ15RdBB5bsLZ8nG2FpclKCMxpcxc4PL/M
 oIXjEU8uNhSAVDmI3c2k/sxRW0hdQoblgl8PQMZ3k/6nvpwnIrG8yoeAwP4B457Y
 vgiRE212YkOaXhtwrg1EuOIhk1N3YRG8DpSdDrKlj+y0LOijTrHCnqoUwnne+RpX
 2Zqm5Sz3R2Nel9iIVs1JOxIGvyg+KIoM+WjmmrjMSqlrvMMisq8O3BuMfgd6jd+t
 NuZE9MkukBPMiaB4QDMDgBcaTMrLRItD+SSSQcMvkORTfGrpjUhpiQTVwOPXCYqZ
 793BU8kI+gwrz9UfloYxSv2+2H12a6LDQY6Q/mJVKAuhdw6SXsE=
 =L+Ik
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2020-02-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just a few fixes:
 * avoid running out of tracking space for frames that need
   to be reported to userspace by using more bits
 * fix beacon handling suppression by adding some relevant
   elements to the CRC calculation
 * fix quiet mode in action frames
 * fix crash in ethtool for virt_wifi and similar
 * add a missing policy entry
 * fix 160 & 80+80 bandwidth to take local capabilities into
   account
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-14 07:16:08 -08:00
John Crispin
1f6e0baa70 mac80211: allow setting queue_len for drivers not using wake_tx_queue
Currently a mac80211 driver can only set the txq_limit when using
wake_tx_queue. Not all drivers use wake_tx_queue. This patch adds a new
element to wiphy allowing a driver to set a custom tx_queue_len and the
code that will apply it in case it is set. The current default is
1000 which is too low for ath11k when doing HE rates.

Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20200211122605.13002-1-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-14 09:59:35 +01:00
Ben Greear
db6d9e9e8b mac80211: Fix setting txpower to zero
With multiple VIFS ath10k, and probably others, tries to find the
minimum txpower for all vifs and uses that when setting txpower in
the firmware.

If a second vif is added and starts to scan, it's txpower is not
initialized yet and it set to zero.

ath10k had a patch to ignore zero values, but then it is impossible
to actually set txpower to zero.

So, instead initialize the txpower to INT_MIN in mac80211, and let
drivers know that means the power has not been set and so should
be ignored.

This should fix regression in:

commit 88407beb1b
Author: Ryan Hsu <ryanhsu@qca.qualcomm.com>
Date:   Tue Dec 13 14:55:19 2016 -0800

    ath10k: fix incorrect txpower set by P2P_DEVICE interface

Tested on ath10k 9984 with ath10k-ct firmware.

Signed-off-by: Ben Greear <greearb@candelatech.com>
Link: https://lore.kernel.org/r/20191217183057.24586-1-greearb@candelatech.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-14 09:57:00 +01:00
Jason A. Donenfeld
0b41713b60 icmp: introduce helper for nat'd source address in network device context
This introduces a helper function to be called only by network drivers
that wraps calls to icmp[v6]_send in a conntrack transformation, in case
NAT has been used. We don't want to pollute the non-driver path, though,
so we introduce this as a helper to be called by places that actually
make use of this, as suggested by Florian.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-13 14:19:00 -08:00
Hangbin Liu
6ee2deb6fb net/flow_dissector: remove unexist field description
@thoff has moved to struct flow_dissector_key_control.

Fixes: 42aecaa9bb ("net: Get skb hash over flow_keys structure")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-13 14:13:27 -08:00
Markus Theil
8c3ed7aa2b nl80211: add src and dst addr attributes for control port tx/rx
When using control port over nl80211 in AP mode with
pre-authentication, APs need to forward frames to other
APs defined by their MAC address. Before this patch,
pre-auth frames reaching user space over nl80211 control
port  have no longer any information about the dest attached,
which can be used for forwarding to a controller or injecting
the frame back to a ethernet interface over a AF_PACKET
socket.
Analog problems exist, when forwarding pre-auth frames from
AP -> STA.

This patch therefore adds the NL80211_ATTR_DST_MAC and
NL80211_ATTR_SRC_MAC attributes to provide more context
information when forwarding.
The respective arguments are optional on tx and included on rx.
Therefore unaware existing software is not affected.

Software which wants to detect this feature, can do so
by checking against:
  NL80211_EXT_FEATURE_CONTROL_PORT_OVER_NL80211_MAC_ADDRS

Signed-off-by: Markus Theil <markus.theil@tu-ilmenau.de>
Link: https://lore.kernel.org/r/20200115125522.3755-1-markus.theil@tu-ilmenau.de
[split into separate cfg80211/mac80211 patches]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:58:37 +01:00
Veerendranath Jakkam
d6039a3416 cfg80211: Enhance the AKM advertizement to support per interface.
Commit ab4dfa2053 ("cfg80211: Allow drivers to advertise supported AKM
suites") introduces the support to advertize supported AKMs to userspace.

This needs an enhancement to advertize the AKM support per interface type,
specifically for the cfg80211-based drivers that implement SME and use
different mechanisms to support the AKM's for each interface type (e.g.,
the support for SAE, OWE AKM's take different paths for such drivers on
STA/AP mode).

This commit aims the same and enhances the earlier mechanism of advertizing
the AKMs per wiphy. Add new nl80211 attributes and data structure to
provide supported AKMs per interface type to userspace.

the AKMs advertized in akm_suites are default capabilities if not
advertized for a specific interface type in iftype_akm_suites.

Signed-off-by: Veerendranath Jakkam <vjakkam@codeaurora.org>
Link: https://lore.kernel.org/r/20200126203032.21934-1-vjakkam@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:34:26 +01:00
Haim Dreyfuss
1e61d82cca cfg80211: add no HE indication to the channel flag
The regulatory domain might forbid HE operation.  Certain regulatory
domains may restrict it for specific channels whereas others may do it
for the whole regulatory domain.

Add an option to indicate it in the channel flag.

Signed-off-by: Haim Dreyfuss <haim.dreyfuss@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Link: https://lore.kernel.org/r/20200121081213.733757-1-luca@coelho.fi
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 12:34:09 +01:00
Johannes Berg
f2b18baca9 mac80211: use more bits for ack_frame_id
It turns out that this wasn't a good idea, I hit a test failure in
hwsim due to this. That particular failure was easily worked around,
but it raised questions: if an AP needs to, for example, send action
frames to each connected station, the current limit is nowhere near
enough (especially if those stations are sleeping and the frames are
queued for a while.)

Shuffle around some bits to make more room for ack_frame_id to allow
up to 8192 queued up frames, that's enough for queueing 4 frames to
each connected station, even at the maximum of 2007 stations on a
single AP.

We take the bits from band (which currently only 2 but I leave 3 in
case we add another band) and from the hw_queue, which can only need
4 since it has a limit of 16 queues.

Fixes: 6912daed05 ("mac80211: Shrink the size of ack_frame_id to make room for tx_time_est")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20200115122549.b9a4ef9f4980.Ied52ed90150220b83a280009c590b65d125d087c@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-02-07 11:32:27 +01:00
Eric Dumazet
38f88c4540 bonding/alb: properly access headers in bond_alb_xmit()
syzbot managed to send an IPX packet through bond_alb_xmit()
and af_packet and triggered a use-after-free.

First, bond_alb_xmit() was using ipx_hdr() helper to reach
the IPX header, but ipx_hdr() was using the transport offset
instead of the network offset. In the particular syzbot
report transport offset was 0xFFFF

This patch removes ipx_hdr() since it was only (mis)used from bonding.

Then we need to make sure IPv4/IPv6/IPX headers are pulled
in skb->head before dereferencing anything.

BUG: KASAN: use-after-free in bond_alb_xmit+0x153a/0x1590 drivers/net/bonding/bond_alb.c:1452
Read of size 2 at addr ffff8801ce56dfff by task syz-executor.2/18108
 (if (ipx_hdr(skb)->ipx_checksum != IPX_NO_CHECKSUM) ...)

Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 [<ffffffff8441fc42>] __dump_stack lib/dump_stack.c:17 [inline]
 [<ffffffff8441fc42>] dump_stack+0x14d/0x20b lib/dump_stack.c:53
 [<ffffffff81a7dec4>] print_address_description+0x6f/0x20b mm/kasan/report.c:282
 [<ffffffff81a7e0ec>] kasan_report_error mm/kasan/report.c:380 [inline]
 [<ffffffff81a7e0ec>] kasan_report mm/kasan/report.c:438 [inline]
 [<ffffffff81a7e0ec>] kasan_report.cold+0x8c/0x2a0 mm/kasan/report.c:422
 [<ffffffff81a7dc4f>] __asan_report_load_n_noabort+0xf/0x20 mm/kasan/report.c:469
 [<ffffffff82c8c00a>] bond_alb_xmit+0x153a/0x1590 drivers/net/bonding/bond_alb.c:1452
 [<ffffffff82c60c74>] __bond_start_xmit drivers/net/bonding/bond_main.c:4199 [inline]
 [<ffffffff82c60c74>] bond_start_xmit+0x4f4/0x1570 drivers/net/bonding/bond_main.c:4224
 [<ffffffff83baa558>] __netdev_start_xmit include/linux/netdevice.h:4525 [inline]
 [<ffffffff83baa558>] netdev_start_xmit include/linux/netdevice.h:4539 [inline]
 [<ffffffff83baa558>] xmit_one net/core/dev.c:3611 [inline]
 [<ffffffff83baa558>] dev_hard_start_xmit+0x168/0x910 net/core/dev.c:3627
 [<ffffffff83bacf35>] __dev_queue_xmit+0x1f55/0x33b0 net/core/dev.c:4238
 [<ffffffff83bae3a8>] dev_queue_xmit+0x18/0x20 net/core/dev.c:4278
 [<ffffffff84339189>] packet_snd net/packet/af_packet.c:3226 [inline]
 [<ffffffff84339189>] packet_sendmsg+0x4919/0x70b0 net/packet/af_packet.c:3252
 [<ffffffff83b1ac0c>] sock_sendmsg_nosec net/socket.c:673 [inline]
 [<ffffffff83b1ac0c>] sock_sendmsg+0x12c/0x160 net/socket.c:684
 [<ffffffff83b1f5a2>] __sys_sendto+0x262/0x380 net/socket.c:1996
 [<ffffffff83b1f700>] SYSC_sendto net/socket.c:2008 [inline]
 [<ffffffff83b1f700>] SyS_sendto+0x40/0x60 net/socket.c:2004

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Jay Vosburgh <j.vosburgh@gmail.com>
Cc: Veaceslav Falico <vfalico@gmail.com>
Cc: Andy Gospodarek <andy@greyhouse.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-02-05 14:28:09 +01:00
Geert Uytterhoeven
31484d56ca mptcp: Fix undefined mptcp_handle_ipv6_mapped for modular IPV6
If CONFIG_MPTCP=y, CONFIG_MPTCP_IPV6=n, and CONFIG_IPV6=m:

    ERROR: "mptcp_handle_ipv6_mapped" [net/ipv6/ipv6.ko] undefined!

This does not happen if CONFIG_MPTCP_IPV6=y, as CONFIG_MPTCP_IPV6
selects CONFIG_IPV6, and thus forces CONFIG_IPV6 builtin.

As exporting a symbol for an empty function would be a bit wasteful, fix
this by providing a dummy version of mptcp_handle_ipv6_mapped() for the
CONFIG_MPTCP_IPV6=n case.

Rename mptcp_handle_ipv6_mapped() to mptcpv6_handle_mapped(), to make it
clear this is a pure-IPV6 function, just like mptcpv6_init().

Fixes: cec37a6e41 ("mptcp: Handle MP_CAPABLE options for outgoing connections")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-30 10:55:54 +01:00
Willem de Bruijn
d0208bf4da udp: document udp_rcv_segment special case for looped packets
Commit 6cd021a58c ("udp: segment looped gso packets correctly")
fixes an issue with rare udp gso multicast packets looped onto the
receive path.

The stable backport makes the narrowest change to target only these
packets, when needed. As opposed to, say, expanding __udp_gso_segment,
which is harder to reason to be free from unintended side-effects.

But the resulting code is hardly self-describing.
Document its purpose and rationale.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-30 09:59:53 +01:00
Willem de Bruijn
6cd021a58c udp: segment looped gso packets correctly
Multicast and broadcast packets can be looped from egress to ingress
pre segmentation with dev_loopback_xmit. That function unconditionally
sets ip_summed to CHECKSUM_UNNECESSARY.

udp_rcv_segment segments gso packets in the udp rx path. Segmentation
usually executes on egress, and does not expect packets of this type.
__udp_gso_segment interprets !CHECKSUM_PARTIAL as CHECKSUM_NONE. But
the offsets are not correct for gso_make_checksum.

UDP GSO packets are of type CHECKSUM_PARTIAL, with their uh->check set
to the correct pseudo header checksum. Reset ip_summed to this type.
(CHECKSUM_PARTIAL is allowed on ingress, see comments in skbuff.h)

Reported-by: syzbot <syzkaller@googlegroups.com>
Fixes: cf329aa42b ("udp: cope with UDP GRO packet misdirection")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-28 10:56:51 +01:00
Vasundhara Volam
41c0d917d1 devlink: add macro for "fw.roce"
Add definition and documentation for the new generic info "fw.roce".

v2: Remove board.nvm_cfg since fw.psid is similar.

Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:33:29 +01:00
David S. Miller
c4c57b974d Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2020-01-26

Here's (probably) the last bluetooth-next pull request for the 5.6 kernel.

 - Initial pieces of Bluetooth 5.2 Isochronous Channels support
 - mgmt: Various cleanups and a new Set Blocked Keys command
 - btusb: Added support for 04ca:3021 QCA_ROME device
 - hci_qca: Multiple fixes & cleanups
 - hci_bcm: Fixes & improved device tree support
 - Fixed attempts to create duplicate debugfs entries

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:24:46 +01:00
Steffen Klassert
9fd1ff5d2a udp: Support UDP fraglist GRO/GSO.
This patch extends UDP GRO to support fraglist GRO/GSO
by using the previously introduced infrastructure.
If the feature is enabled, all UDP packets are going to
fraglist GRO (local input and forward).

After validating the csum,  we mark ip_summed as
CHECKSUM_UNNECESSARY for fraglist GRO packets to
make sure that the csum is not touched.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 11:00:21 +01:00
Cong Wang
2e24cd7555 net_sched: fix ops->bind_class() implementations
The current implementations of ops->bind_class() are merely
searching for classid and updating class in the struct tcf_result,
without invoking either of cl_ops->bind_tcf() or
cl_ops->unbind_tcf(). This breaks the design of them as qdisc's
like cbq use them to count filters too. This is why syzbot triggered
the warning in cbq_destroy_class().

In order to fix this, we have to call cl_ops->bind_tcf() and
cl_ops->unbind_tcf() like the filter binding path. This patch does
so by refactoring out two helper functions __tcf_bind_filter()
and __tcf_unbind_filter(), which are lockless and accept a Qdisc
pointer, then teaching each implementation to call them correctly.

Note, we merely pass the Qdisc pointer as an opaque pointer to
each filter, they only need to pass it down to the helper
functions without understanding it at all.

Fixes: 07d79fc7d9 ("net_sched: add reverse binding for tc class")
Reported-and-tested-by: syzbot+0a0596220218fcb603a8@syzkaller.appspotmail.com
Reported-and-tested-by: syzbot+63bdb6006961d8c917c6@syzkaller.appspotmail.com
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-27 10:51:43 +01:00
Stefano Brivio
3c4287f620 nf_tables: Add set type for arbitrary concatenation of ranges
This new set type allows for intervals in concatenated fields,
which are expressed in the usual way, that is, simple byte
concatenation with padding to 32 bits for single fields, and
given as ranges by specifying start and end elements containing,
each, the full concatenation of start and end values for the
single fields.

Ranges are expanded to composing netmasks, for each field: these
are inserted as rules in per-field lookup tables. Bits to be
classified are divided in 4-bit groups, and for each group, the
lookup table contains 4^2 buckets, representing all the possible
values of a bit group. This approach was inspired by the Grouper
algorithm:
	http://www.cse.usf.edu/~ligatti/projects/grouper/

Matching is performed by a sequence of AND operations between
bucket values, with buckets selected according to the value of
packet bits, for each group. The result of this sequence tells
us which rules matched for a given field.

In order to concatenate several ranged fields, per-field rules
are mapped using mapping arrays, one per field, that specify
which rules should be considered while matching the next field.
The mapping array for the last field contains a reference to
the element originally inserted.

The notes in nft_set_pipapo.c cover the algorithm in deeper
detail.

A pure hash-based approach is of no use here, as ranges need
to be classified. An implementation based on "proxying" the
existing red-black tree set type, creating a tree for each
field, was considered, but deemed impractical due to the fact
that elements would need to be shared between trees, at least
as long as we want to keep UAPI changes to a minimum.

A stand-alone implementation of this algorithm is available at:
	https://pipapo.lameexcu.se
together with notes about possible future optimisations
(in pipapo.c).

This algorithm was designed with data locality in mind, and can
be highly optimised for SIMD instruction sets, as the bulk of
the matching work is done with repetitive, simple bitwise
operations.

At this point, without further optimisations, nft_concat_range.sh
reports, for one AMD Epyc 7351 thread (2.9GHz, 512 KiB L1D$, 8 MiB
L2$):

TEST: performance
  net,port                                                      [ OK ]
    baseline (drop from netdev hook):              10190076pps
    baseline hash (non-ranged entries):             6179564pps
    baseline rbtree (match on first field only):    2950341pps
    set with  1000 full, ranged entries:            2304165pps
  port,net                                                      [ OK ]
    baseline (drop from netdev hook):              10143615pps
    baseline hash (non-ranged entries):             6135776pps
    baseline rbtree (match on first field only):    4311934pps
    set with   100 full, ranged entries:            4131471pps
  net6,port                                                     [ OK ]
    baseline (drop from netdev hook):               9730404pps
    baseline hash (non-ranged entries):             4809557pps
    baseline rbtree (match on first field only):    1501699pps
    set with  1000 full, ranged entries:            1092557pps
  port,proto                                                    [ OK ]
    baseline (drop from netdev hook):              10812426pps
    baseline hash (non-ranged entries):             6929353pps
    baseline rbtree (match on first field only):    3027105pps
    set with 30000 full, ranged entries:             284147pps
  net6,port,mac                                                 [ OK ]
    baseline (drop from netdev hook):               9660114pps
    baseline hash (non-ranged entries):             3778877pps
    baseline rbtree (match on first field only):    3179379pps
    set with    10 full, ranged entries:            2082880pps
  net6,port,mac,proto                                           [ OK ]
    baseline (drop from netdev hook):               9718324pps
    baseline hash (non-ranged entries):             3799021pps
    baseline rbtree (match on first field only):    1506689pps
    set with  1000 full, ranged entries:             783810pps
  net,mac                                                       [ OK ]
    baseline (drop from netdev hook):              10190029pps
    baseline hash (non-ranged entries):             5172218pps
    baseline rbtree (match on first field only):    2946863pps
    set with  1000 full, ranged entries:            1279122pps

v4:
 - fix build for 32-bit architectures: 64-bit division needs
   div_u64() (kbuild test robot <lkp@intel.com>)
v3:
 - rework interface for field length specification,
   NFT_SET_SUBKEY disappears and information is stored in
   description
 - remove scratch area to store closing element of ranges,
   as elements now come with an actual attribute to specify
   the upper range limit (Pablo Neira Ayuso)
 - also remove pointer to 'start' element from mapping table,
   closing key is now accessible via extension data
 - use bytes right away instead of bits for field lengths,
   this way we can also double the inner loop of the lookup
   function to take care of upper and lower bits in a single
   iteration (minor performance improvement)
 - make it clearer that set operations are actually atomic
   API-wise, but we can't e.g. implement flush() as one-shot
   action
 - fix type for 'dup' in nft_pipapo_insert(), check for
   duplicates only in the next generation, and in general take
   care of differentiating generation mask cases depending on
   the operation (Pablo Neira Ayuso)
 - report C implementation matching rate in commit message, so
   that AVX2 implementation can be compared (Pablo Neira Ayuso)
v2:
 - protect access to scratch maps in nft_pipapo_lookup() with
   local_bh_disable/enable() (Florian Westphal)
 - drop rcu_read_lock/unlock() from nft_pipapo_lookup(), it's
   already implied (Florian Westphal)
 - explain why partial allocation failures don't need handling
   in pipapo_realloc_scratch(), rename 'm' to clone and update
   related kerneldoc to make it clear we're not operating on
   the live copy (Florian Westphal)
 - add expicit check for priv->start_elem in
   nft_pipapo_insert() to avoid ending up in nft_pipapo_walk()
   with a NULL start element, and also zero it out in every
   operation that might make it invalid, so that insertion
   doesn't proceed with an invalid element (Florian Westphal)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27 08:54:30 +01:00
Stefano Brivio
f3a2181e16 netfilter: nf_tables: Support for sets with multiple ranged fields
Introduce a new nested netlink attribute, NFTA_SET_DESC_CONCAT, used
to specify the length of each field in a set concatenation.

This allows set implementations to support concatenation of multiple
ranged items, as they can divide the input key into matching data for
every single field. Such set implementations would be selected as
they specify support for NFT_SET_INTERVAL and allow desc->field_count
to be greater than one. Explicitly disallow this for nft_set_rbtree.

In order to specify the interval for a set entry, userspace would
include in NFTA_SET_DESC_CONCAT attributes field lengths, and pass
range endpoints as two separate keys, represented by attributes
NFTA_SET_ELEM_KEY and NFTA_SET_ELEM_KEY_END.

While at it, export the number of 32-bit registers available for
packet matching, as nftables will need this to know the maximum
number of field lengths that can be specified.

For example, "packets with an IPv4 address between 192.0.2.0 and
192.0.2.42, with destination port between 22 and 25", can be
expressed as two concatenated elements:

  NFTA_SET_ELEM_KEY:            192.0.2.0 . 22
  NFTA_SET_ELEM_KEY_END:        192.0.2.42 . 25

and NFTA_SET_DESC_CONCAT attribute would contain:

  NFTA_LIST_ELEM
    NFTA_SET_FIELD_LEN:		4
  NFTA_LIST_ELEM
    NFTA_SET_FIELD_LEN:		2

v4: No changes
v3: Complete rework, NFTA_SET_DESC_CONCAT instead of NFTA_SET_SUBKEY
v2: No changes

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27 08:54:30 +01:00
Pablo Neira Ayuso
7b225d0b5c netfilter: nf_tables: add NFTA_SET_ELEM_KEY_END attribute
Add NFTA_SET_ELEM_KEY_END attribute to convey the closing element of the
interval between kernel and userspace.

This patch also adds the NFT_SET_EXT_KEY_END extension to store the
closing element value in this interval.

v4: No changes
v3: New patch

[sbrivio: refactor error paths and labels; add corresponding
  nft_set_ext_type for new key; rebase]
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-27 08:54:30 +01:00
David S. Miller
4d8773b68e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor conflict in mlx5 because changes happened to code that has
moved meanwhile.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-26 10:40:21 +01:00
Petr Machata
ef6aadcc76 net: sched: Make TBF Qdisc offloadable
Invoke ndo_setup_tc as appropriate to signal init / replacement, destroying
and dumping of TBF Qdisc.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 10:56:31 +01:00
Florian Westphal
e42f1ac626 mptcp: do not inherit inet proto ops
We need to initialise the struct ourselves, else we expose tcp-specific
callbacks such as tcp_splice_read which will then trigger splat because
the socket is an mptcp one:

BUG: KASAN: slab-out-of-bounds in tcp_mstamp_refresh+0x80/0xa0 net/ipv4/tcp_output.c:57
Write of size 8 at addr ffff888116aa21d0 by task syz-executor.0/5478

CPU: 1 PID: 5478 Comm: syz-executor.0 Not tainted 5.5.0-rc6 #3
Call Trace:
 tcp_mstamp_refresh+0x80/0xa0 net/ipv4/tcp_output.c:57
 tcp_rcv_space_adjust+0x72/0x7f0 net/ipv4/tcp_input.c:612
 tcp_read_sock+0x622/0x990 net/ipv4/tcp.c:1674
 tcp_splice_read+0x20b/0xb40 net/ipv4/tcp.c:791
 do_splice+0x1259/0x1560 fs/splice.c:1205

To prevent build error with ipv6, add the recv/sendmsg function
declaration to ipv6.h.  The functions are already accessible "thanks"
to retpoline related work, but they are currently only made visible
by socket.c specific INDIRECT_CALLABLE macros.

Reported-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-25 08:14:39 +01:00
Pablo Neira Ayuso
eb014de4fd netfilter: nf_tables: autoload modules from the abort path
This patch introduces a list of pending module requests. This new module
list is composed of nft_module_request objects that contain the module
name and one status field that tells if the module has been already
loaded (the 'done' field).

In the first pass, from the preparation phase, the netlink command finds
that a module is missing on this list. Then, a module request is
allocated and added to this list and nft_request_module() returns
-EAGAIN. This triggers the abort path with the autoload parameter set on
from nfnetlink, request_module() is called and the module request enters
the 'done' state. Since the mutex is released when loading modules from
the abort phase, the module list is zapped so this is iteration occurs
over a local list. Therefore, the request_module() calls happen when
object lists are in consistent state (after fulling aborting the
transaction) and the commit list is empty.

On the second pass, the netlink command will find that it already tried
to load the module, so it does not request it again and
nft_request_module() returns 0. Then, there is a look up to find the
object that the command was missing. If the module was successfully
loaded, the command proceeds normally since it finds the missing object
in place, otherwise -ENOENT is reported to userspace.

This patch also updates nfnetlink to include the reason to enter the
abort phase, which is required for this new autoload module rationale.

Fixes: ec7470b834 ("netfilter: nf_tables: store transaction list locally while requesting module")
Reported-by: syzbot+29125d208b3dae9a7019@syzkaller.appspotmail.com
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-24 20:54:29 +01:00
Christoph Paasch
cc7972ea19 mptcp: parse and emit MP_CAPABLE option according to v1 spec
This implements MP_CAPABLE options parsing and writing according
to RFC 6824 bis / RFC 8684: MPTCP v1.

Local key is sent on syn/ack, and both keys are sent on 3rd ack.
MP_CAPABLE messages len are updated accordingly. We need the skbuff to
correctly emit the above, so we push the skbuff struct as an argument
all the way from tcp code to the relevant mptcp callbacks.

When processing incoming MP_CAPABLE + data, build a full blown DSS-like
map info, to simplify later processing.  On child socket creation, we
need to record the remote key, if available.

Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:08 +01:00
Mat Martineau
648ef4b886 mptcp: Implement MPTCP receive path
Parses incoming DSS options and populates outgoing MPTCP ACK
fields. MPTCP fields are parsed from the TCP option header and placed in
an skb extension, allowing the upper MPTCP layer to access MPTCP
options after the skb has gone through the TCP stack.

The subflow implements its own data_ready() ops, which ensures that the
pending data is in sequence - according to MPTCP seq number - dropping
out-of-seq skbs. The DATA_READY bit flag is set if this is the case.
This allows the MPTCP socket layer to determine if more data is
available without having to consult the individual subflows.

It additionally validates the current mapping and propagates EoF events
to the connection socket.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Mat Martineau
6d0060f600 mptcp: Write MPTCP DSS headers to outgoing data packets
Per-packet metadata required to write the MPTCP DSS option is written to
the skb_ext area. One write to the socket may contain more than one
packet of data, which is copied to page fragments and mapped in to MPTCP
DSS segments with size determined by the available page fragments and
the maximum mapping length allowed by the MPTCP specification. If
do_tcp_sendpages() splits a DSS segment in to multiple skbs, that's ok -
the later skbs can either have duplicated DSS mapping information or
none at all, and the receiver can handle that.

The current implementation uses the subflow frag cache and tcp
sendpages to avoid excessive code duplication. More work is required to
ensure that it works correctly under memory pressure and to support
MPTCP-level retransmissions.

The MPTCP DSS checksum is not yet implemented.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Peter Krystad
cec37a6e41 mptcp: Handle MP_CAPABLE options for outgoing connections
Add hooks to tcp_output.c to add MP_CAPABLE to an outgoing SYN request,
to capture the MP_CAPABLE in the received SYN-ACK, to add MP_CAPABLE to
the final ACK of the three-way handshake.

Use the .sk_rx_dst_set() handler in the subflow proto to capture when the
responding SYN-ACK is received and notify the MPTCP connection layer.

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Peter Krystad
eda7acddf8 mptcp: Handle MPTCP TCP options
Add hooks to parse and format the MP_CAPABLE option.

This option is handled according to MPTCP version 0 (RFC6824).
MPTCP version 1 MP_CAPABLE (RFC6824bis/RFC8684) will be added later in
coordination with related code changes.

Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Mat Martineau
f870fa0b57 mptcp: Add MPTCP socket stubs
Implements the infrastructure for MPTCP sockets.

MPTCP sockets open one in-kernel TCP socket per subflow. These subflow
sockets are only managed by the MPTCP socket that owns them and are not
visible from userspace. This commit allows a userspace program to open
an MPTCP socket with:

  sock = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

The resulting socket is simply a wrapper around a single regular TCP
socket, without any of the MPTCP protocol implemented over the wire.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-24 13:44:07 +01:00
Mohit P. Tahiliani
ec97ecf1eb net: sched: add Flow Queue PIE packet scheduler
Principles:
  - Packets are classified on flows.
  - This is a Stochastic model (as we use a hash, several flows might
                                be hashed to the same slot)
  - Each flow has a PIE managed queue.
  - Flows are linked onto two (Round Robin) lists,
    so that new flows have priority on old ones.
  - For a given flow, packets are not reordered.
  - Drops during enqueue only.
  - ECN capability is off by default.
  - ECN threshold (if ECN is enabled) is at 10% by default.
  - Uses timestamps to calculate queue delay by default.

Usage:
tc qdisc ... fq_pie [ limit PACKETS ] [ flows NUMBER ]
                    [ target TIME ] [ tupdate TIME ]
                    [ alpha NUMBER ] [ beta NUMBER ]
                    [ quantum BYTES ] [ memory_limit BYTES ]
                    [ ecnprob PERCENTAGE ] [ [no]ecn ]
                    [ [no]bytemode ] [ [no_]dq_rate_estimator ]

defaults:
  limit: 10240 packets, flows: 1024
  target: 15 ms, tupdate: 15 ms (in jiffies)
  alpha: 1/8, beta : 5/4
  quantum: device MTU, memory_limit: 32 Mb
  ecnprob: 10%, ecn: off
  bytemode: off, dq_rate_estimator: off

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Sachin D. Patil <sdp.sachin@gmail.com>
Signed-off-by: V. Saicharan <vsaicharan1998@gmail.com>
Signed-off-by: Mohit Bhasi <mohitbhasi1998@gmail.com>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 11:38:31 +01:00
Mohit P. Tahiliani
5205ea00cd net: sched: pie: export symbols to be reused by FQ-PIE
This patch makes the drop_early(), calculate_probability() and
pie_process_dequeue() functions generic enough to be used by
both PIE and FQ-PIE (to be added in a future commit). The major
change here is in the way the functions take in arguments. This
patch exports these functions and makes FQ-PIE dependent on
sch_pie.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 11:38:31 +01:00
Mohit P. Tahiliani
b42a3d7c7c pie: improve comments and commenting style
Improve the comments along with the commenting style used to
describe the members of the structures and their initial values
in the init functions.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 11:38:31 +01:00
Mohit P. Tahiliani
2dfb1952a9 pie: rearrange structure members and their initializations
Rearrange the members of the structure such that closely
referenced members appear together and/or fit in the same
cacheline. Also, change the order of their initializations to
match the order in which they appear in the structure.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 11:38:31 +01:00
Mohit P. Tahiliani
1dbfc5e071 pie: use u8 instead of bool in pie_vars
Linux best practice recommends using u8 for true/false values in
structures.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 11:38:31 +01:00
Mohit P. Tahiliani
cf4eeee5ff pie: rearrange macros in order of length
Rearrange macros in order of length and align the values to
improve readability.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 11:38:31 +01:00
Mohit P. Tahiliani
805a5a23a4 pie: use U64_MAX to denote (2^64 - 1)
Use the U64_MAX macro to denote the constant (2^64 - 1).

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 11:38:31 +01:00
Mohit P. Tahiliani
84bf557fb0 net: sched: pie: move common code to pie.h
This patch moves macros, structures and small functions common
to PIE and FQ-PIE (to be added in a future commit) from the file
net/sched/sch_pie.c to the header file include/net/pie.h.
All the moved functions are made inline.

Signed-off-by: Mohit P. Tahiliani <tahiliani@nitk.edu.in>
Signed-off-by: Leslie Monis <lesliemonis@gmail.com>
Signed-off-by: Gautam Ramakrishnan <gautamramk@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 11:38:30 +01:00
David S. Miller
954b3c4397 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2020-01-22

The following pull-request contains BPF updates for your *net-next* tree.

We've added 92 non-merge commits during the last 16 day(s) which contain
a total of 320 files changed, 7532 insertions(+), 1448 deletions(-).

The main changes are:

1) function by function verification and program extensions from Alexei.

2) massive cleanup of selftests/bpf from Toke and Andrii.

3) batched bpf map operations from Brian and Yonghong.

4) tcp congestion control in bpf from Martin.

5) bulking for non-map xdp_redirect form Toke.

6) bpf_send_signal_thread helper from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-23 08:10:16 +01:00
Björn Töpel
43a825afc9 xsk, net: Make sock_def_readable() have external linkage
XDP sockets use the default implementation of struct sock's
sk_data_ready callback, which is sock_def_readable(). This function
is called in the XDP socket fast-path, and involves a retpoline. By
letting sock_def_readable() have external linkage, and being called
directly, the retpoline can be avoided.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20200120092917.13949-1-bjorn.topel@gmail.com
2020-01-22 00:08:52 +01:00
David S. Miller
4f2c17e0f3 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2020-01-21

1) Add support for TCP encapsulation of IKE and ESP messages,
   as defined by RFC 8229. Patchset from Sabrina Dubroca.

Please note that there is a merge conflict in:

net/unix/af_unix.c

between commit:

3c32da19a8 ("unix: Show number of pending scm files of receive queue in fdinfo")

from the net-next tree and commit:

b50b0580d2 ("net: add queue argument to __skb_wait_for_more_packets and __skb_{,try_}recv_datagram")

from the ipsec-next tree.

The conflict can be solved as done in linux-next.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-21 12:18:20 +01:00
David S. Miller
b3f7e3f23a Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net 2020-01-19 22:10:04 +01:00
Amit Cohen
c3cae4916e devlink: Add overlay source MAC is multicast trap
Add packet trap that can report NVE packets that the device decided to
drop because their overlay source MAC is multicast.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-19 16:23:52 +01:00
Amit Cohen
13c056ec7d devlink: Add tunnel generic packet traps
Add packet traps that can report packets that were dropped during tunnel
decapsulation.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-19 16:23:52 +01:00
Amit Cohen
95f0ead8f0 devlink: Add non-routable packet trap
Add packet trap that can report packets that reached the router, but are
non-routable. For example, IGMP queries can be flooded by the device in
layer 2 and reach the router. Such packets should not be routed and
instead dropped.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-19 16:23:52 +01:00
David S. Miller
7f013edeba Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next, they are:

1) Incorrect uapi header comment in bitwise, from Jeremy Sowden.

2) Fetch flow statistics if flow is still active.

3) Restrict flow matching on hardware based on input device.

4) Add nf_flow_offload_work_alloc() helper function.

5) Remove the last client of the FLOW_OFFLOAD_DYING flag, use teardown
   instead.

6) Use atomic bitwise operation to operate with flow flags.

7) Add nf_flowtable_hw_offload() helper function to check for the
   NF_FLOWTABLE_HW_OFFLOAD flag.

8) Add NF_FLOW_HW_REFRESH to retry hardware offload from the flowtable
   software datapath.

9) Remove indirect calls in xt_hashlimit, from Florian Westphal.

10) Add nf_flow_offload_tuple() helper to consolidate code.

11) Add nf_flow_table_offload_cmd() helper function.

12) A few whitespace cleanups in nf_tables in bitwise and the bitmap/hash
    set types, from Jeremy Sowden.

13) Cleanup netlink attribute checks in bitwise, from Jeremy Sowden.

14) Replace goto by return in error path of nft_bitwise_dump(), from
    Jeremy Sowden.

15) Add bitwise operation netlink attribute, also from Jeremy.

16) Add nft_bitwise_init_bool(), from Jeremy Sowden.

17) Add nft_bitwise_eval_bool(), also from Jeremy.

18) Add nft_bitwise_dump_bool(), from Jeremy Sowden.

19) Disallow hardware offload for other that NFT_BITWISE_BOOL,
    from Jeremy Sowden.

20) Add NFTA_BITWISE_DATA netlink attribute, again from Jeremy.

21) Add support for bitwise shift operation, from Jeremy Sowden.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-19 10:29:05 +01:00
Guillaume Nault
56f200c78c netns: Constify exported functions
Mark function parameters as 'const' where possible.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-17 13:25:24 +01:00
Pablo Neira Ayuso
f698fe4082 netfilter: flowtable: refresh flow if hardware offload fails
If nf_flow_offload_add() fails to add the flow to hardware, then the
NF_FLOW_HW_REFRESH flag bit is set and the flow remains in the flowtable
software path.

If flowtable hardware offload is enabled, this patch enqueues a new
request to offload this flow to hardware.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:52 +01:00
Pablo Neira Ayuso
a5449cdcaa netfilter: flowtable: add nf_flowtable_hw_offload() helper function
This function checks for the NF_FLOWTABLE_HW_OFFLOAD flag, meaning that
the flowtable hardware offload is enabled.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:51 +01:00
Pablo Neira Ayuso
355a8b13f8 netfilter: flowtable: use atomic bitwise operations for flow flags
Originally, all flow flag bits were set on only from the workqueue. With
the introduction of the flow teardown state and hardware offload this is
no longer true. Let's be safe and use atomic bitwise operation to
operation with flow flags.

Fixes: 59c466dd68 ("netfilter: nf_flow_table: add a new flow state for tearing down offloading")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:50 +01:00
Pablo Neira Ayuso
445db8d096 netfilter: flowtable: remove dying bit, use teardown bit instead
The dying bit removes the conntrack entry if the netdev that owns this
flow is going down. Instead, use the teardown mechanism to push back the
flow to conntrack to let the classic software path decide what to do
with it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2020-01-16 15:51:49 +01:00
David S. Miller
3981f955eb Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2020-01-15

The following pull-request contains BPF updates for your *net* tree.

We've added 12 non-merge commits during the last 9 day(s) which contain
a total of 13 files changed, 95 insertions(+), 43 deletions(-).

The main changes are:

1) Fix refcount leak for TCP time wait and request sockets for socket lookup
   related BPF helpers, from Lorenz Bauer.

2) Fix wrong verification of ARSH instruction under ALU32, from Daniel Borkmann.

3) Batch of several sockmap and related TLS fixes found while operating
   more complex BPF programs with Cilium and OpenSSL, from John Fastabend.

4) Fix sockmap to read psock's ingress_msg queue before regular sk_receive_queue()
   to avoid purging data upon teardown, from Lingpeng Chen.

5) Fix printing incorrect pointer in bpftool's btf_dump_ptr() in order to properly
   dump a BPF map's value with BTF, from Martin KaFai Lau.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-16 10:04:40 +01:00
John Fastabend
33bfe20dd7 bpf: Sockmap/tls, push write_space updates through ulp updates
When sockmap sock with TLS enabled is removed we cleanup bpf/psock state
and call tcp_update_ulp() to push updates to TLS ULP on top. However, we
don't push the write_space callback up and instead simply overwrite the
op with the psock stored previous op. This may or may not be correct so
to ensure we don't overwrite the TLS write space hook pass this field to
the ULP and have it fixup the ctx.

This completes a previous fix that pushed the ops through to the ULP
but at the time missed doing this for write_space, presumably because
write_space TLS hook was added around the same time.

Fixes: 95fa145479 ("bpf: sockmap/tls, close can race with map free")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Reviewed-by: Jakub Sitnicki <jakub@cloudflare.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/bpf/20200111061206.8028-4-john.fastabend@gmail.com
2020-01-15 23:26:13 +01:00
Luiz Augusto von Dentz
f9a619db7c Bluetooth: monitor: Add support for ISO packets
This enables passing ISO packets to the monitor socket.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-15 22:28:51 +01:00
Luiz Augusto von Dentz
4de0fc599e Bluetooth: Add definitions for CIS connections
These adds the HCI definitions for handling CIS connections along with
ISO data packets.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-15 22:28:51 +01:00
Alain Michaud
600a87490f Bluetooth: Implementation of MGMT_OP_SET_BLOCKED_KEYS.
MGMT command is added to receive the list of blocked keys from
user-space.

The list is used to:
1) Block keys from being distributed by the device during
   the ke distribution phase of SMP.
2) Filter out any keys that were previously saved so
   they are no longer used.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-15 22:28:41 +01:00
David S. Miller
eb507906fe A few fixes:
* -O3 enablement fallout, thanks to Arnd who ran this
  * fixes for a few leaks, thanks to Felix
  * channel 12 regulatory fix for custom regdomains
  * check for a crash reported by syzbot
    (NULL function is called on drivers that don't have it)
  * fix TKIP replay protection after setup with some APs
    (from Jouni)
  * restrict obtaining some mesh data to avoid WARN_ONs
  * fix deadlocks with auto-disconnect (socket owner)
  * fix radar detection events with multiple devices
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl4e4voACgkQB8qZga/f
 l8RVFg/+Ngjzje8xU6Ymh7VSwvBBz6CmSVwH/nRSsIAszDrQLoDCo7+eWeUHnAKG
 2ViFQ4Lf5CP2J9kSgi7pbD0yGQRDGxH1CVSxMB64uZZRErlOCZYLRTIeBn7HGb8w
 JhqAVXBjskzP9iB5jUZIkOgJ6L2ESM2KYKQVGtgn8p9fFdHKitec8UEKiPU3tcX1
 zx94EbKldynuvDNb7VkTm4MdoY8LK5KI8ovVdVKwJ05qCtLvc6xvEBYhm33Fl6Pe
 66cWklCrU093jQXUDTk0yBkwlcXGxJc143y1nXuurI8ZLH6q2wdDUAMHUEKdL6mz
 ktHxABK4JA4Z/ylphR2Tp1TNh8xnz/ZZ5IdAe4P9K0/PerrjH8Yk7zf/48lrHTSv
 LKb86Oaoq4MY4k5UNyjDUdgPB/q8CgxN2xDQaeQX69u740wrlxYtrwgHmqHrKI69
 ppCzL7a3wmJzAz/97TMFhSseCH1WEQfWY+tlzktrvRofSgOGsIUtcPP10NmvGZHj
 zdF79E9TiUPX4KVWssRb9A2MjBbvKA31hCQbH/OVKSWt/fItoBUbMQKtPxfnHcam
 /W/gCWrOpw5Qqhc/BdFbYYsgY9J06KzifsV5e2Z6he+6Ex+rtdZNoLXc6/4QH3+9
 HzsVYBP+ZUaqrMJi64KDN59C3ni4jDVs9zfYnVkC9LhzHASI5EE=
 =i20C
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-net-2020-01-15' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
A few fixes:
 * -O3 enablement fallout, thanks to Arnd who ran this
 * fixes for a few leaks, thanks to Felix
 * channel 12 regulatory fix for custom regdomains
 * check for a crash reported by syzbot
   (NULL function is called on drivers that don't have it)
 * fix TKIP replay protection after setup with some APs
   (from Jouni)
 * restrict obtaining some mesh data to avoid WARN_ONs
 * fix deadlocks with auto-disconnect (socket owner)
 * fix radar detection events with multiple devices
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-15 04:12:00 -08:00
John Crispin
dd56e90230 mac80211: add handling for BSS color
It is now possible to propagate BSS color settings into the subsystem. Lets
make mac80211 also handle them so that we can send them further down the
stack into the drivers. We drop the old bss_color field and change iwlwifi
to use the new he_bss_color struct.

Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20191217141921.8114-2-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 11:18:50 +01:00
John Crispin
5c5e52d1bb nl80211: add handling for BSS color
This patch adds the attributes, policy and parsing code to allow userland
to send the info about the BSS coloring settings to the kernel.

Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20191217141921.8114-1-john@phrozen.org
[johannes: remove the strict policy parsing, that was a misunderstanding]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 11:14:24 +01:00
Orr Mazor
26ec17a1dc cfg80211: Fix radar event during another phy CAC
In case a radar event of CAC_FINISHED or RADAR_DETECTED
happens during another phy is during CAC we might need
to cancel that CAC.

If we got a radar in a channel that another phy is now
doing CAC on then the CAC should be canceled there.

If, for example, 2 phys doing CAC on the same channels,
or on comptable channels, once on of them will finish his
CAC the other might need to cancel his CAC, since it is no
longer relevant.

To fix that the commit adds an callback and implement it in
mac80211 to end CAC.
This commit also adds a call to said callback if after a radar
event we see the CAC is no longer relevant

Signed-off-by: Orr Mazor <Orr.Mazor@tandemg.com>
Reviewed-by: Sergey Matyukevich <sergey.matyukevich.os@quantenna.com>
Link: https://lore.kernel.org/r/20191222145449.15792-1-Orr.Mazor@tandemg.com
[slightly reformat/reword commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-15 09:50:48 +01:00
Ido Schimmel
bb3c4ab93e ipv6: Add "offload" and "trap" indications to routes
In a similar fashion to previous patch, add "offload" and "trap"
indication to IPv6 routes.

This is done by using two unused bits in 'struct fib6_info' to hold
these indications. Capable drivers are expected to set these when
processing the various in-kernel route notifications.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Acked-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 18:53:35 -08:00
Ido Schimmel
90b93f1b31 ipv4: Add "offload" and "trap" indications to routes
When performing L3 offload, routes and nexthops are usually programmed
into two different tables in the underlying device. Therefore, the fact
that a nexthop resides in hardware does not necessarily mean that all
the associated routes also reside in hardware and vice-versa.

While the kernel can signal to user space the presence of a nexthop in
hardware (via 'RTNH_F_OFFLOAD'), it does not have a corresponding flag
for routes. In addition, the fact that a route resides in hardware does
not necessarily mean that the traffic is offloaded. For example,
unreachable routes (i.e., 'RTN_UNREACHABLE') are programmed to trap
packets to the CPU so that the kernel will be able to generate the
appropriate ICMP error packet.

This patch adds an "offload" and "trap" indications to IPv4 routes, so
that users will have better visibility into the offload process.

'struct fib_alias' is extended with two new fields that indicate if the
route resides in hardware or not and if it is offloading traffic from
the kernel or trapping packets to it. Note that the new fields are added
in the 6 bytes hole and therefore the struct still fits in a single
cache line [1].

Capable drivers are expected to invoke fib_alias_hw_flags_set() with the
route's key in order to set the flags.

The indications are dumped to user space via a new flags (i.e.,
'RTM_F_OFFLOAD' and 'RTM_F_TRAP') in the 'rtm_flags' field in the
ancillary header.

v2:
* Make use of 'struct fib_rt_info' in fib_alias_hw_flags_set()

[1]
struct fib_alias {
        struct hlist_node  fa_list;                      /*     0    16 */
        struct fib_info *          fa_info;              /*    16     8 */
        u8                         fa_tos;               /*    24     1 */
        u8                         fa_type;              /*    25     1 */
        u8                         fa_state;             /*    26     1 */
        u8                         fa_slen;              /*    27     1 */
        u32                        tb_id;                /*    28     4 */
        s16                        fa_default;           /*    32     2 */
        u8                         offload:1;            /*    34: 0  1 */
        u8                         trap:1;               /*    34: 1  1 */
        u8                         unused:6;             /*    34: 2  1 */

        /* XXX 5 bytes hole, try to pack */

        struct callback_head rcu __attribute__((__aligned__(8))); /*    40    16 */

        /* size: 56, cachelines: 1, members: 12 */
        /* sum members: 50, holes: 1, sum holes: 5 */
        /* sum bitfield members: 8 bits (1 bytes) */
        /* forced alignments: 1, forced holes: 1, sum forced holes: 5 */
        /* last cacheline: 56 bytes */
} __attribute__((__aligned__(8)));

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 18:53:35 -08:00
Ido Schimmel
1e301fd04e ipv4: Encapsulate function arguments in a struct
fib_dump_info() is used to prepare RTM_{NEW,DEL}ROUTE netlink messages
using the passed arguments. Currently, the function takes 11 arguments,
6 of which are attributes of the route being dumped (e.g., prefix, TOS).

The next patch will need the function to also dump to user space an
indication if the route is present in hardware or not. Instead of
passing yet another argument, change the function to take a struct
containing the different route attributes.

v2:
* Name last argument of fib_dump_info()
* Move 'struct fib_rt_info' to include/net/ip_fib.h so that it could
  later be passed to fib_alias_hw_flags_set()

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 18:53:35 -08:00
Antoine Tenart
5c937de78b net: macsec: PN wrap callback
Allow to call macsec_pn_wrapped from hardware drivers to notify when a
PN rolls over. Some drivers might used an interrupt to implement this.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 11:31:41 -08:00
Antoine Tenart
0830e20b62 net: macsec: introduce MACsec ops
This patch introduces MACsec ops for drivers to support offloading
MACsec operations.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 11:31:41 -08:00
Antoine Tenart
76564261a7 net: macsec: introduce the macsec_context structure
This patch introduces the macsec_context structure. It will be used
in the kernel to exchange information between the common MACsec
implementation (macsec.c) and the MACsec hardware offloading
implementations. This structure contains pointers to MACsec specific
structures which contain the actual MACsec configuration, and to the
underlying device (phydev for now).

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 11:31:41 -08:00
Antoine Tenart
c0e4eadfb8 net: macsec: move some definitions in a dedicated header
This patch moves some structure, type and identifier definitions into a
MACsec specific header. This patch does not modify how the MACsec code
is running and only move things around. This is a preparation for the
future MACsec hardware offloading support, which will re-use those
definitions outside macsec.c.

Signed-off-by: Antoine Tenart <antoine.tenart@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-14 11:31:41 -08:00
Jacob Keller
b0efcae5e1 devlink: correct misspelling of snapshot
The function to obtain a unique snapshot id was mistakenly typo'd as
devlink_region_shapshot_id_get. Fix this typo by renaming the function
and all of its users.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-11 14:30:24 -08:00
Jacob Keller
f4bdd71036 devlink: move devlink documentation to subfolder
Combine the documentation for devlink into a subfolder, and provide an
index.rst file that can be used to generally describe devlink.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-10 17:06:59 -08:00
Jacob Keller
468672b24f devlink: add macro for "fw.psid"
The "fw.psid" devlink info version is documented in devlink-info.rst,
and used by one driver. However, there is no associated macro for this
firmware version like there is for others. Add one now.

Signed-off-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-10 17:06:59 -08:00
Paolo Abeni
e66b2f31a0 tcp: clean ext on tx recycle
Otherwise we will find stray/unexpected/old extensions value on next
iteration.

On tcp_write_xmit() we can end-up splitting an already queued skb in two
parts, via tso_fragment(). The newly created skb can be allocated via
the tx cache and an upper layer will not be aware of it, so that upper
layer cannot set the ext properly.

Resetting the ext on recycle ensures that stale data is not propagated
in to packet headers or elsewhere.

An alternative would be add an additional hook in tso_fragment() or in
sk_stream_alloc_skb() to init the ext for upper layers that need it.

Co-developed-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:41 -08:00
Mat Martineau
35b2c32116 tcp: Export TCP functions and ops struct
MPTCP will make use of tcp_send_mss() and tcp_push() when sending
data to specific TCP subflows.

tcp_request_sock_ipvX_ops and ipvX_specific will be referenced
during TCP subflow creation.

Co-developed-by: Peter Krystad <peter.krystad@linux.intel.com>
Signed-off-by: Peter Krystad <peter.krystad@linux.intel.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:41 -08:00
Mat Martineau
8571248411 tcp: coalesce/collapse must respect MPTCP extensions
Coalesce and collapse of packets carrying MPTCP extensions is allowed
when the newer packet has no extension or the extensions carried by both
packets are equal.

This allows merging of TSO packet trains and even cross-TSO packets, and
does not require any additional action when moving data into existing
SKBs.

v3 -> v4:
 - allow collapsing, under mptcp_skb_can_collapse() constraint

v5 -> v6:
 - clarify MPTCP skb extensions must always be cleared at allocation
   time

Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:41 -08:00
Mat Martineau
3ee17bc78e mptcp: Add MPTCP to skb extensions
Add enum value for MPTCP and update config dependencies

v5 -> v6:
 - fixed '__unused' field size

Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:41 -08:00
Mat Martineau
1323059301 tcp, ulp: Add clone operation to tcp_ulp_ops
If ULP is used on a listening socket, icsk_ulp_ops and icsk_ulp_data are
copied when the listener is cloned. Sometimes the clone is immediately
deleted, which will invoke the release op on the clone and likely
corrupt the listening socket's icsk_ulp_data.

The clone operation is invoked immediately after the clone is copied and
gives the ULP type an opportunity to set up the clone socket and its
icsk_ulp_data.

The MPTCP ULP clone will silently fallback to plain TCP on allocation
failure, so 'clone()' does not need to return an error code.

v6 -> v7:
 - move and rename ulp clone helper to make it inline-friendly
v5 -> v6:
 - clarified MPTCP clone usage in commit message

Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:41 -08:00
Mat Martineau
c74a39c861 tcp: Add MPTCP option number
TCP option 30 is allocated for MPTCP by the IANA.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:41 -08:00
Mat Martineau
bf9765145b sock: Make sk_protocol a 16-bit value
Match the 16-bit width of skbuff->protocol. Fills an 8-bit hole so
sizeof(struct sock) does not change.

Also take care of BPF field access for sk_type/sk_protocol. Both of them
are now outside the bitfield, so we can use load instructions without
further shifting/masking.

v5 -> v6:
 - update eBPF accessors, too (Intel's kbuild test robot)
v2 -> v3:
 - keep 'sk_type' 2 bytes aligned (Eric)
v1 -> v2:
 - preserve sk_pacing_shift as bit field (Eric)

Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Cc: bpf@vger.kernel.org
Co-developed-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:41 -08:00
Mat Martineau
e9cdced78d net: Make sock protocol value checks more specific
SK_PROTOCOL_MAX is only used in two places, for DECNet and AX.25. The
limits have more to do with the those protocol definitions than they do
with the data type of sk_protocol, so remove SK_PROTOCOL_MAX and use
U8_MAX directly.

Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 18:41:40 -08:00
David S. Miller
a2d6d7ae59 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The ungrafting from PRIO bug fixes in net, when merged into net-next,
merge cleanly but create a build failure.  The resolution used here is
from Petr Machata.

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-09 12:13:43 -08:00
Martin KaFai Lau
0baf26b0fc bpf: tcp: Support tcp_congestion_ops in bpf
This patch makes "struct tcp_congestion_ops" to be the first user
of BPF STRUCT_OPS.  It allows implementing a tcp_congestion_ops
in bpf.

The BPF implemented tcp_congestion_ops can be used like
regular kernel tcp-cc through sysctl and setsockopt.  e.g.
[root@arch-fb-vm1 bpf]# sysctl -a | egrep congestion
net.ipv4.tcp_allowed_congestion_control = reno cubic bpf_cubic
net.ipv4.tcp_available_congestion_control = reno bic cubic bpf_cubic
net.ipv4.tcp_congestion_control = bpf_cubic

There has been attempt to move the TCP CC to the user space
(e.g. CCP in TCP).   The common arguments are faster turn around,
get away from long-tail kernel versions in production...etc,
which are legit points.

BPF has been the continuous effort to join both kernel and
userspace upsides together (e.g. XDP to gain the performance
advantage without bypassing the kernel).  The recent BPF
advancements (in particular BTF-aware verifier, BPF trampoline,
BPF CO-RE...) made implementing kernel struct ops (e.g. tcp cc)
possible in BPF.  It allows a faster turnaround for testing algorithm
in the production while leveraging the existing (and continue growing)
BPF feature/framework instead of building one specifically for
userspace TCP CC.

This patch allows write access to a few fields in tcp-sock
(in bpf_tcp_ca_btf_struct_access()).

The optional "get_info" is unsupported now.  It can be added
later.  One possible way is to output the info with a btf-id
to describe the content.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Andrii Nakryiko <andriin@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Link: https://lore.kernel.org/bpf/20200109003508.3856115-1-kafai@fb.com
2020-01-09 08:46:18 -08:00
Florian Fainelli
4d776482ec net: dsa: Get information about stacked DSA protocol
It is possible to stack multiple DSA switches in a way that they are not
part of the tree (disjoint) but the DSA master of a switch is a DSA
slave of another. When that happens switch drivers may have to know this
is the case so as to determine whether their tagging protocol has a
remove chance of working.

This is useful for specific switch drivers such as b53 where devices
have been known to be stacked in the wild without the Broadcom tag
protocol supporting that feature. This allows b53 to continue supporting
those devices by forcing the disabling of Broadcom tags on the outermost
switches if necessary.

The get_tag_protocol() function is therefore updated to gain an
additional enum dsa_tag_protocol argument which denotes the current
tagging protocol used by the DSA master we are attached to, else
DSA_TAG_PROTO_NONE for the top of the dsa_switch_tree.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 16:01:13 -08:00
Vikas Gupta
6181e5cb75 devlink: add support for reporter recovery completion
It is possible that a reporter recovery completion do not finish
successfully when recovery is triggered via
devlink_health_reporter_recover as recovery could be processed in
different context. In such scenario an error is returned by driver when
recover hook is invoked and successful recovery completion is
intimated later.
Expose devlink recover done API to update recovery stats.

Signed-off-by: Vikas Gupta <vikas.gupta@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 15:47:50 -08:00
David S. Miller
b73a65610b Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Missing netns context in arp_tables, from Florian Westphal.

2) Underflow in flowtable reference counter, from wenxu.

3) Fix incorrect ethernet destination address in flowtable offload,
   from wenxu.

4) Check for status of neighbour entry, from wenxu.

5) Fix NAT port mangling, from wenxu.

6) Unbind callbacks from destroy path to cleanup hardware properly
   on flowtable removal.

7) Fix missing casting statistics timestamp, add nf_flowtable_time_stamp
   and use it.

8) NULL pointer exception when timeout argument is null in conntrack
   dccp and sctp protocol helpers, from Florian Westphal.

9) Possible nul-dereference in ipset with IPSET_ATTR_LINENO, also from
   Florian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-08 15:22:41 -08:00
Pablo Neira Ayuso
fb46f1b780 netfilter: flowtable: add nf_flowtable_time_stamp
This patch adds nf_flowtable_time_stamp and updates the existing code to
use it.

This patch is also implicitly fixing up hardware statistic fetching via
nf_flow_offload_stats() where casting to u32 is missing. Use
nf_flow_timeout_delta() to fix this.

Fixes: c29f74e0df ("netfilter: nf_flow_table: hardware offload support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: wenxu <wenxu@ucloud.cn>
2020-01-06 10:30:46 +01:00
Vladimir Oltean
787cac3f5a net: dsa: Pass pcs_poll flag from driver to PHYLINK
The DSA drivers that implement .phylink_mac_link_state should normally
register an interrupt for the PCS, from which they should call
phylink_mac_change(). However not all switches implement this, and those
who don't should set this flag in dsa_switch in the .setup callback, so
that PHYLINK will poll for a few ms until the in-band AN link timer
expires and the PCS state settles.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05 23:22:32 -08:00
Vladimir Oltean
a68578c20a net: dsa: Make deferred_xmit private to sja1105
There are 3 things that are wrong with the DSA deferred xmit mechanism:

1. Its introduction has made the DSA hotpath ever so slightly more
   inefficient for everybody, since DSA_SKB_CB(skb)->deferred_xmit needs
   to be initialized to false for every transmitted frame, in order to
   figure out whether the driver requested deferral or not (a very rare
   occasion, rare even for the only driver that does use this mechanism:
   sja1105). That was necessary to avoid kfree_skb from freeing the skb.

2. Because L2 PTP is a link-local protocol like STP, it requires
   management routes and deferred xmit with this switch. But as opposed
   to STP, the deferred work mechanism needs to schedule the packet
   rather quickly for the TX timstamp to be collected in time and sent
   to user space. But there is no provision for controlling the
   scheduling priority of this deferred xmit workqueue. Too bad this is
   a rather specific requirement for a feature that nobody else uses
   (more below).

3. Perhaps most importantly, it makes the DSA core adhere a bit too
   much to the NXP company-wide policy "Innovate Where It Doesn't
   Matter". The sja1105 is probably the only DSA switch that requires
   some frames sent from the CPU to be routed to the slave port via an
   out-of-band configuration (register write) rather than in-band (DSA
   tag). And there are indeed very good reasons to not want to do that:
   if that out-of-band register is at the other end of a slow bus such
   as SPI, then you limit that Ethernet flow's throughput to effectively
   the throughput of the SPI bus. So hardware vendors should definitely
   not be encouraged to design this way. We do _not_ want more
   widespread use of this mechanism.

Luckily we have a solution for each of the 3 issues:

For 1, we can just remove that variable in the skb->cb and counteract
the effect of kfree_skb with skb_get, much to the same effect. The
advantage, of course, being that anybody who doesn't use deferred xmit
doesn't need to do any extra operation in the hotpath.

For 2, we can create a kernel thread for each port's deferred xmit work.
If the user switch ports are named swp0, swp1, swp2, the kernel threads
will be named swp0_xmit, swp1_xmit, swp2_xmit (there appears to be a 15
character length limit on kernel thread names). With this, the user can
change the scheduling priority with chrt $(pidof swp2_xmit).

For 3, we can actually move the entire implementation to the sja1105
driver.

So this patch deletes the generic implementation from the DSA core and
adds a new one, more adequate to the requirements of PTP TX
timestamping, in sja1105_main.c.

Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-05 15:13:13 -08:00
Luiz Augusto von Dentz
1efd927d66 Bluetooth: Add support for LE PHY Update Complete event
This handles LE PHY Update Complete event and store both tx_phy and
rx_phy into hci_conn.

Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-04 10:49:23 +01:00
Marcel Holtmann
657cc64647 Bluetooth: Remove usage of BT_ERR_RATELIMITED macro
The macro is really not needed and can be replaced with either usage of
bt_err_ratelimited or bt_dev_err_ratelimited.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2020-01-04 10:41:03 +01:00
Alain Michaud
36278a5d4d Bluetooth: Adding a bt_dev_warn_ratelimited macro.
The macro will be used to display rate limited warning messages in the
log.

Signed-off-by: Alain Michaud <alainm@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2020-01-04 10:41:03 +01:00
David Ahern
dea53bb80e tcp: Add l3index to tcp_md5sig_key and md5 functions
Add l3index to tcp_md5sig_key to represent the L3 domain of a key, and
add l3index to tcp_md5_do_add and tcp_md5_do_del to fill in the key.

With the key now based on an l3index, add the new parameter to the
lookup functions and consider the l3index when looking for a match.

The l3index comes from the skb when processing ingress packets leveraging
the helpers created for socket lookups, tcp_v4_sdif and inet_iif (and the
v6 variants). When the sdif index is set it means the packet ingressed a
device that is part of an L3 domain and inet_iif points to the VRF device.
For egress, the L3 domain is determined from the socket binding and
sk_bound_dev_if.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2020-01-02 15:51:22 -08:00
Johannes Berg
7d6aa9ba4f Merge remote-tracking branch 'net-next/master' into mac80211-next
Merging to get the mac80211 updates that have since propagated
into net-next.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2020-01-02 14:50:58 +01:00
David S. Miller
31d518f35e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Simple overlapping changes in bpf land wrt. bpf_helper_defs.h
handling.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-31 13:37:13 -08:00
Davide Caratti
a5b72a083d net/sched: add delete_empty() to filters and use it in cls_flower
Revert "net/sched: cls_u32: fix refcount leak in the error path of
u32_change()", and fix the u32 refcount leak in a more generic way that
preserves the semantic of rule dumping.
On tc filters that don't support lockless insertion/removal, there is no
need to guard against concurrent insertion when a removal is in progress.
Therefore, for most of them we can avoid a full walk() when deleting, and
just decrease the refcount, like it was done on older Linux kernels.
This fixes situations where walk() was wrongly detecting a non-empty
filter, like it happened with cls_u32 in the error path of change(), thus
leading to failures in the following tdc selftests:

 6aa7: (filter, u32) Add/Replace u32 with source match and invalid indev
 6658: (filter, u32) Add/Replace u32 with custom hash table and invalid handle
 74c2: (filter, u32) Add/Replace u32 filter with invalid hash table id

On cls_flower, and on (future) lockless filters, this check is necessary:
move all the check_empty() logic in a callback so that each filter
can have its own implementation. For cls_flower, it's sufficient to check
if no IDRs have been allocated.

This reverts commit 275c44aa19.

Changes since v1:
 - document the need for delete_empty() when TCF_PROTO_OPS_DOIT_UNLOCKED
   is used, thanks to Vlad Buslov
 - implement delete_empty() without doing fl_walk(), thanks to Vlad Buslov
 - squash revert and new fix in a single patch, to be nice with bisect
   tests that run tdc on u32 filter, thanks to Dave Miller

Fixes: 275c44aa19 ("net/sched: cls_u32: fix refcount leak in the error path of u32_change()")
Fixes: 6676d5e416 ("net: sched: set dedicated tcf_walker flag when tp is empty")
Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Suggested-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Tested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-30 20:35:19 -08:00
David S. Miller
2bbc078f81 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-12-27

The following pull-request contains BPF updates for your *net-next* tree.

We've added 127 non-merge commits during the last 17 day(s) which contain
a total of 110 files changed, 6901 insertions(+), 2721 deletions(-).

There are three merge conflicts. Conflicts and resolution looks as follows:

1) Merge conflict in net/bpf/test_run.c:

There was a tree-wide cleanup c593642c8b ("treewide: Use sizeof_field() macro")
which gets in the way with b590cb5f80 ("bpf: Switch to offsetofend in
BPF_PROG_TEST_RUN"):

  <<<<<<< HEAD
          if (!range_is_zero(__skb, offsetof(struct __sk_buff, priority) +
                             sizeof_field(struct __sk_buff, priority),
  =======
          if (!range_is_zero(__skb, offsetofend(struct __sk_buff, priority),
  >>>>>>> 7c8dce4b16

There are a few occasions that look similar to this. Always take the chunk with
offsetofend(). Note that there is one where the fields differ in here:

  <<<<<<< HEAD
          if (!range_is_zero(__skb, offsetof(struct __sk_buff, tstamp) +
                             sizeof_field(struct __sk_buff, tstamp),
  =======
          if (!range_is_zero(__skb, offsetofend(struct __sk_buff, gso_segs),
  >>>>>>> 7c8dce4b16

Just take the one with offsetofend() /and/ gso_segs. Latter is correct due to
850a88cc40 ("bpf: Expose __sk_buff wire_len/gso_segs to BPF_PROG_TEST_RUN").

2) Merge conflict in arch/riscv/net/bpf_jit_comp.c:

(I'm keeping Bjorn in Cc here for a double-check in case I got it wrong.)

  <<<<<<< HEAD
          if (is_13b_check(off, insn))
                  return -1;
          emit(rv_blt(tcc, RV_REG_ZERO, off >> 1), ctx);
  =======
          emit_branch(BPF_JSLT, RV_REG_T1, RV_REG_ZERO, off, ctx);
  >>>>>>> 7c8dce4b16

Result should look like:

          emit_branch(BPF_JSLT, tcc, RV_REG_ZERO, off, ctx);

3) Merge conflict in arch/riscv/include/asm/pgtable.h:

  <<<<<<< HEAD
  =======
  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
  #define VMALLOC_END      (PAGE_OFFSET - 1)
  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)

  #define BPF_JIT_REGION_SIZE     (SZ_128M)
  #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
  #define BPF_JIT_REGION_END      (VMALLOC_END)

  /*
   * Roughly size the vmemmap space to be large enough to fit enough
   * struct pages to map half the virtual address space. Then
   * position vmemmap directly below the VMALLOC region.
   */
  #define VMEMMAP_SHIFT \
          (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
  #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
  #define VMEMMAP_END     (VMALLOC_START - 1)
  #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)

  #define vmemmap         ((struct page *)VMEMMAP_START)

  >>>>>>> 7c8dce4b16

Only take the BPF_* defines from there and move them higher up in the
same file. Remove the rest from the chunk. The VMALLOC_* etc defines
got moved via 01f52e16b8 ("riscv: define vmemmap before pfn_to_page
calls"). Result:

  [...]
  #define __S101  PAGE_READ_EXEC
  #define __S110  PAGE_SHARED_EXEC
  #define __S111  PAGE_SHARED_EXEC

  #define VMALLOC_SIZE     (KERN_VIRT_SIZE >> 1)
  #define VMALLOC_END      (PAGE_OFFSET - 1)
  #define VMALLOC_START    (PAGE_OFFSET - VMALLOC_SIZE)

  #define BPF_JIT_REGION_SIZE     (SZ_128M)
  #define BPF_JIT_REGION_START    (PAGE_OFFSET - BPF_JIT_REGION_SIZE)
  #define BPF_JIT_REGION_END      (VMALLOC_END)

  /*
   * Roughly size the vmemmap space to be large enough to fit enough
   * struct pages to map half the virtual address space. Then
   * position vmemmap directly below the VMALLOC region.
   */
  #define VMEMMAP_SHIFT \
          (CONFIG_VA_BITS - PAGE_SHIFT - 1 + STRUCT_PAGE_MAX_SHIFT)
  #define VMEMMAP_SIZE    BIT(VMEMMAP_SHIFT)
  #define VMEMMAP_END     (VMALLOC_START - 1)
  #define VMEMMAP_START   (VMALLOC_START - VMEMMAP_SIZE)

  [...]

Let me know if there are any other issues.

Anyway, the main changes are:

1) Extend bpftool to produce a struct (aka "skeleton") tailored and specific
   to a provided BPF object file. This provides an alternative, simplified API
   compared to standard libbpf interaction. Also, add libbpf extern variable
   resolution for .kconfig section to import Kconfig data, from Andrii Nakryiko.

2) Add BPF dispatcher for XDP which is a mechanism to avoid indirect calls by
   generating a branch funnel as discussed back in bpfconf'19 at LSF/MM. Also,
   add various BPF riscv JIT improvements, from Björn Töpel.

3) Extend bpftool to allow matching BPF programs and maps by name,
   from Paul Chaignon.

4) Support for replacing cgroup BPF programs attached with BPF_F_ALLOW_MULTI
   flag for allowing updates without service interruption, from Andrey Ignatov.

5) Cleanup and simplification of ring access functions for AF_XDP with a
   bonus of 0-5% performance improvement, from Magnus Karlsson.

6) Enable BPF JITs for x86-64 and arm64 by default. Also, final version of
   audit support for BPF, from Daniel Borkmann and latter with Jiri Olsa.

7) Move and extend test_select_reuseport into BPF program tests under
   BPF selftests, from Jakub Sitnicki.

8) Various BPF sample improvements for xdpsock for customizing parameters
   to set up and benchmark AF_XDP, from Jay Jayatheerthan.

9) Improve libbpf to provide a ulimit hint on permission denied errors.
   Also change XDP sample programs to attach in driver mode by default,
   from Toke Høiland-Jørgensen.

10) Extend BPF test infrastructure to allow changing skb mark from tc BPF
    programs, from Nikita V. Shirokov.

11) Optimize prologue code sequence in BPF arm32 JIT, from Russell King.

12) Fix xdp_redirect_cpu BPF sample to manually attach to tracepoints after
    libbpf conversion, from Jesper Dangaard Brouer.

13) Minor misc improvements from various others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-27 14:20:10 -08:00
Ido Schimmel
caafb2509f ipv6: Remove old route notifications and convert listeners
Now that mlxsw is converted to use the new FIB notifications it is
possible to delete the old ones and use the new replace / append /
delete notifications.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 22:37:30 -08:00
Ido Schimmel
d2f0c9b114 ipv6: Handle route deletion notification
For the purpose of route offload, when a single route is deleted, it is
only of interest if it is the first route in the node or if it is
sibling to such a route.

In the first case, distinguish between several possibilities:

1. Route is the last route in the node. Emit a delete notification

2. Route is followed by a non-multipath route. Emit a replace
notification for the non-multipath route.

3. Route is followed by a multipath route. Emit a replace notification
for the multipath route.

In the second case, only emit a delete notification to ensure the route
is no longer used as a valid nexthop.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 22:37:30 -08:00
Ido Schimmel
6b722237b6 net: fib_notifier: Add temporary events to the FIB notification chain
Subsequent patches are going to simplify the IPv6 route offload API,
which will only use three events - replace, delete and append.

Introduce a temporary version of replace and delete in order to make the
conversion easier to review. Note that append does not need a temporary
version, as it is currently not used.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 22:37:29 -08:00
Hangbin Liu
f081042d12 net/dst: do not confirm neighbor for vxlan and geneve pmtu update
When do IPv6 tunnel PMTU update and calls __ip6_rt_update_pmtu() in the end,
we should not call dst_confirm_neigh() as there is no two-way communication.

So disable the neigh confirm for vxlan and geneve pmtu update.

v5: No change.
v4: No change.
v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

Fixes: a93bf0ff44 ("vxlan: update skb dst pmtu on tx path")
Fixes: 52a589d51f ("geneve: update skb dst pmtu on tx path")
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Tested-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 22:28:55 -08:00
Hangbin Liu
07dc35c6e3 net/dst: add new function skb_dst_update_pmtu_no_confirm
Add a new function skb_dst_update_pmtu_no_confirm() for callers who need
update pmtu but should not do neighbor confirm.

v5: No change.
v4: No change.
v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 22:28:54 -08:00
Hangbin Liu
bd085ef678 net: add bool confirm_neigh parameter for dst_ops.update_pmtu
The MTU update code is supposed to be invoked in response to real
networking events that update the PMTU. In IPv6 PMTU update function
__ip6_rt_update_pmtu() we called dst_confirm_neigh() to update neighbor
confirmed time.

But for tunnel code, it will call pmtu before xmit, like:
  - tnl_update_pmtu()
    - skb_dst_update_pmtu()
      - ip6_rt_update_pmtu()
        - __ip6_rt_update_pmtu()
          - dst_confirm_neigh()

If the tunnel remote dst mac address changed and we still do the neigh
confirm, we will not be able to update neigh cache and ping6 remote
will failed.

So for this ip_tunnel_xmit() case, _EVEN_ if the MTU is changed, we
should not be invoking dst_confirm_neigh() as we have no evidence
of successful two-way communication at this point.

On the other hand it is also important to keep the neigh reachability fresh
for TCP flows, so we cannot remove this dst_confirm_neigh() call.

To fix the issue, we have to add a new bool parameter for dst_ops.update_pmtu
to choose whether we should do neigh update or not. I will add the parameter
in this patch and set all the callers to true to comply with the previous
way, and fix the tunnel code one by one on later patches.

v5: No change.
v4: No change.
v3: Do not remove dst_confirm_neigh, but add a new bool parameter in
    dst_ops.update_pmtu to control whether we should do neighbor confirm.
    Also split the big patch to small ones for each area.
v2: Remove dst_confirm_neigh in __ip6_rt_update_pmtu.

Suggested-by: David Miller <davem@davemloft.net>
Reviewed-by: Guillaume Nault <gnault@redhat.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-24 22:28:54 -08:00
David S. Miller
ac80010fc9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Mere overlapping changes in the conflicts here.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-22 15:15:05 -08:00
Linus Torvalds
78bac77b52 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:

 1) Several nf_flow_table_offload fixes from Pablo Neira Ayuso,
    including adding a missing ipv6 match description.

 2) Several heap overflow fixes in mwifiex from qize wang and Ganapathi
    Bhat.

 3) Fix uninit value in bond_neigh_init(), from Eric Dumazet.

 4) Fix non-ACPI probing of nxp-nci, from Stephan Gerhold.

 5) Fix use after free in tipc_disc_rcv(), from Tuong Lien.

 6) Enforce limit of 33 tail calls in mips and riscv JIT, from Paul
    Chaignon.

 7) Multicast MAC limit test is off by one in qede, from Manish Chopra.

 8) Fix established socket lookup race when socket goes from
    TCP_ESTABLISHED to TCP_LISTEN, because there lacks an intervening
    RCU grace period. From Eric Dumazet.

 9) Don't send empty SKBs from tcp_write_xmit(), also from Eric Dumazet.

10) Fix active backup transition after link failure in bonding, from
    Mahesh Bandewar.

11) Avoid zero sized hash table in gtp driver, from Taehee Yoo.

12) Fix wrong interface passed to ->mac_link_up(), from Russell King.

13) Fix DSA egress flooding settings in b53, from Florian Fainelli.

14) Memory leak in gmac_setup_txqs(), from Navid Emamdoost.

15) Fix double free in dpaa2-ptp code, from Ioana Ciornei.

16) Reject invalid MTU values in stmmac, from Jose Abreu.

17) Fix refcount leak in error path of u32 classifier, from Davide
    Caratti.

18) Fix regression causing iwlwifi firmware crashes on boot, from Anders
    Kaseorg.

19) Fix inverted return value logic in llc2 code, from Chan Shu Tak.

20) Disable hardware GRO when XDP is attached to qede, frm Manish
    Chopra.

21) Since we encode state in the low pointer bits, dst metrics must be
    at least 4 byte aligned, which is not necessarily true on m68k. Add
    annotations to fix this, from Geert Uytterhoeven.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (160 commits)
  sfc: Include XDP packet headroom in buffer step size.
  sfc: fix channel allocation with brute force
  net: dst: Force 4-byte alignment of dst_metrics
  selftests: pmtu: fix init mtu value in description
  hv_netvsc: Fix unwanted rx_table reset
  net: phy: ensure that phy IDs are correctly typed
  mod_devicetable: fix PHY module format
  qede: Disable hardware gro when xdp prog is installed
  net: ena: fix issues in setting interrupt moderation params in ethtool
  net: ena: fix default tx interrupt moderation interval
  net/smc: unregister ib devices in reboot_event
  net: stmmac: platform: Fix MDIO init for platforms without PHY
  llc2: Fix return statement of llc_stat_ev_rx_null_dsap_xid_c (and _test_c)
  net: hisilicon: Fix a BUG trigered by wrong bytes_compl
  net: dsa: ksz: use common define for tag len
  s390/qeth: don't return -ENOTSUPP to userspace
  s390/qeth: fix promiscuous mode after reset
  s390/qeth: handle error due to unsupported transport mode
  cxgb4: fix refcount init for TC-MQPRIO offload
  tc-testing: initial tdc selftests for cls_u32
  ...
2019-12-22 09:54:33 -08:00
Geert Uytterhoeven
258a980d1e net: dst: Force 4-byte alignment of dst_metrics
When storing a pointer to a dst_metrics structure in dst_entry._metrics,
two flags are added in the least significant bits of the pointer value.
Hence this assumes all pointers to dst_metrics structures have at least
4-byte alignment.

However, on m68k, the minimum alignment of 32-bit values is 2 bytes, not
4 bytes.  Hence in some kernel builds, dst_default_metrics may be only
2-byte aligned, leading to obscure boot warnings like:

    WARNING: CPU: 0 PID: 7 at lib/refcount.c:28 refcount_warn_saturate+0x44/0x9a
    refcount_t: underflow; use-after-free.
    Modules linked in:
    CPU: 0 PID: 7 Comm: ksoftirqd/0 Tainted: G        W         5.5.0-rc2-atari-01448-g114a1a1038af891d-dirty #261
    Stack from 10835e6c:
	    10835e6c 0038134f 00023fa6 00394b0f 0000001c 00000009 00321560 00023fea
	    00394b0f 0000001c 001a70f8 00000009 00000000 10835eb4 00000001 00000000
	    04208040 0000000a 00394b4a 10835ed4 00043aa8 001a70f8 00394b0f 0000001c
	    00000009 00394b4a 0026aba8 003215a4 00000003 00000000 0026d5a8 00000001
	    003215a4 003a4361 003238d6 000001f0 00000000 003215a4 10aa3b00 00025e84
	    003ddb00 10834000 002416a8 10aa3b00 00000000 00000080 000aa038 0004854a
    Call Trace: [<00023fa6>] __warn+0xb2/0xb4
     [<00023fea>] warn_slowpath_fmt+0x42/0x64
     [<001a70f8>] refcount_warn_saturate+0x44/0x9a
     [<00043aa8>] printk+0x0/0x18
     [<001a70f8>] refcount_warn_saturate+0x44/0x9a
     [<0026aba8>] refcount_sub_and_test.constprop.73+0x38/0x3e
     [<0026d5a8>] ipv4_dst_destroy+0x5e/0x7e
     [<00025e84>] __local_bh_enable_ip+0x0/0x8e
     [<002416a8>] dst_destroy+0x40/0xae

Fix this by forcing 4-byte alignment of all dst_metrics structures.

Fixes: e5fd387ad5 ("ipv6: do not overwrite inetpeer metrics prematurely")
Signed-off-by: Geert Uytterhoeven <geert@linux-m68k.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20 21:54:00 -08:00
Oleksij Rempel
48fda74f0a net: dsa: add support for Atheros AR9331 TAG format
Add support for tag format used in Atheros AR9331 built-in switch.

Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-20 17:05:47 -08:00
Magnus Karlsson
f8509aa078 xsk: ixgbe: i40e: ice: mlx5: Xsk_umem_discard_addr to xsk_umem_release_addr
Change the name of xsk_umem_discard_addr to xsk_umem_release_addr to
better reflect the new naming of the AF_XDP queue manipulation
functions. As this functions is used by drivers implementing support
for AF_XDP zero-copy, it requires a name change to these drivers. The
function xsk_umem_release_addr_rq has also changed name in the same
fashion.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1576759171-28550-10-git-send-email-magnus.karlsson@intel.com
2019-12-20 16:00:09 -08:00
Magnus Karlsson
03896ef1f0 xsk: Change names of validation functions
Change the names of the validation functions to better reflect what
they are doing. The uppermost ones are reading entries from the rings
and only the bottom ones validate entries. So xskq_cons_read_ is a
better prefix name.

Also change the xskq_cons_read_ functions to return a bool
as the the descriptor or address is already returned by reference
in the parameters. Everyone is using the return value as a bool
anyway.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Link: https://lore.kernel.org/bpf/1576759171-28550-9-git-send-email-magnus.karlsson@intel.com
2019-12-20 16:00:09 -08:00
Björn Töpel
e312b9e706 xsk: Make xskmap flush_list common for all map instances
The xskmap flush list is used to track entries that need to flushed
from via the xdp_do_flush_map() function. This list used to be
per-map, but there is really no reason for that. Instead make the
flush list global for all xskmaps, which simplifies __xsk_map_flush()
and xsk_map_alloc().

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/bpf/20191219061006.21980-5-bjorn.topel@gmail.com
2019-12-19 21:09:43 -08:00
Jakub Kicinski
8d5a49e9e3 net/tls: add helper for testing if socket is RX offloaded
There is currently no way for driver to reliably check that
the socket it has looked up is in fact RX offloaded. Add
a helper. This allows drivers to catch misbehaving firmware.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-19 17:46:51 -08:00
Petr Machata
d35eb52bd2 net: sch_ets: Make the ETS qdisc offloadable
Add hooks at appropriate points to make it possible to offload the ETS
Qdisc.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-18 13:32:29 -08:00
Petr Machata
9586a992fb net: pkt_cls: Clarify a comment
The bit about negating HW backlog left me scratching my head. Clarify the
comment.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-18 13:32:29 -08:00
Eric Dumazet
7c68fa2bdd net: annotate lockless accesses to sk->sk_pacing_shift
sk->sk_pacing_shift can be read and written without lock
synchronization. This patch adds annotations to
document this fact and avoid future syzbot complains.

This might also avoid unexpected false sharing
in sk_pacing_shift_update(), as the compiler
could remove the conditional check and always
write over sk->sk_pacing_shift :

if (sk->sk_pacing_shift != val)
	sk->sk_pacing_shift = val;

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-17 22:09:52 -08:00
Sven Eckelmann
54e1f08bdd ipv6: Annotate ipv6_addr_is_* bitwise pointer casts
The sparse commit 6002ded74587 ("add a flag to warn on casts to/from
bitwise pointers") introduced a check for non-direct casts from/to
restricted datatypes (when -Wbitwise-pointer is enabled).

This triggered a warning in the 64 bit optimized ipv6_addr_is_*() functions
because sparse doesn't know that the buffer already points to some data in
the correct bitwise integer format. But these were correct and can
therefore be marked with __force to signalize sparse an intended cast to a
specific bitwise type.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-16 16:09:44 -08:00
Sven Eckelmann
1f1c1d7c89 ipv6: Annotate bitwise IPv6 dsfield pointer cast
The sparse commit 6002ded74587 ("add a flag to warn on casts to/from
bitwise pointers") introduced a check for non-direct casts from/to
restricted datatypes (when -Wbitwise-pointer is enabled).

This triggered a warning in ipv6_get_dsfield() because sparse doesn't know
that the buffer already points to some data in the correct bitwise integer
format. This was already fixed in ipv6_change_dsfield() by the __force
attribute and can be fixed here the same way.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-16 16:09:44 -08:00
Eric Dumazet
ee2aabd3fc tcp: refine tcp_write_queue_empty() implementation
Due to how tcp_sendmsg() is implemented, we can have an empty
skb at the tail of the write queue.

Most [1] tcp_write_queue_empty() callers want to know if there is
anything to send (payload and/or FIN)

Instead of checking if the sk_write_queue is empty, we need
to test if tp->write_seq == tp->snd_nxt

[1] tcp_send_fin() was the only caller that expected to
 see if an skb was in the write queue, I have changed the code
 to reuse the tcp_write_queue_tail() result.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-12-13 21:58:40 -08:00
Eric Dumazet
8dbd76e79a tcp/dccp: fix possible race __inet_lookup_established()
Michal Kubecek and Firo Yang did a very nice analysis of crashes
happening in __inet_lookup_established().

Since a TCP socket can go from TCP_ESTABLISH to TCP_LISTEN
(via a close()/socket()/listen() cycle) without a RCU grace period,
I should not have changed listeners linkage in their hash table.

They must use the nulls protocol (Documentation/RCU/rculist_nulls.txt),
so that a lookup can detect a socket in a hash list was moved in
another one.

Since we added code in commit d296ba60d8 ("soreuseport: Resolve
merge conflict for v4/v6 ordering fix"), we have to add
hlist_nulls_add_tail_rcu() helper.

Fixes: 3b24d854cb ("tcp/dccp: do not touch listener sk_refcnt under synflood")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Reported-by: Firo Yang <firo.yang@suse.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Link: https://lore.kernel.org/netdev/20191120083919.GH27852@unicorn.suse.cz/
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-12-13 21:40:49 -08:00
John Crispin
50ff477a86 mac80211: add 802.11 encapsulation offloading support
This patch adds a new transmit path for hardware that supports 802.11
encapsulation offloading. In those cases 802.3 frames get passed
directly to the driver allowing the hardware to handle the encapsulation.
Some features such as monitor mode and TKIP would break when encapsulation
offloading is enabled. If any of these get enabled, the code will alwyas
fallback to the normal sw encapsulation data path.

The patch defines a secondary netdev_ops struct that the device gets
assigned if 802.11 encap support is available and enabled. The driver
needs to enable the support on a per vif basis if it finds that all
pre-reqs are meet.

Signed-off-by: Vasanthakumar Thiagarajan <vthiagar@qti.qualcomm.com>
Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20191125100438.16539-1-john@phrozen.org
[reword comments, remove SUPPORTS_80211_ENCAP HW flag, minor cleanups]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-12-13 20:02:36 +01:00
Michal Kubecek
32d5109a9d netlink: rename nl80211_validate_nested() to nla_validate_nested()
Function nl80211_validate_nested() is not specific to nl80211, it's
a counterpart to nla_validate_nested_deprecated() with strict validation.
For consistency with other validation and parse functions, rename it to
nla_validate_nested().

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-12 17:07:05 -08:00
Kirill Tkhai
3c32da19a8 unix: Show number of pending scm files of receive queue in fdinfo
Unix sockets like a block box. You never know what is stored there:
there may be a file descriptor holding a mount or a block device,
or there may be whole universes with namespaces, sockets with receive
queues full of sockets etc.

The patch adds a little debug and accounts number of files (not recursive),
which is in receive queue of a unix socket. Sometimes this is useful
to determine, that socket should be investigated or which task should
be killed to put reference counter on a resourse.

v2: Pass correct argument to lockdep

Signed-off-by: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-12 17:04:55 -08:00
Stefano Garzarella
0e12190578 vsock: add local transport support in the vsock core
This patch allows to register a transport able to handle
local communication (loopback).

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-11 15:01:23 -08:00
Kevin(Yudong) Yang
65e6d90168 net-tcp: Disable TCP ssthresh metrics cache by default
This patch introduces a sysctl knob "net.ipv4.tcp_no_ssthresh_metrics_save"
that disables TCP ssthresh metrics cache by default. Other parts of TCP
metrics cache, e.g. rtt, cwnd, remain unchanged.

As modern networks becoming more and more dynamic, TCP metrics cache
today often causes more harm than benefits. For example, the same IP
address is often shared by different subscribers behind NAT in residential
networks. Even if the IP address is not shared by different users,
caching the slow-start threshold of a previous short flow using loss-based
congestion control (e.g. cubic) often causes the future longer flows of
the same network path to exit slow-start prematurely with abysmal
throughput.

Caching ssthresh is very risky and can lead to terrible performance.
Therefore it makes sense to make disabling ssthresh caching by
default and opt-in for specific networks by the administrators.
This practice also has worked well for several years of deployment with
CUBIC congestion control at Google.

Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Kevin(Yudong) Yang <yyd@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-09 20:17:48 -08:00
Pankaj Bharadiya
c593642c8b treewide: Use sizeof_field() macro
Replace all the occurrences of FIELD_SIZEOF() with sizeof_field() except
at places where these are defined. Later patches will remove the unused
definition of FIELD_SIZEOF().

This patch is generated using following script:

EXCLUDE_FILES="include/linux/stddef.h|include/linux/kernel.h"

git grep -l -e "\bFIELD_SIZEOF\b" | while read file;
do

	if [[ "$file" =~ $EXCLUDE_FILES ]]; then
		continue
	fi
	sed -i  -e 's/\bFIELD_SIZEOF\b/sizeof_field/g' $file;
done

Signed-off-by: Pankaj Bharadiya <pankaj.laxminarayan.bharadiya@intel.com>
Link: https://lore.kernel.org/r/20190924105839.110713-3-pankaj.laxminarayan.bharadiya@intel.com
Co-developed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: David Miller <davem@davemloft.net> # for net
2019-12-09 10:36:44 -08:00
Martin Schiller
f8fc57e8d7 net/x25: add new state X25_STATE_5
This is needed, because if the flag X25_ACCPT_APPRV_FLAG is not set on a
socket (manual call confirmation) and the channel is cleared by remote
before the manual call confirmation was sent, this situation needs to
be handled.

Signed-off-by: Martin Schiller <ms@dev.tdt.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-09 10:28:43 -08:00
Eric Dumazet
f394722fb0 neighbour: remove neigh_cleanup() method
neigh_cleanup() has not been used for seven years, and was a wrong design.

Messing with shared pointer in bond_neigh_init() without proper
memory barriers would at least trigger syzbot complains eventually.

It is time to remove this stuff.

Fixes: b63b70d877 ("IPoIB: Use a private hash table for path lookup in xmit path")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-09 09:48:47 -08:00
Sabrina Dubroca
e27cca96cd xfrm: add espintcp (RFC 8229)
TCP encapsulation of IKE and IPsec messages (RFC 8229) is implemented
as a TCP ULP, overriding in particular the sendmsg and recvmsg
operations. A Stream Parser is used to extract messages out of the TCP
stream using the first 2 bytes as length marker. Received IKE messages
are put on "ike_queue", waiting to be dequeued by the custom recvmsg
implementation. Received ESP messages are sent to XFRM, like with UDP
encapsulation.

Some of this code is taken from the original submission by Herbert
Xu. Currently, only IPv4 is supported, like for UDP encapsulation.

Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-12-09 09:59:07 +01:00
Sabrina Dubroca
7b3801927e xfrm: introduce xfrm_trans_queue_net
This will be used by TCP encapsulation to write packets to the encap
socket without holding the user socket's lock. Without this reinjection,
we're already holding the lock of the user socket, and then try to lock
the encap socket as well when we enqueue the encrypted packet.

While at it, add a BUILD_BUG_ON like we usually do for skb->cb, since
it's missing for struct xfrm_trans_cb.

Co-developed-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-12-09 09:59:07 +01:00
Eric Dumazet
501a90c945 inet: protect against too small mtu values.
syzbot was once again able to crash a host by setting a very small mtu
on loopback device.

Let's make inetdev_valid_mtu() available in include/net/ip.h,
and use it in ip_setup_cork(), so that we protect both ip_append_page()
and __ip_append_data()

Also add a READ_ONCE() when the device mtu is read.

Pairs this lockless read with one WRITE_ONCE() in __dev_set_mtu(),
even if other code paths might write over this field.

Add a big comment in include/linux/netdevice.h about dev->mtu
needing READ_ONCE()/WRITE_ONCE() annotations.

Hopefully we will add the missing ones in followup patches.

[1]

refcount_t: saturated; leaking memory.
WARNING: CPU: 0 PID: 9464 at lib/refcount.c:22 refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
Kernel panic - not syncing: panic_on_warn set ...
CPU: 0 PID: 9464 Comm: syz-executor850 Not tainted 5.4.0-syzkaller #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x197/0x210 lib/dump_stack.c:118
 panic+0x2e3/0x75c kernel/panic.c:221
 __warn.cold+0x2f/0x3e kernel/panic.c:582
 report_bug+0x289/0x300 lib/bug.c:195
 fixup_bug arch/x86/kernel/traps.c:174 [inline]
 fixup_bug arch/x86/kernel/traps.c:169 [inline]
 do_error_trap+0x11b/0x200 arch/x86/kernel/traps.c:267
 do_invalid_op+0x37/0x50 arch/x86/kernel/traps.c:286
 invalid_op+0x23/0x30 arch/x86/entry/entry_64.S:1027
RIP: 0010:refcount_warn_saturate+0x138/0x1f0 lib/refcount.c:22
Code: 06 31 ff 89 de e8 c8 f5 e6 fd 84 db 0f 85 6f ff ff ff e8 7b f4 e6 fd 48 c7 c7 e0 71 4f 88 c6 05 56 a6 a4 06 01 e8 c7 a8 b7 fd <0f> 0b e9 50 ff ff ff e8 5c f4 e6 fd 0f b6 1d 3d a6 a4 06 31 ff 89
RSP: 0018:ffff88809689f550 EFLAGS: 00010286
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: ffffffff815e4336 RDI: ffffed1012d13e9c
RBP: ffff88809689f560 R08: ffff88809c50a3c0 R09: fffffbfff15d31b1
R10: fffffbfff15d31b0 R11: ffffffff8ae98d87 R12: 0000000000000001
R13: 0000000000040100 R14: ffff888099041104 R15: ffff888218d96e40
 refcount_add include/linux/refcount.h:193 [inline]
 skb_set_owner_w+0x2b6/0x410 net/core/sock.c:1999
 sock_wmalloc+0xf1/0x120 net/core/sock.c:2096
 ip_append_page+0x7ef/0x1190 net/ipv4/ip_output.c:1383
 udp_sendpage+0x1c7/0x480 net/ipv4/udp.c:1276
 inet_sendpage+0xdb/0x150 net/ipv4/af_inet.c:821
 kernel_sendpage+0x92/0xf0 net/socket.c:3794
 sock_sendpage+0x8b/0xc0 net/socket.c:936
 pipe_to_sendpage+0x2da/0x3c0 fs/splice.c:458
 splice_from_pipe_feed fs/splice.c:512 [inline]
 __splice_from_pipe+0x3ee/0x7c0 fs/splice.c:636
 splice_from_pipe+0x108/0x170 fs/splice.c:671
 generic_splice_sendpage+0x3c/0x50 fs/splice.c:842
 do_splice_from fs/splice.c:861 [inline]
 direct_splice_actor+0x123/0x190 fs/splice.c:1035
 splice_direct_to_actor+0x3b4/0xa30 fs/splice.c:990
 do_splice_direct+0x1da/0x2a0 fs/splice.c:1078
 do_sendfile+0x597/0xd00 fs/read_write.c:1464
 __do_sys_sendfile64 fs/read_write.c:1525 [inline]
 __se_sys_sendfile64 fs/read_write.c:1511 [inline]
 __x64_sys_sendfile64+0x1dd/0x220 fs/read_write.c:1511
 do_syscall_64+0xfa/0x790 arch/x86/entry/common.c:294
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x441409
Code: e8 ac e8 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fffb64c4f78 EFLAGS: 00000246 ORIG_RAX: 0000000000000028
RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 0000000000441409
RDX: 0000000000000000 RSI: 0000000000000006 RDI: 0000000000000005
RBP: 0000000000073b8a R08: 0000000000000010 R09: 0000000000000010
R10: 0000000000010001 R11: 0000000000000246 R12: 0000000000402180
R13: 0000000000402210 R14: 0000000000000000 R15: 0000000000000000
Kernel Offset: disabled
Rebooting in 86400 seconds..

Fixes: 1470ddf7f8 ("inet: Remove explicit write references to sk/inet in ip_append_data")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-07 11:55:11 -08:00
Guillaume Nault
721c8dafad tcp: Protect accesses to .ts_recent_stamp with {READ,WRITE}_ONCE()
Syncookies borrow the ->rx_opt.ts_recent_stamp field to store the
timestamp of the last synflood. Protect them with READ_ONCE() and
WRITE_ONCE() since reads and writes aren't serialised.

Use of .rx_opt.ts_recent_stamp for storing the synflood timestamp was
introduced by a0f82f64e2 ("syncookies: remove last_synq_overflow from
struct tcp_sock"). But unprotected accesses were already there when
timestamp was stored in .last_synq_overflow.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-06 21:05:14 -08:00
Guillaume Nault
cb44a08f86 tcp: tighten acceptance of ACKs not matching a child socket
When no synflood occurs, the synflood timestamp isn't updated.
Therefore it can be so old that time_after32() can consider it to be
in the future.

That's a problem for tcp_synq_no_recent_overflow() as it may report
that a recent overflow occurred while, in fact, it's just that jiffies
has grown past 'last_overflow' + TCP_SYNCOOKIE_VALID + 2^31.

Spurious detection of recent overflows lead to extra syncookie
verification in cookie_v[46]_check(). At that point, the verification
should fail and the packet dropped. But we should have dropped the
packet earlier as we didn't even send a syncookie.

Let's refine tcp_synq_no_recent_overflow() to report a recent overflow
only if jiffies is within the
[last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval. This
way, no spurious recent overflow is reported when jiffies wraps and
'last_overflow' becomes in the future from the point of view of
time_after32().

However, if jiffies wraps and enters the
[last_overflow, last_overflow + TCP_SYNCOOKIE_VALID] interval (with
'last_overflow' being a stale synflood timestamp), then
tcp_synq_no_recent_overflow() still erroneously reports an
overflow. In such cases, we have to rely on syncookie verification
to drop the packet. We unfortunately have no way to differentiate
between a fresh and a stale syncookie timestamp.

In practice, using last_overflow as lower bound is problematic.
If the synflood timestamp is concurrently updated between the time
we read jiffies and the moment we store the timestamp in
'last_overflow', then 'now' becomes smaller than 'last_overflow' and
tcp_synq_no_recent_overflow() returns true, potentially dropping a
valid syncookie.

Reading jiffies after loading the timestamp could fix the problem,
but that'd require a memory barrier. Let's just accommodate for
potential timestamp growth instead and extend the interval using
'last_overflow - HZ' as lower bound.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-06 21:05:14 -08:00
Guillaume Nault
04d26e7b15 tcp: fix rejected syncookies due to stale timestamps
If no synflood happens for a long enough period of time, then the
synflood timestamp isn't refreshed and jiffies can advance so much
that time_after32() can't accurately compare them any more.

Therefore, we can end up in a situation where time_after32(now,
last_overflow + HZ) returns false, just because these two values are
too far apart. In that case, the synflood timestamp isn't updated as
it should be, which can trick tcp_synq_no_recent_overflow() into
rejecting valid syncookies.

For example, let's consider the following scenario on a system
with HZ=1000:

  * The synflood timestamp is 0, either because that's the timestamp
    of the last synflood or, more commonly, because we're working with
    a freshly created socket.

  * We receive a new SYN, which triggers synflood protection. Let's say
    that this happens when jiffies == 2147484649 (that is,
    'synflood timestamp' + HZ + 2^31 + 1).

  * Then tcp_synq_overflow() doesn't update the synflood timestamp,
    because time_after32(2147484649, 1000) returns false.
    With:
      - 2147484649: the value of jiffies, aka. 'now'.
      - 1000: the value of 'last_overflow' + HZ.

  * A bit later, we receive the ACK completing the 3WHS. But
    cookie_v[46]_check() rejects it because tcp_synq_no_recent_overflow()
    says that we're not under synflood. That's because
    time_after32(2147484649, 120000) returns false.
    With:
      - 2147484649: the value of jiffies, aka. 'now'.
      - 120000: the value of 'last_overflow' + TCP_SYNCOOKIE_VALID.

    Of course, in reality jiffies would have increased a bit, but this
    condition will last for the next 119 seconds, which is far enough
    to accommodate for jiffie's growth.

Fix this by updating the overflow timestamp whenever jiffies isn't
within the [last_overflow, last_overflow + HZ] range. That shouldn't
have any performance impact since the update still happens at most once
per second.

Now we're guaranteed to have fresh timestamps while under synflood, so
tcp_synq_no_recent_overflow() can safely use it with time_after32() in
such situations.

Stale timestamps can still make tcp_synq_no_recent_overflow() return
the wrong verdict when not under synflood. This will be handled in the
next patch.

For 64 bits architectures, the problem was introduced with the
conversion of ->tw_ts_recent_stamp to 32 bits integer by commit
cca9bab1b7 ("tcp: use monotonic timestamps for PAWS").
The problem has always been there on 32 bits architectures.

Fixes: cca9bab1b7 ("tcp: use monotonic timestamps for PAWS")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-06 21:05:14 -08:00
John Hurley
dbad340889 net: core: rename indirect block ingress cb function
With indirect blocks, a driver can register for callbacks from a device
that is does not 'own', for example, a tunnel device. When registering to
or unregistering from a new device, a callback is triggered to generate
a bind/unbind event. This, in turn, allows the driver to receive any
existing rules or to properly clean up installed rules.

When first added, it was assumed that all indirect block registrations
would be for ingress offloads. However, the NFP driver can, in some
instances, support clsact qdisc binds for egress offload.

Change the name of the indirect block callback command in flow_offload to
remove the 'ingress' identifier from it. While this does not change
functionality, a follow up patch will implement a more more generic
callback than just those currently just supporting ingress offload.

Fixes: 4d12ba4278 ("nfp: flower: allow offloading of matches on 'internal' ports")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-06 20:45:09 -08:00
Sabrina Dubroca
6c8991f415 net: ipv6_stub: use ip6_dst_lookup_flow instead of ip6_dst_lookup
ipv6_stub uses the ip6_dst_lookup function to allow other modules to
perform IPv6 lookups. However, this function skips the XFRM layer
entirely.

All users of ipv6_stub->ip6_dst_lookup use ip_route_output_flow (via the
ip_route_output_key and ip_route_output helpers) for their IPv4 lookups,
which calls xfrm_lookup_route(). This patch fixes this inconsistent
behavior by switching the stub to ip6_dst_lookup_flow, which also calls
xfrm_lookup_route().

This requires some changes in all the callers, as these two functions
take different arguments and have different return types.

Fixes: 5f81bd2e5d ("ipv6: export a stub for IPv6 symbols used by vxlan")
Reported-by: Xiumei Mu <xmu@redhat.com>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-04 12:27:13 -08:00
Sabrina Dubroca
c4e85f73af net: ipv6: add net argument to ip6_dst_lookup_flow
This will be used in the conversion of ipv6_stub to ip6_dst_lookup_flow,
as some modules currently pass a net argument without a socket to
ip6_dst_lookup. This is equivalent to commit 343d60aada ("ipv6: change
ipv6_stub_impl.ipv6_dst_lookup to take net argument").

Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-04 12:27:12 -08:00
Yoshiki Komachi
8ffb055bea cls_flower: Fix the behavior using port ranges with hw-offload
The recent commit 5c72299fba ("net: sched: cls_flower: Classify
packets using port ranges") had added filtering based on port ranges
to tc flower. However the commit missed necessary changes in hw-offload
code, so the feature gave rise to generating incorrect offloaded flow
keys in NIC.

One more detailed example is below:

$ tc qdisc add dev eth0 ingress
$ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
  dst_port 100-200 action drop

With the setup above, an exact match filter with dst_port == 0 will be
installed in NIC by hw-offload. IOW, the NIC will have a rule which is
equivalent to the following one.

$ tc qdisc add dev eth0 ingress
$ tc filter add dev eth0 ingress protocol ip flower ip_proto tcp \
  dst_port 0 action drop

The behavior was caused by the flow dissector which extracts packet
data into the flow key in the tc flower. More specifically, regardless
of exact match or specified port ranges, fl_init_dissector() set the
FLOW_DISSECTOR_KEY_PORTS flag in struct flow_dissector to extract port
numbers from skb in skb_flow_dissect() called by fl_classify(). Note
that device drivers received the same struct flow_dissector object as
used in skb_flow_dissect(). Thus, offloaded drivers could not identify
which of these is used because the FLOW_DISSECTOR_KEY_PORTS flag was
set to struct flow_dissector in either case.

This patch adds the new FLOW_DISSECTOR_KEY_PORTS_RANGE flag and the new
tp_range field in struct fl_flow_key to recognize which filters are applied
to offloaded drivers. At this point, when filters based on port ranges
passed to drivers, drivers return the EOPNOTSUPP error because they do
not support the feature (the newly created FLOW_DISSECTOR_KEY_PORTS_RANGE
flag).

Fixes: 5c72299fba ("net: sched: cls_flower: Classify packets using port ranges")
Signed-off-by: Yoshiki Komachi <komachi.yoshiki@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-12-03 11:55:46 -08:00
Jakub Kicinski
c5daa6cccd net/tls: use sg_next() to walk sg entries
Partially sent record cleanup path increments an SG entry
directly instead of using sg_next(). This should not be a
problem today, as encrypted messages should be always
allocated as arrays. But given this is a cleanup path it's
easy to miss was this ever to change. Use sg_next(), and
simplify the code.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-28 22:40:29 -08:00
Jakub Kicinski
9e5ffed37d net/tls: remove the dead inplace_crypto code
Looks like when BPF support was added by commit d3b18ad31f
("tls: add bpf support to sk_msg handling") and
commit d829e9c411 ("tls: convert to generic sk_msg interface")
it broke/removed the support for in-place crypto as added by
commit 4e6d47206c ("tls: Add support for inplace records
encryption").

The inplace_crypto member of struct tls_rec is dead, inited
to zero, and sometimes set to zero again. It used to be
set to 1 when record was allocated, but the skmsg code doesn't
seem to have been written with the idea of in-place crypto
in mind.

Since non trivial effort is required to bring the feature back
and we don't really have the HW to measure the benefit just
remove the left over support for now to avoid confusing readers.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-28 22:40:29 -08:00
Linus Torvalds
8c39f71ee2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from David Miller:
 "This is mostly to fix the iwlwifi regression:

  1) Flush GRO state properly in iwlwifi driver, from Alexander Lobakin.

  2) Validate TIPC link name with properly length macro, from John
     Rutherford.

  3) Fix completion init and device query timeouts in ibmvnic, from
     Thomas Falcon.

  4) Fix SKB size calculation for netlink messages in psample, from
     Nikolay Aleksandrov.

  5) Similar kind of fix for OVS flow dumps, from Paolo Abeni.

  6) Handle queue allocation failure unwind properly in gve driver, we
     could try to release pages we didn't allocate. From Jeroen de
     Borst.

  7) Serialize TX queue SKB list accesses properly in mscc ocelot
     driver. From Yangbo Lu"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net:
  net: usb: aqc111: Use the correct style for SPDX License Identifier
  net: phy: Use the correct style for SPDX License Identifier
  net: wireless: intel: iwlwifi: fix GRO_NORMAL packet stalling
  net: mscc: ocelot: use skb queue instead of skbs list
  net: mscc: ocelot: avoid incorrect consuming in skbs list
  gve: Fix the queue page list allocated pages count
  net: inet_is_local_reserved_port() port arg should be unsigned short
  openvswitch: fix flow command message size
  net: phy: dp83869: Fix return paths to return proper values
  net: psample: fix skb_over_panic
  net: usbnet: Fix -Wcast-function-type
  net: hso: Fix -Wcast-function-type
  net: port < inet_prot_sock(net) --> inet_port_requires_bind_service(net, port)
  ibmvnic: Serialize device queries
  ibmvnic: Bound waits for device queries
  ibmvnic: Terminate waiting device threads after loss of service
  ibmvnic: Fix completion structure initialization
  net-sctp: replace some sock_net(sk) with just 'net'
  net: Fix a documentation bug wrt. ip_unprivileged_port_start
  tipc: fix link name length check
2019-11-27 17:17:40 -08:00
Linus Torvalds
168829ad09 Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking updates from Ingo Molnar:
 "The main changes in this cycle were:

   - A comprehensive rewrite of the robust/PI futex code's exit handling
     to fix various exit races. (Thomas Gleixner et al)

   - Rework the generic REFCOUNT_FULL implementation using
     atomic_fetch_* operations so that the performance impact of the
     cmpxchg() loops is mitigated for common refcount operations.

     With these performance improvements the generic implementation of
     refcount_t should be good enough for everybody - and this got
     confirmed by performance testing, so remove ARCH_HAS_REFCOUNT and
     REFCOUNT_FULL entirely, leaving the generic implementation enabled
     unconditionally. (Will Deacon)

   - Other misc changes, fixes, cleanups"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (27 commits)
  lkdtm: Remove references to CONFIG_REFCOUNT_FULL
  locking/refcount: Remove unused 'refcount_error_report()' function
  locking/refcount: Consolidate implementations of refcount_t
  locking/refcount: Consolidate REFCOUNT_{MAX,SATURATED} definitions
  locking/refcount: Move saturation warnings out of line
  locking/refcount: Improve performance of generic REFCOUNT_FULL code
  locking/refcount: Move the bulk of the REFCOUNT_FULL implementation into the <linux/refcount.h> header
  locking/refcount: Remove unused refcount_*_checked() variants
  locking/refcount: Ensure integer operands are treated as signed
  locking/refcount: Define constants for saturation and max refcount values
  futex: Prevent exit livelock
  futex: Provide distinct return value when owner is exiting
  futex: Add mutex around futex exit
  futex: Provide state handling for exec() as well
  futex: Sanitize exit state handling
  futex: Mark the begin of futex exit explicitly
  futex: Set task::futex_state to DEAD right after handling futex exit
  futex: Split futex_mm_release() for exit/exec
  exit/exec: Seperate mm_release()
  futex: Replace PF_EXITPIDONE with a state
  ...
2019-11-26 16:02:40 -08:00
Maciej Żenczykowski
66e2f5f7c7 net: inet_is_local_reserved_port() port arg should be unsigned short
Any argument outside of that range would result in an out of bound
memory access, since the accessed array is 65536 bits long.

Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-26 15:15:07 -08:00
Maciej Żenczykowski
82f31ebf61 net: port < inet_prot_sock(net) --> inet_port_requires_bind_service(net, port)
Note that the sysctl write accessor functions guarantee that:
  net->ipv4.sysctl_ip_prot_sock <= net->ipv4.ip_local_ports.range[0]
invariant is maintained, and as such the max() in selinux hooks is actually spurious.

ie. even though
  if (snum < max(inet_prot_sock(sock_net(sk)), low) || snum > high) {
per logic is the same as
  if ((snum < inet_prot_sock(sock_net(sk)) && snum < low) || snum > high) {
it is actually functionally equivalent to:
  if (snum < low || snum > high) {
which is equivalent to:
  if (snum < inet_prot_sock(sock_net(sk)) || snum < low || snum > high) {
even though the first clause is spurious.

But we want to hold on to it in case we ever want to change what what
inet_port_requires_bind_service() means (for example by changing
it from a, by default, [0..1024) range to some sort of set).

Test: builds, git 'grep inet_prot_sock' finds no other references
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-26 13:20:46 -08:00
Linus Torvalds
386403a115 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:
 "Another merge window, another pull full of stuff:

   1) Support alternative names for network devices, from Jiri Pirko.

   2) Introduce per-netns netdev notifiers, also from Jiri Pirko.

   3) Support MSG_PEEK in vsock/virtio, from Matias Ezequiel Vara
      Larsen.

   4) Allow compiling out the TLS TOE code, from Jakub Kicinski.

   5) Add several new tracepoints to the kTLS code, also from Jakub.

   6) Support set channels ethtool callback in ena driver, from Sameeh
      Jubran.

   7) New SCTP events SCTP_ADDR_ADDED, SCTP_ADDR_REMOVED,
      SCTP_ADDR_MADE_PRIM, and SCTP_SEND_FAILED_EVENT. From Xin Long.

   8) Add XDP support to mvneta driver, from Lorenzo Bianconi.

   9) Lots of netfilter hw offload fixes, cleanups and enhancements,
      from Pablo Neira Ayuso.

  10) PTP support for aquantia chips, from Egor Pomozov.

  11) Add UDP segmentation offload support to igb, ixgbe, and i40e. From
      Josh Hunt.

  12) Add smart nagle to tipc, from Jon Maloy.

  13) Support L2 field rewrite by TC offloads in bnxt_en, from Venkat
      Duvvuru.

  14) Add a flow mask cache to OVS, from Tonghao Zhang.

  15) Add XDP support to ice driver, from Maciej Fijalkowski.

  16) Add AF_XDP support to ice driver, from Krzysztof Kazimierczak.

  17) Support UDP GSO offload in atlantic driver, from Igor Russkikh.

  18) Support it in stmmac driver too, from Jose Abreu.

  19) Support TIPC encryption and auth, from Tuong Lien.

  20) Introduce BPF trampolines, from Alexei Starovoitov.

  21) Make page_pool API more numa friendly, from Saeed Mahameed.

  22) Introduce route hints to ipv4 and ipv6, from Paolo Abeni.

  23) Add UDP segmentation offload to cxgb4, Rahul Lakkireddy"

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1857 commits)
  libbpf: Fix usage of u32 in userspace code
  mm: Implement no-MMU variant of vmalloc_user_node_flags
  slip: Fix use-after-free Read in slip_open
  net: dsa: sja1105: fix sja1105_parse_rgmii_delays()
  macvlan: schedule bc_work even if error
  enetc: add support Credit Based Shaper(CBS) for hardware offload
  net: phy: add helpers phy_(un)lock_mdio_bus
  mdio_bus: don't use managed reset-controller
  ax88179_178a: add ethtool_op_get_ts_info()
  mlxsw: spectrum_router: Fix use of uninitialized adjacency index
  mlxsw: spectrum_router: After underlay moves, demote conflicting tunnels
  bpf: Simplify __bpf_arch_text_poke poke type handling
  bpf: Introduce BPF_TRACE_x helper for the tracing tests
  bpf: Add bpf_jit_blinding_enabled for !CONFIG_BPF_JIT
  bpf, testing: Add various tail call test cases
  bpf, x86: Emit patchable direct jump as tail call
  bpf: Constant map key tracking for prog array pokes
  bpf: Add poke dependency tracking for prog array maps
  bpf: Add initial poke descriptor table for jit images
  bpf: Move owner type, jited info into array auxiliary data
  ...
2019-11-25 20:02:57 -08:00
Linus Torvalds
1b96a41b42 Merge branch 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
Pull cgroup updates from Tejun Heo:
 "There are several notable changes here:

   - Single thread migrating itself has been optimized so that it
     doesn't need threadgroup rwsem anymore.

   - Freezer optimization to avoid unnecessary frozen state changes.

   - cgroup ID unification so that cgroup fs ino is the only unique ID
     used for the cgroup and can be used to directly look up live
     cgroups through filehandle interface on 64bit ino archs. On 32bit
     archs, cgroup fs ino is still the only ID in use but it is only
     unique when combined with gen.

   - selftest and other changes"

* 'for-5.5' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup: (24 commits)
  writeback: fix -Wformat compilation warnings
  docs: cgroup: mm: Fix spelling of "list"
  cgroup: fix incorrect WARN_ON_ONCE() in cgroup_setup_root()
  cgroup: use cgrp->kn->id as the cgroup ID
  kernfs: use 64bit inos if ino_t is 64bit
  kernfs: implement custom exportfs ops and fid type
  kernfs: combine ino/id lookup functions into kernfs_find_and_get_node_by_id()
  kernfs: convert kernfs_node->id from union kernfs_node_id to u64
  kernfs: kernfs_find_and_get_node_by_ino() should only look up activated nodes
  kernfs: use dumber locking for kernfs_find_and_get_node_by_ino()
  netprio: use css ID instead of cgroup ID
  writeback: use ino_t for inodes in tracepoints
  kernfs: fix ino wrap-around detection
  kselftests: cgroup: Avoid the reuse of fd after it is deallocated
  cgroup: freezer: don't change task and cgroups status unnecessarily
  cgroup: use cgroup->last_bstat instead of cgroup->bstat_pending for consistency
  cgroup: remove cgroup_enable_task_cg_lists() optimization
  cgroup: pids: use atomic64_t for pids->limit
  selftests: cgroup: Run test_core under interfering stress
  selftests: cgroup: Add task migration tests
  ...
2019-11-25 19:23:46 -08:00
David S. Miller
adf6f8cb3f Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge in networking bug fixes for merge window.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-25 14:57:26 -08:00
Xin Long
312434617c sctp: cache netns in sctp_ep_common
This patch is to fix a data-race reported by syzbot:

  BUG: KCSAN: data-race in sctp_assoc_migrate / sctp_hash_obj

  write to 0xffff8880b67c0020 of 8 bytes by task 18908 on cpu 1:
    sctp_assoc_migrate+0x1a6/0x290 net/sctp/associola.c:1091
    sctp_sock_migrate+0x8aa/0x9b0 net/sctp/socket.c:9465
    sctp_accept+0x3c8/0x470 net/sctp/socket.c:4916
    inet_accept+0x7f/0x360 net/ipv4/af_inet.c:734
    __sys_accept4+0x224/0x430 net/socket.c:1754
    __do_sys_accept net/socket.c:1795 [inline]
    __se_sys_accept net/socket.c:1792 [inline]
    __x64_sys_accept+0x4e/0x60 net/socket.c:1792
    do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

  read to 0xffff8880b67c0020 of 8 bytes by task 12003 on cpu 0:
    sctp_hash_obj+0x4f/0x2d0 net/sctp/input.c:894
    rht_key_get_hash include/linux/rhashtable.h:133 [inline]
    rht_key_hashfn include/linux/rhashtable.h:159 [inline]
    rht_head_hashfn include/linux/rhashtable.h:174 [inline]
    head_hashfn lib/rhashtable.c:41 [inline]
    rhashtable_rehash_one lib/rhashtable.c:245 [inline]
    rhashtable_rehash_chain lib/rhashtable.c:276 [inline]
    rhashtable_rehash_table lib/rhashtable.c:316 [inline]
    rht_deferred_worker+0x468/0xab0 lib/rhashtable.c:420
    process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
    worker_thread+0xa0/0x800 kernel/workqueue.c:2415
    kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
    ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

It was caused by rhashtable access asoc->base.sk when sctp_assoc_migrate
is changing its value. However, what rhashtable wants is netns from asoc
base.sk, and for an asoc, its netns won't change once set. So we can
simply fix it by caching netns since created.

Fixes: d6c0256a60 ("sctp: add the rhashtable apis for sctp global transport hashtable")
Reported-by: syzbot+e3b35fe7918ff0ee474e@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-11-23 18:26:14 -08:00
Maciej Żenczykowski
30429fba99 net: inet_is_local_reserved_port() should return bool not int
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-11-22 16:52:47 -08:00
Jakub Kicinski
a9f852e92e Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor conflict in drivers/s390/net/qeth_l2_main.c, kept the lock
from commit c8183f5489 ("s390/qeth: fix potential deadlock on
workqueue flush"), removed the code which was removed by commit
9897d583b0 ("s390/qeth: consolidate some duplicated HW cmd code").

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-11-22 16:27:24 -08:00
Petr Machata
d1746d1e80 net: flow_dissector: Wrap unionized VLAN fields in a struct
In commit a82055af59 ("netfilter: nft_payload: add VLAN offload
support"), VLAN fields in struct flow_dissector_key_vlan were unionized
with the intention of introducing another field that covered the whole TCI
header. However without a wrapping struct the subfields end up sharing the
same bits. As a result, "tc filter add ... flower vlan_id 14" specifies not
only vlan_id, but also vlan_priority.

Fix by wrapping the individual VLAN fields in a struct.

Fixes: a82055af59 ("netfilter: nft_payload: add VLAN offload support")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-22 09:44:14 -08:00
Toke Høiland-Jørgensen
7a89233ac5 mac80211: Use Airtime-based Queue Limits (AQL) on packet dequeue
The previous commit added the ability to throttle stations when they queue
too much airtime in the hardware. This commit enables the functionality by
calculating the expected airtime usage of each packet that is dequeued from
the TXQs in mac80211, and accounting that as pending airtime.

The estimated airtime for each skb is stored in the tx_info, so we can
subtract the same amount from the running total when the skb is freed or
recycled. The throttling mechanism relies on this accounting to be
accurate (i.e., that we are not freeing skbs without subtracting any
airtime they were accounted for), so we put the subtraction into
ieee80211_report_used_skb(). As an optimisation, we also subtract the
airtime on regular TX completion, zeroing out the value stored in the
packet afterwards, to avoid having to do an expensive lookup of the station
from the packet data on every packet.

This patch does *not* include any mechanism to wake a throttled TXQ again,
on the assumption that this will happen anyway as a side effect of whatever
freed the skb (most commonly a TX completion).

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20191119060610.76681-5-kyan@google.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-22 13:36:25 +01:00
Kan Yan
3ace10f5b5 mac80211: Implement Airtime-based Queue Limit (AQL)
In order for the Fq_CoDel algorithm integrated in mac80211 layer to operate
effectively to control excessive queueing latency, the CoDel algorithm
requires an accurate measure of how long packets stays in the queue, AKA
sojourn time. The sojourn time measured at the mac80211 layer doesn't
include queueing latency in the lower layer (firmware/hardware) and CoDel
expects lower layer to have a short queue. However, most 802.11ac chipsets
offload tasks such TX aggregation to firmware or hardware, thus have a deep
lower layer queue.

Without a mechanism to control the lower layer queue size, packets only
stay in mac80211 layer transiently before being sent to firmware queue.
As a result, the sojourn time measured by CoDel in the mac80211 layer is
almost always lower than the CoDel latency target, hence CoDel does little
to control the latency, even when the lower layer queue causes excessive
latency.

The Byte Queue Limits (BQL) mechanism is commonly used to address the
similar issue with wired network interface. However, this method cannot be
applied directly to the wireless network interface. "Bytes" is not a
suitable measure of queue depth in the wireless network, as the data rate
can vary dramatically from station to station in the same network, from a
few Mbps to over Gbps.

This patch implements an Airtime-based Queue Limit (AQL) to make CoDel work
effectively with wireless drivers that utilized firmware/hardware
offloading. AQL allows each txq to release just enough packets to the lower
layer to form 1-2 large aggregations to keep hardware fully utilized and
retains the rest of the frames in mac80211 layer to be controlled by the
CoDel algorithm.

Signed-off-by: Kan Yan <kyan@google.com>
[ Toke: Keep API to set pending airtime internal, fix nits in commit msg ]
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20191119060610.76681-4-kyan@google.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-22 13:36:25 +01:00
Toke Høiland-Jørgensen
db3e1c40cf mac80211: Import airtime calculation code from mt76
Felix recently added code to calculate airtime of packets to the mt76
driver. Import this into mac80211 so we can use it for airtime queue limit
calculations.

The airtime.c file is copied verbatim from the mt76 driver, and adjusted to
be usable in mac80211. This involves:

- Switching to mac80211 data structures.
- Adding support for 160 MHz channels and HE mode.
- Moving the symbol and duration calculations around a bit to avoid
  rounding with the higher rates and longer symbol times used for HE rates.

The per-rate TX rate calculation is also split out to its own function so
it can be used directly for the AQL calculations later.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20191119060610.76681-3-kyan@google.com
[fix HE_GROUP_IDX() to use 3 * bw, since there are 3 _gi values]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-22 13:36:25 +01:00
Paolo Abeni
02b2494161 ipv4: use dst hint for ipv4 list receive
This is alike the previous change, with some additional ipv4 specific
quirk. Even when using the route hint we still have to do perform
additional per packet checks about source address validity: a new
helper is added to wrap them.

Hints are explicitly disabled if the destination is a local broadcast,
that keeps the code simple and local broadcast are a slower path anyway.

UDP flood performances vs recvmmsg() receiver:

vanilla		patched		delta
Kpps		Kpps		%
1683		1871		+11

In the worst case scenario - each packet has a different
destination address - the performance delta is within noise
range.

v3 -> v4:
 - re-enable hints for forward

v2 -> v3:
 - really fix build (sic) and hint usage check
 - use fib4_has_custom_rules() helpers (David A.)
 - add ip_extract_route_hint() helper (Edward C.)
 - use prev skb as hint instead of copying data (Willem)

v1 -> v2:
 - fix build issue with !CONFIG_IP_MULTIPLE_TABLES

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-21 14:45:55 -08:00
Paolo Abeni
c43c3d76c0 ipv4: move fib4_has_custom_rules() helper to public header
So that we can use it in the next patch.
Additionally constify the helper argument.

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-21 14:45:55 -08:00
Paolo Abeni
b9b33e7c24 ipv6: keep track of routes using src
Use a per namespace counter, increment it on successful creation
of any route using the source address, decrement it on deletion
of such routes.

This allows us to check easily if the routing decision in the
current namespace depends on the packet source. Will be used
by the next patch.

Suggested-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-21 14:45:55 -08:00
Paolo Abeni
1f8ac57030 ipv6: add fib6_has_custom_rules() helper
It wraps the namespace field with the same name, to easily
access it regardless of build options.

Suggested-by: David Ahern <dsahern@gmail.com>
Suggested-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-21 14:45:55 -08:00
Lorenzo Bianconi
e68bc75691 net: page_pool: add the possibility to sync DMA memory for device
Introduce the following parameters in order to add the possibility to sync
DMA memory for device before putting allocated pages in the page_pool
caches:
- PP_FLAG_DMA_SYNC_DEV: if set in page_pool_params flags, all pages that
  the driver gets from page_pool will be DMA-synced-for-device according
  to the length provided by the device driver. Please note DMA-sync-for-CPU
  is still device driver responsibility
- offset: DMA address offset where the DMA engine starts copying rx data
- max_len: maximum DMA memory size page_pool is allowed to flush. This
  is currently used in __page_pool_alloc_pages_slow routine when pages
  are allocated from page allocator
These parameters are supposed to be set by device drivers.

This optimization reduces the length of the DMA-sync-for-device.
The optimization is valid because pages are initially
DMA-synced-for-device as defined via max_len. At RX time, the driver
will perform a DMA-sync-for-CPU on the memory for the packet length.
What is important is the memory occupied by packet payload, because
this is the area CPU is allowed to read and modify. As we don't track
cache-lines written into by the CPU, simply use the packet payload length
as dma_sync_size at page_pool recycle time. This also take into account
any tail-extend.

Tested-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 12:34:28 -08:00
Saeed Mahameed
bc83674870 page_pool: Add API to update numa node
Add page_pool_update_nid() to be called by page pool consumers when they
detect numa node changes.

It will update the page pool nid value to start allocating from the new
effective numa node.

This is to mitigate page pool allocating pages from a wrong numa node,
where the pool was originally allocated, and holding on to pages that
belong to a different numa node, which causes performance degradation.

For pages that are already being consumed and could be returned to the
pool by the consumer, in next patch we will add a check per page to avoid
recycling them back to the pool and return them to the page allocator.

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 11:47:36 -08:00
Pablo Neira Ayuso
a82055af59 netfilter: nft_payload: add VLAN offload support
Match on ethertype and set up protocol dependency. Check for protocol
dependency before accessing the tci field. Allow to match on the
encapsulated ethertype too.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 11:21:34 -08:00
Pablo Neira Ayuso
8819efc943 netfilter: nf_tables_offload: allow ethernet interface type only
Hardware offload support at this stage assumes an ethernet device in
place. The flow dissector provides the intermediate representation to
express this selector, so extend it to allow to store the interface
type. Flower does not uses this, so skb_flow_dissect_meta() is not
extended to match on this new field.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 11:21:34 -08:00
Pablo Neira Ayuso
7cd9a58d68 netfilter: nf_tables: constify nft_reg_load{8, 16, 64}()
This patch constifies the pointer to source register data that is passed
as an input parameter.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-20 11:21:34 -08:00
Willem de Bruijn
d4ffb02dee net/tls: enable sk_msg redirect to tls socket egress
Bring back tls_sw_sendpage_locked. sk_msg redirection into a socket
with TLS_TX takes the following path:

  tcp_bpf_sendmsg_redir
    tcp_bpf_push_locked
      tcp_bpf_push
        kernel_sendpage_locked
          sock->ops->sendpage_locked

Also update the flags test in tls_sw_sendpage_locked to allow flag
MSG_NO_SHARED_FRAGS. bpf_tcp_sendmsg sets this.

Link: https://lore.kernel.org/netdev/CA+FuTSdaAawmZ2N8nfDDKu3XLpXBbMtcCT0q4FntDD2gn8ASUw@mail.gmail.com/T/#t
Link: https://github.com/wdebruij/kerneltools/commits/icept.2
Fixes: 0608c69c9a ("bpf: sk_msg, sock{map|hash} redirect through ULP")
Fixes: f3de19af0f ("Revert \"net/tls: remove unused function tls_sw_sendpage_locked\"")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-19 15:03:02 -08:00
Jesper Dangaard Brouer
7c9e69428d page_pool: add destroy attempts counter and rename tracepoint
When Jonathan change the page_pool to become responsible to its
own shutdown via deferred work queue, then the disconnect_cnt
counter was removed from xdp memory model tracepoint.

This patch change the page_pool_inflight tracepoint name to
page_pool_release, because it reflects the new responsability
better.  And it reintroduces a counter that reflect the number of
times page_pool_release have been tried.

The counter is also used by the code, to only empty the alloc
cache once.  With a stuck work queue running every second and
counter being 64-bit, it will overrun in approx 584 billion
years. For comparison, Earth lifetime expectancy is 7.5 billion
years, before the Sun will engulf, and destroy, the Earth.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-18 17:03:18 -08:00
David S. Miller
99638e9d6c Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Wildcard support for the net,iface set from Kristian Evensen.

2) Offload support for matching on the input interface.

3) Simplify matching on vlan header fields.

4) Add nft_payload_rebuild_vlan_hdr() function to rebuild the vlan
   header from the vlan sk_buff metadata.

5) Pass extack to nft_flow_cls_offload_setup().

6) Add C-VLAN matching support.

7) Use time64_t in xt_time to fix y2038 overflow, from Arnd Bergmann.

8) Use time_t in nft_meta to fix y2038 overflow, also from Arnd.

9) Add flow_action_entry_next() helper function to flowtable offload
   infrastructure.

10) Add IPv6 support to the flowtable offload infrastructure.

11) Support for input interface matching from postrouting,
    from Phil Sutter.

12) Missing check for ndo callback in flowtable offload, from wenxu.

13) Remove conntrack parameter from flow_offload_fill_dir(), from wenxu.

14) Do not pass flow_rule object for rule removal, cookie is sufficient
    to achieve this.

15) Release flow_rule object in case of error from the offload commit
    path.

16) Undo offload ruleset updates if transaction fails.

17) Check for error when binding flowtable callbacks, from wenxu.

18) Always unbind flowtable callbacks when unregistering hooks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-18 16:43:05 -08:00
Jonathan Lemon
c3f812cea0 page_pool: do not release pool until inflight == 0.
The page pool keeps track of the number of pages in flight, and
it isn't safe to remove the pool until all pages are returned.

Disallow removing the pool until all pages are back, so the pool
is always available for page producers.

Make the page pool responsible for its own delayed destruction
instead of relying on XDP, so the page pool can be used without
the xdp memory model.

When all pages are returned, free the pool and notify xdp if the
pool is registered with the xdp memory system.  Have the callback
perform a table walk since some drivers (cpsw) may share the pool
among multiple xdp_rxq_info.

Note that the increment of pages_state_release_cnt may result in
inflight == 0, resulting in the pool being released.

Fixes: d956a048cd ("xdp: force mem allocator removal and periodic warning")
Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-16 12:39:10 -08:00
Pablo Neira Ayuso
5c27d8d76c netfilter: nf_flow_table_offload: add IPv6 support
Add nf_flow_rule_route_ipv6() and use it from the IPv6 and the inet
flowtable type definitions. Rename the nf_flow_rule_route() function to
nf_flow_rule_route_ipv4().

Adjust maximum number of actions, which now becomes 16 to leave
sufficient room for the IPv6 address mangling for NAT.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-15 23:44:47 +01:00
Vladimir Oltean
8dce89aa5f net: dsa: ocelot: add tagger for Ocelot/Felix switches
While it is entirely possible that this tagger format is in fact more
generic than just these 2 switch families, I don't have that knowledge.
The Seville switch in NXP T1040 has a similar frame format, but there
are enough differences (e.g. DEST field starts at bit 57 instead of 56)
that calling this file tag_vitesse.c is a bit of a stretch at the
moment. The frame format has been listed in a comment so that people who
add support for further Vitesse switches can rework this tagger while
keeping compatibility with Felix.

The "ocelot" name was chosen instead of "felix" because even the Ocelot
switch can act as a DSA device when it is used in NPI mode, and the Felix
tagger format is almost identical. Currently it is only used for the
Felix switch embedded in the NXP LS1028A chip.

The ABI for this tagger should be considered "not stable" at the moment.
The DSA tag is always placed before the Ethernet header and therefore,
we are using the long prefix for RX tags to avoid putting the DSA master
port in promiscuous mode. Once there will be an API in DSA for drivers
to request DSA masters to be in promiscuous mode unconditionally, we
will switch to the "no prefix" extraction frame header, which will save
16 padding bytes for each RX frame.

Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 12:32:16 -08:00
Ursula Braun
5edd6b9cb8 net/smc: introduce bookkeeping of SMCD link groups
If the ism module is unloaded return control from exit routine only,
if all link groups are freed.
If an IB device is thrown away return control from device removal only,
if all link groups belonging to this device are freed.
A counters for the total number of SMCD link groups per ISM device is
introduced. ism module unloading continues only if the total number of
SMCD link groups for all ISM devices is zero. ISM device
removal continues only it the total number of SMCD link groups per ISM
device has decreased to zero.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 12:28:28 -08:00
Ursula Braun
42bfba9eaa net/smc: immediate termination for SMCD link groups
SMCD link group termination is called when peer signals its shutdown
of its corresponding link group. For regular shutdowns no connections
exist anymore. For abnormal shutdowns connections must be killed and
their DMBs must be unregistered immediately. That means the SMCR method
to delay the link group freeing several seconds does not fit.

This patch adds immediate termination of a link group and its SMCD
connections and makes sure all SMCD link group related cleanup steps
are finished.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-15 12:28:28 -08:00
Stefano Garzarella
6a2c096210 vsock: prevent transport modules unloading
This patch adds 'module' member in the 'struct vsock_transport'
in order to get/put the transport module. This prevents the
module unloading while sockets are assigned to it.

We increase the module refcnt when a socket is assigned to a
transport, and we decrease the module refcnt when the socket
is destructed.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-14 18:12:18 -08:00
Stefano Garzarella
c0cfa2d8a7 vsock: add multi-transports support
This patch adds the support of multiple transports in the
VSOCK core.

With the multi-transports support, we can use vsock with nested VMs
(using also different hypervisors) loading both guest->host and
host->guest transports at the same time.

Major changes:
- vsock core module can be loaded regardless of the transports
- vsock_core_init() and vsock_core_exit() are renamed to
  vsock_core_register() and vsock_core_unregister()
- vsock_core_register() has a feature parameter (H2G, G2H, DGRAM)
  to identify which directions the transport can handle and if it's
  support DGRAM (only vmci)
- each stream socket is assigned to a transport when the remote CID
  is set (during the connect() or when we receive a connection request
  on a listener socket).
  The remote CID is used to decide which transport to use:
  - remote CID <= VMADDR_CID_HOST will use guest->host transport;
  - remote CID == local_cid (guest->host transport) will use guest->host
    transport for loopback (host->guest transports don't support loopback);
  - remote CID > VMADDR_CID_HOST will use host->guest transport;
- listener sockets are not bound to any transports since no transport
  operations are done on it. In this way we can create a listener
  socket, also if the transports are not loaded or with VMADDR_CID_ANY
  to listen on all transports.
- DGRAM sockets are handled as before, since only the vmci_transport
  provides this feature.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-14 18:12:18 -08:00
Stefano Garzarella
b9ca2f5ff7 vsock: add vsock_create_connected() called by transports
All transports call __vsock_create() with the same parameters,
most of them depending on the parent socket. In order to simplify
the VSOCK core APIs exposed to the transports, this patch adds
the vsock_create_connected() callable from transports to create
a new socket when a connection request is received.
We also unexported the __vsock_create().

Suggested-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-14 18:12:18 -08:00
Stefano Garzarella
b9f2b0ffde vsock: handle buffer_size sockopts in the core
virtio_transport and vmci_transport handle the buffer_size
sockopts in a very similar way.

In order to support multiple transports, this patch moves this
handling in the core to allow the user to change the options
also if the socket is not yet assigned to any transport.

This patch also adds the '.notify_buffer_size' callback in the
'struct virtio_transport' in order to inform the transport,
when the buffer_size is changed by the user. It is also useful
to limit the 'buffer_size' requested (e.g. virtio transports).

Acked-by: Dexuan Cui <decui@microsoft.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-14 18:12:18 -08:00
Stefano Garzarella
daabfbca34 vsock: add 'struct vsock_sock *' param to vsock_core_get_transport()
Since now the 'struct vsock_sock' object contains a pointer to
the transport, this patch adds a parameter to the
vsock_core_get_transport() to return the right transport
assigned to the socket.

This patch modifies also the virtio_transport_get_ops(), that
uses the vsock_core_get_transport(), adding the
'struct vsock_sock *' parameter.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-14 18:12:18 -08:00
Stefano Garzarella
fe502c4a38 vsock: add 'transport' member in the struct vsock_sock
As a preparation to support multiple transports, this patch adds
the 'transport' member at the 'struct vsock_sock'.
This new field is initialized during the creation in the
__vsock_create() function.

This patch also renames the global 'transport' pointer to
'transport_single', since for now we're only supporting a single
transport registered at run-time.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-14 18:12:18 -08:00
Stefano Garzarella
3603a2e991 vsock: remove include/linux/vm_sockets.h file
This header file now only includes the "uapi/linux/vm_sockets.h".
We can include directly it when needed.

Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-14 18:12:17 -08:00
Saeed Mahameed
c94ef13b04 Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
1) New generic devlink param "enable_roce", for downstream devlink
   reload support

2) Do vport ACL configuration on per vport basis when
   enabling/disabling a vport. This enables to have vports enabled/disabled
   outside of eswitch config for future

3) Split the code for legacy vs offloads mode and make it clear

4) Tide up vport locking and workqueue usage

5) Fix metadata enablement for ECPF

6) Make explicit use of VF property to publish IB_DEVICE_VIRTUAL_FUNCTION

7) E-Switch and flow steering core low level support and refactoring for
   netfilter flowtables offload

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-11-13 14:24:58 -08:00
David S. Miller
895badff2c Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2019-11-13

1) Remove a unnecessary net_exit function from the xfrm interface.
   From Xin Long.

2) Assign xfrm4_udp_encap_rcv to a UDP socket only if xfrm
   is configured. From Alexey Dobriyan.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-13 11:28:54 -08:00
Pablo Neira Ayuso
25da5eb32c netfilter: nft_meta: offload support for interface index
This patch adds support for offloading the NFT_META_IIF selector.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-13 10:41:34 +01:00
Pablo Neira Ayuso
c29f74e0df netfilter: nf_flow_table: hardware offload support
This patch adds the dataplane hardware offload to the flowtable
infrastructure. Three new flags represent the hardware state of this
flow:

* FLOW_OFFLOAD_HW: This flow entry resides in the hardware.
* FLOW_OFFLOAD_HW_DYING: This flow entry has been scheduled to be remove
  from hardware. This might be triggered by either packet path (via TCP
  RST/FIN packet) or via aging.
* FLOW_OFFLOAD_HW_DEAD: This flow entry has been already removed from
  the hardware, the software garbage collector can remove it from the
  software flowtable.

This patch supports for:

* IPv4 only.
* Aging via FLOW_CLS_STATS, no packet and byte counter synchronization
  at this stage.

This patch also adds the action callback that specifies how to convert
the flow entry into the flow_rule object that is passed to the driver.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 19:42:26 -08:00
Pablo Neira Ayuso
8bb69f3b29 netfilter: nf_tables: add flowtable offload control plane
This patch adds the NFTA_FLOWTABLE_FLAGS attribute that allows users to
specify the NF_FLOWTABLE_HW_OFFLOAD flag. This patch also adds a new
setup interface for the flowtable type to perform the flowtable offload
block callback configuration.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 19:42:26 -08:00
Pablo Neira Ayuso
f1363e058b netfilter: nf_flow_table: detach routing information from flow description
This patch adds the infrastructure to support for flow entry types.
The initial type is NF_FLOW_OFFLOAD_ROUTE that stores the routing
information into the flow entry to define a fastpath for the classic
forwarding path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 19:42:26 -08:00
Pablo Neira Ayuso
62248df88a netfilter: nf_flowtable: remove flow_offload_entry structure
Move rcu_head to struct flow_offload, then remove the flow_offload_entry
structure definition.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 19:42:26 -08:00
Pablo Neira Ayuso
9f48e9bf25 netfilter: nf_flow_table: remove union from flow_offload structure
Drivers do not have access to the flow_offload structure, hence remove
this union from this flow_offload object as well as the original comment
on top of it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 19:42:26 -08:00
Pablo Neira Ayuso
b32d2f3416 netfilter: nf_flow_table: move conntrack object to struct flow_offload
Simplify this code by storing the pointer to conntrack object in the
flow_offload structure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 19:42:26 -08:00
Davide Caratti
e0e2b35b79 net/sched: actions: remove unused 'order'
after commit 4097e9d250 ("net: sched: don't use tc_action->order during
action dump"), 'act->order' is initialized but then it's no more read, so
we can just remove this member of struct tc_action.

CC: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 12:11:22 -08:00
Aya Levin
e2cde864a1 devlink: Allow large formatted message of binary output
Devlink supports pair output of name and value. When the value is
binary, it must be presented in an array. If the length of the binary
value exceeds fmsg limitation, break the value into chunks internally.

Signed-off-by: Aya Levin <ayal@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-12 11:25:44 -08:00
Tejun Heo
db53c73a8b netprio: use css ID instead of cgroup ID
netprio uses cgroup ID to index the priority mapping table.  This is
currently okay as cgroup IDs are allocated using idr and packed.
However, cgroup IDs will be changed to use full 64bit range and won't
be packed making this impractical.  netprio doesn't care what type of
IDs it uses as long as they can identify the controller instances and
are packed.  Let's switch to css IDs instead of cgroup IDs.

Signed-off-by: Tejun Heo <tj@kernel.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Namhyung Kim <namhyung@kernel.org>
2019-11-12 08:18:03 -08:00
Michael Guralnik
6c7295e13f devlink: Add new "enable_roce" generic device param
New device parameter to enable/disable handling of RoCE traffic in the
device.

Signed-off-by: Michael Guralnik <michaelgur@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Maor Gottlieb <maorg@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-11-11 12:15:29 -08:00
Jiri Pirko
5a508a254b devlink: disallow reload operation during device cleanup
There is a race between driver code that does setup/cleanup of device
and devlink reload operation that in some drivers works with the same
code. Use after free could we easily obtained by running:

while true; do
        echo "0000:00:10.0" >/sys/bus/pci/drivers/mlxsw_spectrum2/bind
        devlink dev reload pci/0000:00:10.0 &
        echo "0000:00:10.0" >/sys/bus/pci/drivers/mlxsw_spectrum2/unbind
done

Fix this by enabling reload only after setup of device is complete and
disabling it at the beginning of the cleanup process.

Reported-by: Ido Schimmel <idosch@mellanox.com>
Fixes: 2d8dc5bbf4 ("devlink: Add support for reload")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-09 19:38:36 -08:00
David S. Miller
14684b9301 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
One conflict in the BPF samples Makefile, some fixes in 'net' whilst
we were converting over to Makefile.target rules in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-09 11:04:37 -08:00
Xin Long
34515e94c9 sctp: add support for Primary Path Switchover
This is a new feature defined in section 5 of rfc7829: "Primary Path
Switchover". By introducing a new tunable parameter:

  Primary.Switchover.Max.Retrans (PSMR)

The primary path will be changed to another active path when the path
error counter on the old primary path exceeds PSMR, so that "the SCTP
sender is allowed to continue data transmission on a new working path
even when the old primary destination address becomes active again".

This patch is to add this tunable parameter, 'ps_retrans' per netns,
sock, asoc and transport. It also allows a user to change ps_retrans
per netns by sysctl, and ps_retrans per sock/asoc/transport will be
initialized with it.

The check will be done in sctp_do_8_2_transport_strike() when this
feature is enabled.

Note this feature is disabled by initializing 'ps_retrans' per netns
as 0xffff by default, and its value can't be less than 'pf_retrans'
when changing by sysctl.

v3->v4:
  - add define SCTP_PS_RETRANS_MAX 0xffff, and use it on extra2 of
    sysctl 'ps_retrans'.
  - add a new entry for ps_retrans on ip-sysctl.txt.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 14:18:32 -08:00
Xin Long
aef587be42 sctp: add pf_expose per netns and sock and asoc
As said in rfc7829, section 3, point 12:

  The SCTP stack SHOULD expose the PF state of its destination
  addresses to the ULP as well as provide the means to notify the
  ULP of state transitions of its destination addresses from
  active to PF, and vice versa.  However, it is recommended that
  an SCTP stack implementing SCTP-PF also allows for the ULP to be
  kept ignorant of the PF state of its destinations and the
  associated state transitions, thus allowing for retention of the
  simpler state transition model of [RFC4960] in the ULP.

Not only does it allow to expose the PF state to ULP, but also
allow to ignore sctp-pf to ULP.

So this patch is to add pf_expose per netns, sock and asoc. And in
sctp_assoc_control_transport(), ulp_notify will be set to false if
asoc->expose is not 'enabled' in next patch.

It also allows a user to change pf_expose per netns by sysctl, and
pf_expose per sock and asoc will be initialized with it.

Note that pf_expose also works for SCTP_GET_PEER_ADDR_INFO sockopt,
to not allow a user to query the state of a sctp-pf peer address
when pf_expose is 'disabled', as said in section 7.3.

v1->v2:
  - Fix a build warning noticed by Nathan Chancellor.
v2->v3:
  - set pf_expose to UNUSED by default to keep compatible with old
    applications.
v3->v4:
  - add a new entry for pf_expose on ip-sysctl.txt, as Marcelo suggested.
  - change this patch to 1/5, and move sctp_assoc_control_transport
    change into 2/5, as Marcelo suggested.
  - use SCTP_PF_EXPOSE_UNSET instead of SCTP_PF_EXPOSE_UNUSED, and
    set SCTP_PF_EXPOSE_UNSET to 0 in enum, as Marcelo suggested.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 14:18:32 -08:00
Jiri Pirko
a0c76345e3 devlink: disallow reload operation during device cleanup
There is a race between driver code that does setup/cleanup of device
and devlink reload operation that in some drivers works with the same
code. Use after free could we easily obtained by running:

while true; do
        echo 10 > /sys/bus/netdevsim/new_device
        devlink dev reload netdevsim/netdevsim10 &
        echo 10 > /sys/bus/netdevsim/del_device
done

Fix this by enabling reload only after setup of device is complete and
disabling it at the beginning of the cleanup process.

Reported-by: Ido Schimmel <idosch@mellanox.com>
Fixes: 2d8dc5bbf4 ("devlink: Add support for reload")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 14:11:47 -08:00
Eric Dumazet
1b53d64435 net: fix data-race in neigh_event_send()
KCSAN reported the following data-race [1]

The fix will also prevent the compiler from optimizing out
the condition.

[1]

BUG: KCSAN: data-race in neigh_resolve_output / neigh_resolve_output

write to 0xffff8880a41dba78 of 8 bytes by interrupt on cpu 1:
 neigh_event_send include/net/neighbour.h:443 [inline]
 neigh_resolve_output+0x78/0x480 net/core/neighbour.c:1474
 neigh_output include/net/neighbour.h:511 [inline]
 ip_finish_output2+0x4af/0xe40 net/ipv4/ip_output.c:228
 __ip_finish_output net/ipv4/ip_output.c:308 [inline]
 __ip_finish_output+0x23a/0x490 net/ipv4/ip_output.c:290
 ip_finish_output+0x41/0x160 net/ipv4/ip_output.c:318
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip_output+0xdf/0x210 net/ipv4/ip_output.c:432
 dst_output include/net/dst.h:436 [inline]
 ip_local_out+0x74/0x90 net/ipv4/ip_output.c:125
 __ip_queue_xmit+0x3a8/0xa40 net/ipv4/ip_output.c:532
 ip_queue_xmit+0x45/0x60 include/net/ip.h:237
 __tcp_transmit_skb+0xe81/0x1d60 net/ipv4/tcp_output.c:1169
 tcp_transmit_skb net/ipv4/tcp_output.c:1185 [inline]
 __tcp_retransmit_skb+0x4bd/0x15f0 net/ipv4/tcp_output.c:2976
 tcp_retransmit_skb+0x36/0x1a0 net/ipv4/tcp_output.c:2999
 tcp_retransmit_timer+0x719/0x16d0 net/ipv4/tcp_timer.c:515
 tcp_write_timer_handler+0x42d/0x510 net/ipv4/tcp_timer.c:598
 tcp_write_timer+0xd1/0xf0 net/ipv4/tcp_timer.c:618

read to 0xffff8880a41dba78 of 8 bytes by interrupt on cpu 0:
 neigh_event_send include/net/neighbour.h:442 [inline]
 neigh_resolve_output+0x57/0x480 net/core/neighbour.c:1474
 neigh_output include/net/neighbour.h:511 [inline]
 ip_finish_output2+0x4af/0xe40 net/ipv4/ip_output.c:228
 __ip_finish_output net/ipv4/ip_output.c:308 [inline]
 __ip_finish_output+0x23a/0x490 net/ipv4/ip_output.c:290
 ip_finish_output+0x41/0x160 net/ipv4/ip_output.c:318
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip_output+0xdf/0x210 net/ipv4/ip_output.c:432
 dst_output include/net/dst.h:436 [inline]
 ip_local_out+0x74/0x90 net/ipv4/ip_output.c:125
 __ip_queue_xmit+0x3a8/0xa40 net/ipv4/ip_output.c:532
 ip_queue_xmit+0x45/0x60 include/net/ip.h:237
 __tcp_transmit_skb+0xe81/0x1d60 net/ipv4/tcp_output.c:1169
 tcp_transmit_skb net/ipv4/tcp_output.c:1185 [inline]
 __tcp_retransmit_skb+0x4bd/0x15f0 net/ipv4/tcp_output.c:2976
 tcp_retransmit_skb+0x36/0x1a0 net/ipv4/tcp_output.c:2999
 tcp_retransmit_timer+0x719/0x16d0 net/ipv4/tcp_timer.c:515
 tcp_write_timer_handler+0x42d/0x510 net/ipv4/tcp_timer.c:598

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 13:59:53 -08:00
Eric Dumazet
90b2be27bb net/sched: annotate lockless accesses to qdisc->empty
KCSAN reported the following race [1]

BUG: KCSAN: data-race in __dev_queue_xmit / net_tx_action

read to 0xffff8880ba403508 of 1 bytes by task 21814 on cpu 1:
 __dev_xmit_skb net/core/dev.c:3389 [inline]
 __dev_queue_xmit+0x9db/0x1b40 net/core/dev.c:3761
 dev_queue_xmit+0x21/0x30 net/core/dev.c:3825
 neigh_hh_output include/net/neighbour.h:500 [inline]
 neigh_output include/net/neighbour.h:509 [inline]
 ip6_finish_output2+0x873/0xec0 net/ipv6/ip6_output.c:116
 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
 __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
 dst_output include/net/dst.h:436 [inline]
 ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
 ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
 udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
 udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
 inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0x9f/0xc0 net/socket.c:657
 ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
 __sys_sendmmsg+0x123/0x350 net/socket.c:2413
 __do_sys_sendmmsg net/socket.c:2442 [inline]
 __se_sys_sendmmsg net/socket.c:2439 [inline]
 __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

write to 0xffff8880ba403508 of 1 bytes by interrupt on cpu 0:
 qdisc_run_begin include/net/sch_generic.h:160 [inline]
 qdisc_run include/net/pkt_sched.h:120 [inline]
 net_tx_action+0x2b1/0x6c0 net/core/dev.c:4551
 __do_softirq+0x115/0x33f kernel/softirq.c:292
 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
 do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
 do_softirq kernel/softirq.c:329 [inline]
 __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189
 local_bh_enable include/linux/bottom_half.h:32 [inline]
 rcu_read_unlock_bh include/linux/rcupdate.h:688 [inline]
 ip6_finish_output2+0x7bb/0xec0 net/ipv6/ip6_output.c:117
 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
 __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
 dst_output include/net/dst.h:436 [inline]
 ip6_local_out+0x74/0x90 net/ipv6/output_core.c:179
 ip6_send_skb+0x53/0x110 net/ipv6/ip6_output.c:1795
 udp_v6_send_skb.isra.0+0x3ec/0xa70 net/ipv6/udp.c:1173
 udpv6_sendmsg+0x1906/0x1c20 net/ipv6/udp.c:1471
 inet6_sendmsg+0x6d/0x90 net/ipv6/af_inet6.c:576
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0x9f/0xc0 net/socket.c:657
 ___sys_sendmsg+0x2b7/0x5d0 net/socket.c:2311
 __sys_sendmmsg+0x123/0x350 net/socket.c:2413
 __do_sys_sendmmsg net/socket.c:2442 [inline]
 __se_sys_sendmmsg net/socket.c:2439 [inline]
 __x64_sys_sendmmsg+0x64/0x80 net/socket.c:2439
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 21817 Comm: syz-executor.2 Not tainted 5.4.0-rc6+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: d518d2ed86 ("net/sched: fix race between deactivation and dequeue for NOLOCK qdisc")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 12:21:55 -08:00
David S. Miller
5bd2ce6aa5 Some relatively small changes:
* typo fixes in docs
  * APIs for station separation using VLAN tags rather
    than separate wifi netdevs
  * some preparations for upcoming features (802.3 offload
    and airtime queue limits (AQL)
  * stack reduction in ieee80211_assoc_success()
  * use DEFINE_DEBUGFS_ATTRIBUTE in hwsim
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl3FYNEACgkQB8qZga/f
 l8TQig//aZIXq2qlrci+Tzydi5clhrog8mzGBOECRUrgAFUZB9Q1qSD0IvqQrU8o
 3xY+9lDxYvaOYKAm83oyk2st8FIgGXwPmrfPMEodFlFWla4Dkev/n8+CdbVgJDGI
 9j0+9tPsQFwvIIOwD1OUKzJyRUIwmJ7+f8DRhMZ8LlkHZwSx217yD1RGq8zbyXUf
 tIndHM2yAY0H9ip1GPaTiGMyWN9wbpcllNq6h126OJZ0kGRLbY9X9ySoMIw8zSPF
 ul1qofvhcdN8BEOqCd97JIsDF6d3fG/LZyNm+UM415C/Se9bSG1O9i9AakGkk2SW
 zqSjkhqeBK5DUfMqGlh6uxx+1CH46vTZhBKuYIe2uWrdia14d88qkQE88R4cRcir
 dTJHUP90XOXXh3DrDSgEsfb341E5/IMVgUIIu6U72efNqZ8eMCbsXnp8Bnpervtw
 sBPYC4aTzU5wBWbl8XDgm0KWx+nh06Ah5+PXW7gzkKZQpW8qOJp4ATcDkqHrMPBf
 OFd8zhYKULBVRH8dsshzXdeVOurNwazJnEN/h72U1eZTa7ZxFL+CMShpY20d4t9I
 ldh6EG//b/D7WjBIpbs3VBBQ6fnblwtq41GUS0TgndoDUD1wmBf3mJL9ZvEsYHwC
 OmsTaAcsAkHBPdOZDpSyyGYddxXp7YNCT6ua3lNdRAbVn5Y0A/8=
 =gDUI
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-net-next-2019-11-08' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Some relatively small changes:
 * typo fixes in docs
 * APIs for station separation using VLAN tags rather
   than separate wifi netdevs
 * some preparations for upcoming features (802.3 offload
   and airtime queue limits (AQL)
 * stack reduction in ieee80211_assoc_success()
 * use DEFINE_DEBUGFS_ATTRIBUTE in hwsim
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-08 11:36:48 -08:00
Gurumoorthi Gnanasambandhan
14f34e36b3 cfg80211: VLAN offload support for set_key and set_sta_vlan
This provides an alternative mechanism for AP VLAN support where a
single netdev is used with VLAN tagged frames instead of separate
netdevs for each VLAN without tagged frames from the WLAN driver.

By setting NL80211_EXT_FEATURE_VLAN_OFFLOAD flag the driver indicates
support for a single netdev with VLAN tagged frames. Separate
VLAN-specific netdevs can be added using RTM_NEWLINK/IFLA_VLAN_ID
similarly to Ethernet. NL80211_CMD_NEW_KEY (for group keys),
NL80211_CMD_NEW_STATION, and NL80211_CMD_SET_STATION will optionally
specify vlan_id using NL80211_ATTR_VLAN_ID.

Signed-off-by: Gurumoorthi Gnanasambandhan <gguru@codeaurora.org>
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Link: https://lore.kernel.org/r/20191031214640.5012-1-jouni@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-08 11:19:19 +01:00
Toke Høiland-Jørgensen
6912daed05 mac80211: Shrink the size of ack_frame_id to make room for tx_time_est
To implement airtime queue limiting, we need to keep a running account of
the estimated airtime of all skbs queued into the device. Do to this
correctly, we need to store the airtime estimate into the skb so we can
decrease the outstanding balance when the skb is freed. This means that the
time estimate must be stored somewhere that will survive for the lifetime
of the skb.

To get this, decrease the size of the ack_frame_id field to 6 bits, and
lower the size of the ID space accordingly. This leaves 10 bits for use for
tx_time_est, which is enough to store a maximum of 4096 us, if we shift the
values so they become units of 4us.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/157182474063.150713.16132669599100802716.stgit@toke.dk
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-08 10:19:19 +01:00
Toke Høiland-Jørgensen
71e67c3bd1 net/fq_impl: Switch to kvmalloc() for memory allocation
The FQ implementation used by mac80211 allocates memory using kmalloc(),
which can fail; and Johannes reported that this actually happens in
practice.

To avoid this, switch the allocation to kvmalloc() instead; this also
brings fq_impl in line with all the FQ qdiscs.

Fixes: 557fc4a098 ("fq: add fair queuing framework")
Reported-by: Johannes Berg <johannes@sipsolutions.net>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Link: https://lore.kernel.org/r/20191105155750.547379-1-toke@redhat.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-11-08 09:11:49 +01:00
Eric Dumazet
c305c6ae79 net: add annotations on hh->hh_len lockless accesses
KCSAN reported a data-race [1]

While we can use READ_ONCE() on the read sides,
we need to make sure hh->hh_len is written last.

[1]

BUG: KCSAN: data-race in eth_header_cache / neigh_resolve_output

write to 0xffff8880b9dedcb8 of 4 bytes by task 29760 on cpu 0:
 eth_header_cache+0xa9/0xd0 net/ethernet/eth.c:247
 neigh_hh_init net/core/neighbour.c:1463 [inline]
 neigh_resolve_output net/core/neighbour.c:1480 [inline]
 neigh_resolve_output+0x415/0x470 net/core/neighbour.c:1470
 neigh_output include/net/neighbour.h:511 [inline]
 ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
 __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
 dst_output include/net/dst.h:436 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ndisc_send_skb+0x459/0x5f0 net/ipv6/ndisc.c:505
 ndisc_send_ns+0x207/0x430 net/ipv6/ndisc.c:647
 rt6_probe_deferred+0x98/0xf0 net/ipv6/route.c:615
 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
 worker_thread+0xa0/0x800 kernel/workqueue.c:2415
 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

read to 0xffff8880b9dedcb8 of 4 bytes by task 29572 on cpu 1:
 neigh_resolve_output net/core/neighbour.c:1479 [inline]
 neigh_resolve_output+0x113/0x470 net/core/neighbour.c:1470
 neigh_output include/net/neighbour.h:511 [inline]
 ip6_finish_output2+0x7a2/0xec0 net/ipv6/ip6_output.c:116
 __ip6_finish_output net/ipv6/ip6_output.c:142 [inline]
 __ip6_finish_output+0x2d7/0x330 net/ipv6/ip6_output.c:127
 ip6_finish_output+0x41/0x160 net/ipv6/ip6_output.c:152
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0xf2/0x280 net/ipv6/ip6_output.c:175
 dst_output include/net/dst.h:436 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ndisc_send_skb+0x459/0x5f0 net/ipv6/ndisc.c:505
 ndisc_send_ns+0x207/0x430 net/ipv6/ndisc.c:647
 rt6_probe_deferred+0x98/0xf0 net/ipv6/route.c:615
 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
 worker_thread+0xa0/0x800 kernel/workqueue.c:2415
 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 29572 Comm: kworker/1:4 Not tainted 5.4.0-rc6+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events rt6_probe_deferred

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07 20:07:30 -08:00
Amit Cohen
3b063ae57b devlink: Add layer 3 generic packet exception traps
Add layer 3 generic packet exception traps that can report trapped
packets and documentation of the traps.

Unlike drop traps, these exception traps also need to inject the packet
to the kernel's receive path. For example, a packet that was trapped due
to unreachable neighbour need to be injected into the kernel so that it
will trigger an ARP request or a neighbour solicitation message.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07 19:51:40 -08:00
Amit Cohen
6896cc4d8f devlink: Add layer 3 generic packet traps
Add packet traps that can report packets that were dropped during layer
3 forwarding.

Signed-off-by: Amit Cohen <amitc@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07 19:51:40 -08:00
Eric Dumazet
200ecef67b tcp: Remove one extra ktime_get_ns() from cookie_init_timestamp
tcp_make_synack() already uses tcp_clock_ns(), and can pass
the value to cookie_init_timestamp() to avoid another call
to ktime_get_ns() helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-07 16:17:52 -08:00
Eric Dumazet
9ed498c628 net: silence data-races on sk_backlog.tail
sk->sk_backlog.tail might be read without holding the socket spinlock,
we need to add proper READ_ONCE()/WRITE_ONCE() to silence the warnings.

KCSAN reported :

BUG: KCSAN: data-race in tcp_add_backlog / tcp_recvmsg

write to 0xffff8881265109f8 of 8 bytes by interrupt on cpu 1:
 __sk_add_backlog include/net/sock.h:907 [inline]
 sk_add_backlog include/net/sock.h:938 [inline]
 tcp_add_backlog+0x476/0xce0 net/ipv4/tcp_ipv4.c:1759
 tcp_v4_rcv+0x1a70/0x1bd0 net/ipv4/tcp_ipv4.c:1947
 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:442 [inline]
 ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:4929
 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5043
 netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5133
 napi_skb_finish net/core/dev.c:5596 [inline]
 napi_gro_receive+0x28f/0x330 net/core/dev.c:5629
 receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
 virtnet_receive drivers/net/virtio_net.c:1323 [inline]
 virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
 napi_poll net/core/dev.c:6311 [inline]
 net_rx_action+0x3ae/0xa90 net/core/dev.c:6379
 __do_softirq+0x115/0x33f kernel/softirq.c:292
 invoke_softirq kernel/softirq.c:373 [inline]
 irq_exit+0xbb/0xe0 kernel/softirq.c:413
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 do_IRQ+0xa6/0x180 arch/x86/kernel/irq.c:263
 ret_from_intr+0x0/0x19
 native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71
 arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571
 default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
 cpuidle_idle_call kernel/sched/idle.c:154 [inline]
 do_idle+0x1af/0x280 kernel/sched/idle.c:263
 cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:355
 start_secondary+0x208/0x260 arch/x86/kernel/smpboot.c:264
 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241

read to 0xffff8881265109f8 of 8 bytes by task 8057 on cpu 0:
 tcp_recvmsg+0x46e/0x1b40 net/ipv4/tcp.c:2050
 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
 sock_recvmsg_nosec net/socket.c:871 [inline]
 sock_recvmsg net/socket.c:889 [inline]
 sock_recvmsg+0x92/0xb0 net/socket.c:885
 sock_read_iter+0x15f/0x1e0 net/socket.c:967
 call_read_iter include/linux/fs.h:1889 [inline]
 new_sync_read+0x389/0x4f0 fs/read_write.c:414
 __vfs_read+0xb1/0xc0 fs/read_write.c:427
 vfs_read fs/read_write.c:461 [inline]
 vfs_read+0x143/0x2c0 fs/read_write.c:446
 ksys_read+0xd5/0x1b0 fs/read_write.c:587
 __do_sys_read fs/read_write.c:597 [inline]
 __se_sys_read fs/read_write.c:595 [inline]
 __x64_sys_read+0x4c/0x60 fs/read_write.c:595
 do_syscall_64+0xcc/0x370 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 8057 Comm: syz-fuzzer Not tainted 5.4.0-rc6+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 21:35:34 -08:00
David S. Miller
53ba60afb1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Missing register size validation in bitwise and cmp offloads.

2) Fix error code in ip_set_sockfn_get() when copy_to_user() fails,
   from Dan Carpenter.

3) Oneliner to copy MAC address in IPv6 hash:ip,mac sets, from
   Stefano Brivio.

4) Missing policy validation in ipset with NL_VALIDATE_STRICT,
   from Jozsef Kadlecsik.

5) Fix unaligned access to private data area of nf_tables instructions,
   from Lukas Wunner.

6) Relax check for object updates, reported as a regression by
   Eric Garver, patch from Fernando Fernandez Mancera.

7) Crash on ebtables dnat extension when used from the output path.
   From Florian Westphal.

8) Fix bogus EOPNOTSUPP when updating basechain flags.

9) Fix bogus EBUSY when updating a basechain that is already offloaded.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 21:16:55 -08:00
Jakub Kicinski
79ffe6087e net/tls: add a TX lock
TLS TX needs to release and re-acquire the socket lock if send buffer
fills up.

TLS SW TX path currently depends on only allowing one thread to enter
the function by the abuse of sk_write_pending. If another writer is
already waiting for memory no new ones are allowed in.

This has two problems:
 - writers don't wake other threads up when they leave the kernel;
   meaning that this scheme works for single extra thread (second
   application thread or delayed work) because memory becoming
   available will send a wake up request, but as Mallesham and
   Pooja report with larger number of threads it leads to threads
   being put to sleep indefinitely;
 - the delayed work does not get _scheduled_ but it may _run_ when
   other writers are present leading to crashes as writers don't
   expect state to change under their feet (same records get pushed
   and freed multiple times); it's hard to reliably bail from the
   work, however, because the mere presence of a writer does not
   guarantee that the writer will push pending records before exiting.

Ensuring wakeups always happen will make the code basically open
code a mutex. Just use a mutex.

The TLS HW TX path does not have any locking (not even the
sk_write_pending hack), yet it uses a per-socket sg_tx_data
array to push records.

Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Reported-by: Mallesham  Jatharakonda <mallesh537@gmail.com>
Reported-by: Pooja Trivedi <poojatrivedi@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 17:33:32 -08:00
Eric Dumazet
099ecf59f0 net: annotate lockless accesses to sk->sk_max_ack_backlog
sk->sk_max_ack_backlog can be read without any lock being held
at least in TCP/DCCP cases.

We need to use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing
and/or potential KCSAN warnings.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 16:14:48 -08:00
Eric Dumazet
288efe8606 net: annotate lockless accesses to sk->sk_ack_backlog
sk->sk_ack_backlog can be read without any lock being held.
We need to use READ_ONCE()/WRITE_ONCE() to avoid load/store tearing
and/or potential KCSAN warnings.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 16:14:48 -08:00
Eric Dumazet
25c7a6d1f9 net: avoid potential false sharing in neighbor related code
There are common instances of the following construct :

	if (n->confirmed != now)
		n->confirmed = now;

A C compiler could legally remove the conditional.

Use READ_ONCE()/WRITE_ONCE() to avoid this problem.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-06 16:14:48 -08:00
Eric Dumazet
f75359f3ac net: prevent load/store tearing on sk->sk_stamp
Add a couple of READ_ONCE() and WRITE_ONCE() to prevent
load-tearing and store-tearing in sock_read_timestamp()
and sock_write_timestamp()

This might prevent another KCSAN report.

Fixes: 3a0ed3e961 ("sock: Make sock->sk_stamp thread-safe")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 18:22:30 -08:00
Eric Dumazet
d0083d98f6 net_sched: extend packet counter to 64bit
After this change, qdisc packet counter is no longer
a 32bit quantity. We still export 32bit values to user.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 18:20:55 -08:00
Eric Dumazet
4d390c287b net_sched: do not export gnet_stats_basic_packed to uapi
gnet_stats_basic_packed was really meant to be private kernel structure.

If this proves to be a problem, we will have to rename the in-kernel
version.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 18:20:55 -08:00
Andrew Lunn
5cd73fbd78 net: dsa: Add support for devlink resources
Add wrappers around the devlink resource API, so that DSA drivers can
register and unregister devlink resources.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 18:09:45 -08:00
John Hurley
59eb87cb52 net: sched: prevent duplicate flower rules from tcf_proto destroy race
When a new filter is added to cls_api, the function
tcf_chain_tp_insert_unique() looks up the protocol/priority/chain to
determine if the tcf_proto is duplicated in the chain's hashtable. It then
creates a new entry or continues with an existing one. In cls_flower, this
allows the function fl_ht_insert_unque to determine if a filter is a
duplicate and reject appropriately, meaning that the duplicate will not be
passed to drivers via the offload hooks. However, when a tcf_proto is
destroyed it is removed from its chain before a hardware remove hook is
hit. This can lead to a race whereby the driver has not received the
remove message but duplicate flows can be accepted. This, in turn, can
lead to the offload driver receiving incorrect duplicate flows and out of
order add/delete messages.

Prevent duplicates by utilising an approach suggested by Vlad Buslov. A
hash table per block stores each unique chain/protocol/prio being
destroyed. This entry is only removed when the full destroy (and hardware
offload) has completed. If a new flow is being added with the same
identiers as a tc_proto being detroyed, then the add request is replayed
until the destroy is complete.

Fixes: 8b64678e0a ("net: sched: refactor tp insert/delete for concurrent execution")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reported-by: Louis Peens <louis.peens@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 17:47:26 -08:00
Jay Vosburgh
1899bb3251 bonding: fix state transition issue in link monitoring
Since de77ecd4ef ("bonding: improve link-status update in
mii-monitoring"), the bonding driver has utilized two separate variables
to indicate the next link state a particular slave should transition to.
Each is used to communicate to a different portion of the link state
change commit logic; one to the bond_miimon_commit function itself, and
another to the state transition logic.

	Unfortunately, the two variables can become unsynchronized,
resulting in incorrect link state transitions within bonding.  This can
cause slaves to become stuck in an incorrect link state until a
subsequent carrier state transition.

	The issue occurs when a special case in bond_slave_netdev_event
sets slave->link directly to BOND_LINK_FAIL.  On the next pass through
bond_miimon_inspect after the slave goes carrier up, the BOND_LINK_FAIL
case will set the proposed next state (link_new_state) to BOND_LINK_UP,
but the new_link to BOND_LINK_DOWN.  The setting of the final link state
from new_link comes after that from link_new_state, and so the slave
will end up incorrectly in _DOWN state.

	Resolve this by combining the two variables into one.

Reported-by: Aleksei Zakharov <zakharov.a.g@yandex.ru>
Reported-by: Sha Zhang <zhangsha.zhang@huawei.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Fixes: de77ecd4ef ("bonding: improve link-status update in mii-monitoring")
Signed-off-by: Jay Vosburgh <jay.vosburgh@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-05 17:40:16 -08:00
Lukas Wunner
250367c59e netfilter: nf_tables: Align nft_expr private data to 64-bit
Invoking the following commands on a 32-bit architecture with strict
alignment requirements (such as an ARMv7-based Raspberry Pi) results
in an alignment exception:

 # nft add table ip test-ip4
 # nft add chain ip test-ip4 output { type filter hook output priority 0; }
 # nft add rule  ip test-ip4 output quota 1025 bytes

Alignment trap: not handling instruction e1b26f9f at [<7f4473f8>]
Unhandled fault: alignment exception (0x001) at 0xb832e824
Internal error: : 1 [#1] PREEMPT SMP ARM
Hardware name: BCM2835
[<7f4473fc>] (nft_quota_do_init [nft_quota])
[<7f447448>] (nft_quota_init [nft_quota])
[<7f4260d0>] (nf_tables_newrule [nf_tables])
[<7f4168dc>] (nfnetlink_rcv_batch [nfnetlink])
[<7f416bd0>] (nfnetlink_rcv [nfnetlink])
[<8078b334>] (netlink_unicast)
[<8078b664>] (netlink_sendmsg)
[<8071b47c>] (sock_sendmsg)
[<8071bd18>] (___sys_sendmsg)
[<8071ce3c>] (__sys_sendmsg)
[<8071ce94>] (sys_sendmsg)

The reason is that nft_quota_do_init() calls atomic64_set() on an
atomic64_t which is only aligned to 32-bit, not 64-bit, because it
succeeds struct nft_expr in memory which only contains a 32-bit pointer.
Fix by aligning the nft_expr private data to 64-bit.

Fixes: 96518518cc ("netfilter: add nftables")
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: stable@vger.kernel.org # v3.13+
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-11-04 20:58:32 +01:00
David S. Miller
ae8a76fb8b Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-11-02

The following pull-request contains BPF updates for your *net-next* tree.

We've added 30 non-merge commits during the last 7 day(s) which contain
a total of 41 files changed, 1864 insertions(+), 474 deletions(-).

The main changes are:

1) Fix long standing user vs kernel access issue by introducing
   bpf_probe_read_user() and bpf_probe_read_kernel() helpers, from Daniel.

2) Accelerated xskmap lookup, from Björn and Maciej.

3) Support for automatic map pinning in libbpf, from Toke.

4) Cleanup of BTF-enabled raw tracepoints, from Alexei.

5) Various fixes to libbpf and selftests.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02 15:29:58 -07:00
David S. Miller
d31e95585c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
The only slightly tricky merge conflict was the netdevsim because the
mutex locking fix overlapped a lot of driver reload reorganization.

The rest were (relatively) trivial in nature.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-11-02 13:54:56 -07:00
Björn Töpel
d817991cc7 xsk: Restructure/inline XSKMAP lookup/redirect/flush
In this commit the XSKMAP entry lookup function used by the XDP
redirect code is moved from the xskmap.c file to the xdp_sock.h
header, so the lookup can be inlined from, e.g., the
bpf_xdp_redirect_map() function.

Further the __xsk_map_redirect() and __xsk_map_flush() is moved to the
xsk.c, which lets the compiler inline the xsk_rcv() and xsk_flush()
functions.

Finally, all the XDP socket functions were moved from linux/bpf.h to
net/xdp_sock.h, where most of the XDP sockets functions are anyway.

This yields a ~2% performance boost for the xdpsock "rx_drop"
scenario.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Link: https://lore.kernel.org/bpf/20191101110346.15004-4-bjorn.topel@gmail.com
2019-11-02 00:38:49 +01:00
Vivien Didelot
9c8ad1ab66 net: dsa: remove the dst->ds array
Now that the DSA ports are listed in the switch fabric, there is
no need to store the dsa_switch structures from the drivers in the
fabric anymore. So get rid of the dst->ds static array.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31 14:26:38 -07:00
Vivien Didelot
96252b8e05 net: dsa: remove ds->rtable
Drivers do not use the ds->rtable static arrays anymore, get rid of it.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31 14:26:38 -07:00
Vivien Didelot
c5f51765a1 net: dsa: list DSA links in the fabric
Implement a new list of DSA links in the switch fabric itself, to
provide an alterative to the ds->rtable static arrays.

At the same time, provide a new dsa_routing_port() helper to abstract
the usage of ds->rtable in drivers. If there's no port to reach a
given device, return the first invalid port, ds->num_ports. This avoids
potential signedness errors or the need to define special values.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-31 14:26:38 -07:00
Vlad Buslov
e382267860 net: sched: update action implementations to support flags
Extend struct tc_action with new "tcfa_flags" field. Set the field in
tcf_idr_create() function and provide new helper
tcf_idr_create_from_flags() that derives 'cpustats' boolean from flags
value. Update individual hardware-offloaded actions init() to pass their
"flags" argument to new helper in order to skip percpu stats allocation
when user requested it through flags.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 18:07:51 -07:00
Vlad Buslov
abbb0d3363 net: sched: extend TCA_ACT space with TCA_ACT_FLAGS
Extend TCA_ACT space with nla_bitfield32 flags. Add
TCA_ACT_FLAGS_NO_PERCPU_STATS as the only allowed flag. Parse the flags in
tcf_action_init_1() and pass resulting value as additional argument to
a_o->init().

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 18:07:50 -07:00
Vlad Buslov
5e174d5e73 net: sched: modify stats helper functions to support regular stats
Modify stats update helper functions introduced in previous patches in this
series to fallback to regular tc_action->tcfa_{b|q}stats if cpu stats are
not allocated for the action argument. If regular non-percpu allocated
counters are in use, then obtain action tcfa_lock while modifying them.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 18:07:50 -07:00
Vlad Buslov
ef816f3c49 net: sched: don't expose action qstats to skb_tc_reinsert()
Previous commit introduced helper function for updating qstats and
refactored set of actions to use the helpers, instead of modifying qstats
directly. However, one of the affected action exposes its qstats to
skb_tc_reinsert(), which then modifies it.

Refactor skb_tc_reinsert() to return integer error code and don't increment
overlimit qstats in case of error, and use the returned error code in
tcf_mirred_act() to manually increment the overlimit counter with new
helper function.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 18:07:50 -07:00
Vlad Buslov
26b537a88c net: sched: extract qstats update code into functions
Extract common code that increments cpu_qstats counters into standalone act
API functions. Change hardware offloaded actions that use percpu counter
allocation to use the new functions instead of accessing cpu_qstats
directly.

This commit doesn't change functionality.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 18:07:50 -07:00
Vlad Buslov
5e1ad95b63 net: sched: extract bstats update code into function
Extract common code that increments cpu_bstats counter into standalone act
API function. Change hardware offloaded actions that use percpu counter
allocation to use the new function instead of incrementing cpu_bstats
directly.

This commit doesn't change functionality.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 18:07:50 -07:00
Vlad Buslov
c8ecebd04c net: sched: extract common action counters update code into function
Currently, all implementations of tc_action_ops->stats_update() callback
have almost exactly the same implementation of counters update
code (besides gact which also updates drop counter). In order to simplify
support for using both percpu-allocated and regular action counters
depending on run-time flag in following patches, extract action counters
update code into standalone function in act API.

This commit doesn't change functionality.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 18:07:50 -07:00
Eric Dumazet
ee8d153d46 net: annotate lockless accesses to sk->sk_napi_id
We already annotated most accesses to sk->sk_napi_id

We missed sk_mark_napi_id() and sk_mark_napi_id_once()
which might be called without socket lock held in UDP stack.

KCSAN reported :
BUG: KCSAN: data-race in udpv6_queue_rcv_one_skb / udpv6_queue_rcv_one_skb

write to 0xffff888121c6d108 of 4 bytes by interrupt on cpu 0:
 sk_mark_napi_id include/net/busy_poll.h:125 [inline]
 __udpv6_queue_rcv_skb net/ipv6/udp.c:571 [inline]
 udpv6_queue_rcv_one_skb+0x70c/0xb40 net/ipv6/udp.c:672
 udpv6_queue_rcv_skb+0xb5/0x400 net/ipv6/udp.c:689
 udp6_unicast_rcv_skb.isra.0+0xd7/0x180 net/ipv6/udp.c:832
 __udp6_lib_rcv+0x69c/0x1770 net/ipv6/udp.c:913
 udpv6_rcv+0x2b/0x40 net/ipv6/udp.c:1015
 ip6_protocol_deliver_rcu+0x22a/0xbe0 net/ipv6/ip6_input.c:409
 ip6_input_finish+0x30/0x50 net/ipv6/ip6_input.c:450
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip6_input+0x177/0x190 net/ipv6/ip6_input.c:459
 dst_input include/net/dst.h:442 [inline]
 ip6_rcv_finish+0x110/0x140 net/ipv6/ip6_input.c:76
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ipv6_rcv+0x1a1/0x1b0 net/ipv6/ip6_input.c:284
 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
 process_backlog+0x1d3/0x420 net/core/dev.c:5955
 napi_poll net/core/dev.c:6392 [inline]
 net_rx_action+0x3ae/0xa90 net/core/dev.c:6460

write to 0xffff888121c6d108 of 4 bytes by interrupt on cpu 1:
 sk_mark_napi_id include/net/busy_poll.h:125 [inline]
 __udpv6_queue_rcv_skb net/ipv6/udp.c:571 [inline]
 udpv6_queue_rcv_one_skb+0x70c/0xb40 net/ipv6/udp.c:672
 udpv6_queue_rcv_skb+0xb5/0x400 net/ipv6/udp.c:689
 udp6_unicast_rcv_skb.isra.0+0xd7/0x180 net/ipv6/udp.c:832
 __udp6_lib_rcv+0x69c/0x1770 net/ipv6/udp.c:913
 udpv6_rcv+0x2b/0x40 net/ipv6/udp.c:1015
 ip6_protocol_deliver_rcu+0x22a/0xbe0 net/ipv6/ip6_input.c:409
 ip6_input_finish+0x30/0x50 net/ipv6/ip6_input.c:450
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip6_input+0x177/0x190 net/ipv6/ip6_input.c:459
 dst_input include/net/dst.h:442 [inline]
 ip6_rcv_finish+0x110/0x140 net/ipv6/ip6_input.c:76
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ipv6_rcv+0x1a1/0x1b0 net/ipv6/ip6_input.c:284
 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
 process_backlog+0x1d3/0x420 net/core/dev.c:5955

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 10890 Comm: syz-executor.0 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Fixes: e68b6e50fa ("udp: enable busy polling for all sockets")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 17:34:35 -07:00
Matteo Croce
5dec597e5c flow_dissector: extract more ICMP information
The ICMP flow dissector currently parses only the Type and Code fields.
Some ICMP packets (echo, timestamp) have a 16 bit Identifier field which
is used to correlate packets.
Add such field in flow_dissector_key_icmp and replace skb_flow_get_be16()
with a more complex function which populate this field.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 17:21:35 -07:00
Matteo Croce
98298e6ca6 flow_dissector: add meaningful comments
Documents two piece of code which can't be understood at a glance.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 17:21:35 -07:00
Eric Dumazet
7170a97774 net: annotate accesses to sk->sk_incoming_cpu
This socket field can be read and written by concurrent cpus.

Use READ_ONCE() and WRITE_ONCE() annotations to document this,
and avoid some compiler 'optimizations'.

KCSAN reported :

BUG: KCSAN: data-race in tcp_v4_rcv / tcp_v4_rcv

write to 0xffff88812220763c of 4 bytes by interrupt on cpu 0:
 sk_incoming_cpu_update include/net/sock.h:953 [inline]
 tcp_v4_rcv+0x1b3c/0x1bb0 net/ipv4/tcp_ipv4.c:1934
 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:442 [inline]
 ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
 process_backlog+0x1d3/0x420 net/core/dev.c:5955
 napi_poll net/core/dev.c:6392 [inline]
 net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
 __do_softirq+0x115/0x33f kernel/softirq.c:292
 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082
 do_softirq.part.0+0x6b/0x80 kernel/softirq.c:337
 do_softirq kernel/softirq.c:329 [inline]
 __local_bh_enable_ip+0x76/0x80 kernel/softirq.c:189

read to 0xffff88812220763c of 4 bytes by interrupt on cpu 1:
 sk_incoming_cpu_update include/net/sock.h:952 [inline]
 tcp_v4_rcv+0x181a/0x1bb0 net/ipv4/tcp_ipv4.c:1934
 ip_protocol_deliver_rcu+0x4d/0x420 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:442 [inline]
 ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5010
 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5124
 process_backlog+0x1d3/0x420 net/core/dev.c:5955
 napi_poll net/core/dev.c:6392 [inline]
 net_rx_action+0x3ae/0xa90 net/core/dev.c:6460
 __do_softirq+0x115/0x33f kernel/softirq.c:292
 run_ksoftirqd+0x46/0x60 kernel/softirq.c:603
 smpboot_thread_fn+0x37d/0x4a0 kernel/smpboot.c:165

Reported by Kernel Concurrency Sanitizer on:
CPU: 1 PID: 16 Comm: ksoftirqd/1 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-30 13:24:25 -07:00
Ursula Braun
8466a57dfb net/smc: remove unneeded include for smc.h
The only smc-related reference in net/sock.h is struct smc_hashinfo.
But just its address is refered to. Thus there is no need for the
include of net/smc.h. Remove it.

Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 18:11:15 -07:00
Vivien Didelot
d607525bd9 net: dsa: return directly from dsa_to_port
Return directly from within the loop as soon as the port is found,
otherwise we won't return NULL if the end of the list is reached.

Fixes: b96ddf254b ("net: dsa: use ports list in dsa_to_port")
Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-29 12:07:49 -07:00
Andrew Lunn
6b29752423 net: dsa: Add support for devlink device parameters
Add plumbing to allow DSA drivers to register parameters with devlink.

To keep with the abstraction, the DSA drivers pass the ds structure to
these helpers, and the DSA core then translates that to the devlink
structure associated to the device.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28 16:21:02 -07:00
Tejun Heo
20eb4f29b6 net: fix sk_page_frag() recursion from memory reclaim
sk_page_frag() optimizes skb_frag allocations by using per-task
skb_frag cache when it knows it's the only user.  The condition is
determined by seeing whether the socket allocation mask allows
blocking - if the allocation may block, it obviously owns the task's
context and ergo exclusively owns current->task_frag.

Unfortunately, this misses recursion through memory reclaim path.
Please take a look at the following backtrace.

 [2] RIP: 0010:tcp_sendmsg_locked+0xccf/0xe10
     ...
     tcp_sendmsg+0x27/0x40
     sock_sendmsg+0x30/0x40
     sock_xmit.isra.24+0xa1/0x170 [nbd]
     nbd_send_cmd+0x1d2/0x690 [nbd]
     nbd_queue_rq+0x1b5/0x3b0 [nbd]
     __blk_mq_try_issue_directly+0x108/0x1b0
     blk_mq_request_issue_directly+0xbd/0xe0
     blk_mq_try_issue_list_directly+0x41/0xb0
     blk_mq_sched_insert_requests+0xa2/0xe0
     blk_mq_flush_plug_list+0x205/0x2a0
     blk_flush_plug_list+0xc3/0xf0
 [1] blk_finish_plug+0x21/0x2e
     _xfs_buf_ioapply+0x313/0x460
     __xfs_buf_submit+0x67/0x220
     xfs_buf_read_map+0x113/0x1a0
     xfs_trans_read_buf_map+0xbf/0x330
     xfs_btree_read_buf_block.constprop.42+0x95/0xd0
     xfs_btree_lookup_get_block+0x95/0x170
     xfs_btree_lookup+0xcc/0x470
     xfs_bmap_del_extent_real+0x254/0x9a0
     __xfs_bunmapi+0x45c/0xab0
     xfs_bunmapi+0x15/0x30
     xfs_itruncate_extents_flags+0xca/0x250
     xfs_free_eofblocks+0x181/0x1e0
     xfs_fs_destroy_inode+0xa8/0x1b0
     destroy_inode+0x38/0x70
     dispose_list+0x35/0x50
     prune_icache_sb+0x52/0x70
     super_cache_scan+0x120/0x1a0
     do_shrink_slab+0x120/0x290
     shrink_slab+0x216/0x2b0
     shrink_node+0x1b6/0x4a0
     do_try_to_free_pages+0xc6/0x370
     try_to_free_mem_cgroup_pages+0xe3/0x1e0
     try_charge+0x29e/0x790
     mem_cgroup_charge_skmem+0x6a/0x100
     __sk_mem_raise_allocated+0x18e/0x390
     __sk_mem_schedule+0x2a/0x40
 [0] tcp_sendmsg_locked+0x8eb/0xe10
     tcp_sendmsg+0x27/0x40
     sock_sendmsg+0x30/0x40
     ___sys_sendmsg+0x26d/0x2b0
     __sys_sendmsg+0x57/0xa0
     do_syscall_64+0x42/0x100
     entry_SYSCALL_64_after_hwframe+0x44/0xa9

In [0], tcp_send_msg_locked() was using current->page_frag when it
called sk_wmem_schedule().  It already calculated how many bytes can
be fit into current->page_frag.  Due to memory pressure,
sk_wmem_schedule() called into memory reclaim path which called into
xfs and then IO issue path.  Because the filesystem in question is
backed by nbd, the control goes back into the tcp layer - back into
tcp_sendmsg_locked().

nbd sets sk_allocation to (GFP_NOIO | __GFP_MEMALLOC) which makes
sense - it's in the process of freeing memory and wants to be able to,
e.g., drop clean pages to make forward progress.  However, this
confused sk_page_frag() called from [2].  Because it only tests
whether the allocation allows blocking which it does, it now thinks
current->page_frag can be used again although it already was being
used in [0].

After [2] used current->page_frag, the offset would be increased by
the used amount.  When the control returns to [0],
current->page_frag's offset is increased and the previously calculated
number of bytes now may overrun the end of allocated memory leading to
silent memory corruptions.

Fix it by adding gfpflags_normal_context() which tests sleepable &&
!reclaim and use it to determine whether to use current->task_frag.

v2: Eric didn't like gfp flags being tested twice.  Introduce a new
    helper gfpflags_normal_context() and combine the two tests.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28 16:17:31 -07:00
Geert Uytterhoeven
e1b185491f net: Fix various misspellings of "connect"
Fix misspellings of "disconnect", "disconnecting", "connections", and
"disconnected".

Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Acked-by: Kalle Valo <kvalo@codeaurora.org>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-28 13:41:59 -07:00
Joe Perches
3f2aef10ff mac80211: fix a typo of "function"
Signed-off-by: Joe Perches <joe@perches.com>
Link: https://lore.kernel.org/r/4d53be6c963542878d370ff1a6dc7c3a89b28d23.camel@perches.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-10-28 13:14:37 +01:00
Chris Packham
ae4a50ee31 mac80211: typo fixes in kerneldoc comments
Correct some trivial typos in kerneldoc comments.

Signed-off-by: Chris Packham <chris.packham@alliedtelesis.co.nz>
Link: https://lore.kernel.org/r/20191024213647.5507-1-chris.packham@alliedtelesis.co.nz
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-10-28 13:13:35 +01:00
David S. Miller
fc11078dd3 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for net:

1) Fix crash on flowtable due to race between garbage collection
   and insertion.

2) Restore callback unbinding in netfilter offloads.

3) Fix races on IPVS module removal, from Davide Caratti.

4) Make old_secure_tcp per-netns to fix sysbot report,
   from Eric Dumazet.

5) Validate matching length in netfilter offloads, from wenxu.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-27 12:13:16 -07:00
David S. Miller
4b1f5ddaff Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next,
more specifically:

* Updates for ipset:

1) Coding style fix for ipset comment extension, from Jeremy Sowden.

2) De-inline many functions in ipset, from Jeremy Sowden.

3) Move ipset function definition from header to source file.

4) Move ip_set_put_flags() to source, export it as a symbol, remove
   inline.

5) Move range_to_mask() to the source file where this is used.

6) Move ip_set_get_ip_port() to the source file where this is used.

* IPVS selftests and netns improvements:

7) Two patches to speedup ipvs netns dismantle, from Haishuang Yan.

8) Three patches to add selftest script for ipvs, also from
   Haishuang Yan.

* Conntrack updates and new nf_hook_slow_list() function:

9) Document ct ecache extension, from Florian Westphal.

10) Skip ct extensions from ctnetlink dump, from Florian.

11) Free ct extension immediately, from Florian.

12) Skip access to ecache extension from nf_ct_deliver_cached_events()
    this is not correct as reported by Syzbot.

13) Add and use nf_hook_slow_list(), from Florian.

* Flowtable infrastructure updates:

14) Move priority to nf_flowtable definition.

15) Dynamic allocation of per-device hooks in flowtables.

16) Allow to include netdevice only once in flowtable definitions.

17) Rise maximum number of devices per flowtable.

* Netfilter hardware offload infrastructure updates:

18) Add nft_flow_block_chain() helper function.

19) Pass callback list to nft_setup_cb_call().

20) Add nft_flow_cls_offload_setup() helper function.

21) Remove rules for the unregistered device via netdevice event.

22) Support for multiple devices in a basechain definition at the
    ingress hook.

22) Add nft_chain_offload_cmd() helper function.

23) Add nft_flow_block_offload_init() helper function.

24) Rewind in case of failing to bind multiple devices to hook.

25) Typo in IPv6 tproxy module description, from Norman Rasmussen.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-26 11:35:43 -07:00
Guillaume Nault
d4e4fdf9e4 netns: fix GFP flags in rtnl_net_notifyid()
In rtnl_net_notifyid(), we certainly can't pass a null GFP flag to
rtnl_notify(). A GFP_KERNEL flag would be fine in most circumstances,
but there are a few paths calling rtnl_net_notifyid() from atomic
context or from RCU critical sections. The later also precludes the use
of gfp_any() as it wouldn't detect the RCU case. Also, the nlmsg_new()
call is wrong too, as it uses GFP_KERNEL unconditionally.

Therefore, we need to pass the GFP flags as parameter and propagate it
through function calls until the proper flags can be determined.

In most cases, GFP_KERNEL is fine. The exceptions are:
  * openvswitch: ovs_vport_cmd_get() and ovs_vport_cmd_dump()
    indirectly call rtnl_net_notifyid() from RCU critical section,

  * rtnetlink: rtmsg_ifinfo_build_skb() already receives GFP flags as
    parameter.

Also, in ovs_vport_cmd_build_info(), let's change the GFP flags used
by nlmsg_new(). The function is allowed to sleep, so better make the
flags consistent with the ones used in the following
ovs_vport_cmd_fill_info() call.

Found by code inspection.

Fixes: 9a9634545c ("netns: notify netns id events")
Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-25 20:14:42 -07:00
Ben Dooks (Codethink)
91e2e57636 net: hwbm: if CONFIG_NET_HWBM unset, make stub functions static
If CONFIG_NET_HWBM is not set, then these stub functions in
<net/hwbm.h> should be declared static to avoid trying to
export them from any driver that includes this.

Fixes the following sparse warnings:

./include/net/hwbm.h:24:6: warning: symbol 'hwbm_buf_free' was not declared. Should it be static?
./include/net/hwbm.h:25:5: warning: symbol 'hwbm_pool_refill' was not declared. Should it be static?
./include/net/hwbm.h:26:5: warning: symbol 'hwbm_pool_add' was not declared. Should it be static?

Signed-off-by: Ben Dooks (Codethink) <ben.dooks@codethink.co.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-25 16:24:32 -07:00
Taehee Yoo
f3b0a18bb6 net: remove unnecessary variables and callback
This patch removes variables and callback these are related to the nested
device structure.
devices that can be nested have their own nest_level variable that
represents the depth of nested devices.
In the previous patch, new {lower/upper}_level variables are added and
they replace old private nest_level variable.
So, this patch removes all 'nest_level' variables.

In order to avoid lockdep warning, ->ndo_get_lock_subclass() was added
to get lockdep subclass value, which is actually lower nested depth value.
But now, they use the dynamic lockdep key to avoid lockdep warning instead
of the subclass.
So, this patch removes ->ndo_get_lock_subclass() callback.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-24 14:53:49 -07:00
Taehee Yoo
0ce1822c2a vxlan: add adjacent link to limit depth level
Current vxlan code doesn't limit the number of nested devices.
Nested devices would be handled recursively and this routine needs
huge stack memory. So, unlimited nested devices could make
stack overflow.

In order to fix this issue, this patch adds adjacent links.
The adjacent link APIs internally check the depth level.

Test commands:
    ip link add dummy0 type dummy
    ip link add vxlan0 type vxlan id 0 group 239.1.1.1 dev dummy0 \
	    dstport 4789
    for i in {1..100}
    do
	    let A=$i-1
	    ip link add vxlan$i type vxlan id $i group 239.1.1.1 \
		    dev vxlan$A dstport 4789
    done
    ip link del dummy0

The top upper link is vxlan100 and the lowest link is vxlan0.
When vxlan0 is deleting, the upper devices will be deleted recursively.
It needs huge stack memory so it makes stack overflow.

Splat looks like:
[  229.628477] =============================================================================
[  229.629785] BUG page->ptl (Not tainted): Padding overwritten. 0x0000000026abf214-0x0000000091f6abb2
[  229.629785] -----------------------------------------------------------------------------
[  229.629785]
[  229.655439] ==================================================================
[  229.629785] INFO: Slab 0x00000000ff7cfda8 objects=19 used=19 fp=0x00000000fe33776c flags=0x200000000010200
[  229.655688] BUG: KASAN: stack-out-of-bounds in unmap_single_vma+0x25a/0x2e0
[  229.655688] Read of size 8 at addr ffff888113076928 by task vlan-network-in/2334
[  229.655688]
[  229.629785] Padding 0000000026abf214: 00 80 14 0d 81 88 ff ff 68 91 81 14 81 88 ff ff  ........h.......
[  229.629785] Padding 0000000001e24790: 38 91 81 14 81 88 ff ff 68 91 81 14 81 88 ff ff  8.......h.......
[  229.629785] Padding 00000000b39397c8: 33 30 62 a7 ff ff ff ff ff eb 60 22 10 f1 ff 1f  30b.......`"....
[  229.629785] Padding 00000000bc98f53a: 80 60 07 13 81 88 ff ff 00 80 14 0d 81 88 ff ff  .`..............
[  229.629785] Padding 000000002aa8123d: 68 91 81 14 81 88 ff ff f7 21 17 a7 ff ff ff ff  h........!......
[  229.629785] Padding 000000001c8c2369: 08 81 14 0d 81 88 ff ff 03 02 00 00 00 00 00 00  ................
[  229.629785] Padding 000000004e290c5d: 21 90 a2 21 10 ed ff ff 00 00 00 00 00 fc ff df  !..!............
[  229.629785] Padding 000000000e25d731: 18 60 07 13 81 88 ff ff c0 8b 13 05 81 88 ff ff  .`..............
[  229.629785] Padding 000000007adc7ab3: b3 8a b5 41 00 00 00 00                          ...A....
[  229.629785] FIX page->ptl: Restoring 0x0000000026abf214-0x0000000091f6abb2=0x5a
[  ... ]

Fixes: acaf4e7099 ("net: vxlan: when lower dev unregisters remove vxlan dev as well")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-24 14:53:49 -07:00
Taehee Yoo
089bca2cae bonding: use dynamic lockdep key instead of subclass
All bonding device has same lockdep key and subclass is initialized with
nest_level.
But actual nest_level value can be changed when a lower device is attached.
And at this moment, the subclass should be updated but it seems to be
unsafe.
So this patch makes bonding use dynamic lockdep key instead of the
subclass.

Test commands:
    ip link add bond0 type bond

    for i in {1..5}
    do
	    let A=$i-1
	    ip link add bond$i type bond
	    ip link set bond$i master bond$A
    done
    ip link set bond5 master bond0

Splat looks like:
[  307.992912] WARNING: possible recursive locking detected
[  307.993656] 5.4.0-rc3+ #96 Tainted: G        W
[  307.994367] --------------------------------------------
[  307.995092] ip/761 is trying to acquire lock:
[  307.995710] ffff8880513aac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[  307.997045]
	       but task is already holding lock:
[  307.997923] ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[  307.999215]
	       other info that might help us debug this:
[  308.000251]  Possible unsafe locking scenario:

[  308.001137]        CPU0
[  308.001533]        ----
[  308.001915]   lock(&(&bond->stats_lock)->rlock#2/2);
[  308.002609]   lock(&(&bond->stats_lock)->rlock#2/2);
[  308.003302]
		*** DEADLOCK ***

[  308.004310]  May be due to missing lock nesting notation

[  308.005319] 3 locks held by ip/761:
[  308.005830]  #0: ffffffff9fcc42b0 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x466/0x8a0
[  308.006894]  #1: ffff88805fcbac60 (&(&bond->stats_lock)->rlock#2/2){+.+.}, at: bond_get_stats+0xb8/0x500 [bonding]
[  308.008243]  #2: ffffffff9f9219c0 (rcu_read_lock){....}, at: bond_get_stats+0x9f/0x500 [bonding]
[  308.009422]
	       stack backtrace:
[  308.010124] CPU: 0 PID: 761 Comm: ip Tainted: G        W         5.4.0-rc3+ #96
[  308.011097] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  308.012179] Call Trace:
[  308.012601]  dump_stack+0x7c/0xbb
[  308.013089]  __lock_acquire+0x269d/0x3de0
[  308.013669]  ? register_lock_class+0x14d0/0x14d0
[  308.014318]  lock_acquire+0x164/0x3b0
[  308.014858]  ? bond_get_stats+0xb8/0x500 [bonding]
[  308.015520]  _raw_spin_lock_nested+0x2e/0x60
[  308.016129]  ? bond_get_stats+0xb8/0x500 [bonding]
[  308.017215]  bond_get_stats+0xb8/0x500 [bonding]
[  308.018454]  ? bond_arp_rcv+0xf10/0xf10 [bonding]
[  308.019710]  ? rcu_read_lock_held+0x90/0xa0
[  308.020605]  ? rcu_read_lock_sched_held+0xc0/0xc0
[  308.021286]  ? bond_get_stats+0x9f/0x500 [bonding]
[  308.021953]  dev_get_stats+0x1ec/0x270
[  308.022508]  bond_get_stats+0x1d1/0x500 [bonding]

Fixes: d3fff6c443 ("net: add netdev_lockdep_set_classes() helper")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-24 14:53:48 -07:00
Eric Dumazet
c24b75e0f9 ipvs: move old_secure_tcp into struct netns_ipvs
syzbot reported the following issue :

BUG: KCSAN: data-race in update_defense_level / update_defense_level

read to 0xffffffff861a6260 of 4 bytes by task 3006 on cpu 1:
 update_defense_level+0x621/0xb30 net/netfilter/ipvs/ip_vs_ctl.c:177
 defense_work_handler+0x3d/0xd0 net/netfilter/ipvs/ip_vs_ctl.c:225
 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
 worker_thread+0xa0/0x800 kernel/workqueue.c:2415
 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

write to 0xffffffff861a6260 of 4 bytes by task 7333 on cpu 0:
 update_defense_level+0xa62/0xb30 net/netfilter/ipvs/ip_vs_ctl.c:205
 defense_work_handler+0x3d/0xd0 net/netfilter/ipvs/ip_vs_ctl.c:225
 process_one_work+0x3d4/0x890 kernel/workqueue.c:2269
 worker_thread+0xa0/0x800 kernel/workqueue.c:2415
 kthread+0x1d4/0x200 drivers/block/aoe/aoecmd.c:1253
 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:352

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 7333 Comm: kworker/0:5 Not tainted 5.4.0-rc3+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events defense_work_handler

Indeed, old_secure_tcp is currently a static variable, while it
needs to be a per netns variable.

Fixes: a0840e2e16 ("IPVS: netns, ip_vs_ctl local vars moved to ipvs struct.")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Simon Horman <horms@verge.net.au>
2019-10-24 11:56:02 +02:00
Eric Dumazet
55667441c8 net/flow_dissector: switch to siphash
UDP IPv6 packets auto flowlabels are using a 32bit secret
(static u32 hashrnd in net/core/flow_dissector.c) and
apply jhash() over fields known by the receivers.

Attackers can easily infer the 32bit secret and use this information
to identify a device and/or user, since this 32bit secret is only
set at boot time.

Really, using jhash() to generate cookies sent on the wire
is a serious security concern.

Trying to change the rol32(hash, 16) in ip6_make_flowlabel() would be
a dead end. Trying to periodically change the secret (like in sch_sfq.c)
could change paths taken in the network for long lived flows.

Let's switch to siphash, as we did in commit df453700e8
("inet: switch IP ID generator to siphash")

Using a cryptographically strong pseudo random function will solve this
privacy issue and more generally remove other weak points in the stack.

Packet schedulers using skb_get_hash_perturb() benefit from this change.

Fixes: b56774163f ("ipv6: Enable auto flow labels by default")
Fixes: 42240901f7 ("ipv6: Implement different admin modes for automatic flow labels")
Fixes: 67800f9b1f ("ipv6: Call skb_get_hash_flowi6 to get skb->hash in ip6_make_flowlabel")
Fixes: cb1ce2ef38 ("ipv6: Implement automatic flow label generation on transmit")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Berger <jonathann1@walla.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reported-by: Benny Pinkas <benny@pinkas.net>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-23 20:13:22 -07:00
Pablo Neira Ayuso
d54725cd11 netfilter: nf_tables: support for multiple devices per netdev hook
This patch allows you to register one netdev basechain to multiple
devices. This adds a new NFTA_HOOK_DEVS netlink attribute to specify
the list of netdevices. Basechains store a list of hooks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-23 13:01:34 +02:00
Pablo Neira Ayuso
cb662ac671 netfilter: nf_tables: increase maximum devices number per flowtable
Rise the maximum limit of devices per flowtable up to 256. Rename
NFT_FLOWTABLE_DEVICE_MAX to NFT_NETDEVICE_MAX in preparation to reuse
the netdev hook parser for ingress basechain.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-23 13:01:26 +02:00
Pablo Neira Ayuso
3f0465a9ef netfilter: nf_tables: dynamically allocate hooks per net_device in flowtables
Use a list of hooks per device instead an array.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-23 13:01:22 +02:00
Pablo Neira Ayuso
71a8a63b9d netfilter: nf_flow_table: move priority to struct nf_flowtable
Hardware offload needs access to the priority field, store this field in
the nf_flowtable object.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-23 13:01:03 +02:00
Vivien Didelot
7e99e34701 net: dsa: remove dsa_switch_alloc helper
Now that ports are dynamically listed in the fabric, there is no need
to provide a special helper to allocate the dsa_switch structure. This
will give more flexibility to drivers to embed this structure as they
wish in their private structure.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22 12:37:07 -07:00
Vivien Didelot
05f294a852 net: dsa: allocate ports on touch
Allocate the struct dsa_port the first time it is accessed with
dsa_port_touch, and remove the static dsa_port array from the
dsa_switch structure.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22 12:37:07 -07:00
Vivien Didelot
da4561cda2 net: dsa: use ports list to setup default CPU port
Use the new ports list instead of iterating over switches and their
ports when setting up the default CPU port. Unassign it on teardown.

Now that we can iterate over multiple CPU ports, remove dst->cpu_dp.

At the same time, provide a better error message for CPU-less tree.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22 12:37:07 -07:00
Vivien Didelot
fb35c60cba net: dsa: use ports list to setup switches
Use the new ports list instead of iterating over switches and their
ports when setting up the switches and their ports.

At the same time, provide setup states and messages for ports and
switches as it is done for the trees.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22 12:37:06 -07:00
Vivien Didelot
b96ddf254b net: dsa: use ports list in dsa_to_port
Use the new ports list instead of accessing the dsa_switch array
of ports in the dsa_to_port helper.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22 12:37:06 -07:00
Vivien Didelot
ab8ccae122 net: dsa: add ports list in the switch fabric
Add a list of switch ports within the switch fabric. This will help the
lookup of a port inside the whole fabric, and it is the first step
towards supporting multiple CPU ports, before deprecating the usage of
the unique dst->cpu_dp pointer.

In preparation for a future allocation of the dsa_port structures,
return -ENOMEM in case no structure is returned, even though this
error cannot be reached yet.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22 12:37:06 -07:00
Vivien Didelot
68bb8ea8ad net: dsa: use dsa_to_port helper everywhere
Do not let the drivers access the ds->ports static array directly
while there is a dsa_to_port helper for this purpose.

At the same time, un-const this helper since the SJA1105 driver
assigns the priv member of the returned dsa_port structure.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-22 12:37:06 -07:00
Eric Dumazet
e7a409c3f4 ipv4: fix IPSKB_FRAG_PMTU handling with fragmentation
This patch removes the iph field from the state structure, which is not
properly initialized. Instead, add a new field to make the "do we want
to set DF" be the state bit and move the code to set the DF flag from
ip_frag_next().

Joint work with Pablo and Linus.

Fixes: 19c3401a91 ("net: ipv4: place control buffer handling away from fragmentation iterators")
Reported-by: Patrick Schönthaler <patrick@notvads.ovh>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-21 10:46:42 -07:00
David S. Miller
2f184393e0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Several cases of overlapping changes which were for the most
part trivially resolvable.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-20 10:43:00 -07:00
Eric Dumazet
2a06b8982f net: reorder 'struct net' fields to avoid false sharing
Intel test robot reported a ~7% regression on TCP_CRR tests
that they bisected to the cited commit.

Indeed, every time a new TCP socket is created or deleted,
the atomic counter net->count is touched (via get_net(net)
and put_net(net) calls)

So cpus might have to reload a contended cache line in
net_hash_mix(net) calls.

We need to reorder 'struct net' fields to move @hash_mix
in a read mostly cache line.

We move in the first cache line fields that can be
dirtied often.

We probably will have to address in a followup patch
the __randomize_layout that was added in linux-4.13,
since this might break our placement choices.

Fixes: 355b985537 ("netns: provide pure entropy for net_hash_mix()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-19 12:21:53 -07:00
Florian Westphal
2ad9d7747c netfilter: conntrack: free extension area immediately
Instead of waiting for rcu grace period just free it directly.

This is safe because conntrack lookup doesn't consider extensions.

Other accesses happen while ct->ext can't be free'd, either because
a ct refcount was taken or because the conntrack hash bucket lock or
the dying list spinlock have been taken.

This allows to remove __krealloc in a followup patch, netfilter was the
only user.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-10-17 11:47:02 +02:00
David S. Miller
7e0d15ee0d A few more small things, nothing really stands out:
* minstrel improvements from Felix
  * a TX aggregation simplification
  * some additional capabilities for hwsim
  * minor cleanups & docs updates
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl2gQp8ACgkQB8qZga/f
 l8T5kA//Yo07t93OespsNwJZXWx7l/WWBtIydnTAk9hNXDV4kd6oLgN0oadqpx7g
 5bryCRqmS4vx2IjiEQiSK08JqmhpruQSXYe8oixVVCOknw2pfkK6lr+OHCqJO11A
 iWu5Nz7bTl4pfSO0XIfWk1STUjNuXWCTUgbLSwU4oaoqk8oib2VeV1QdXX0hvgXF
 gSlToWQqliI/c6HS69iUJGRqXZCMO7GPWE9Sqj8cvmeAFXWQz9zcan6Fcd2XJyLq
 qJxNbxGD0JQ6vdbg2bFnio8PlwYMJ7ohrRDds8euYzViVtyTVZ6WtD9/gKB6UGVe
 RS5NEsmZLISCrQbV8nK/q0G/mBdNNegj4ezUkWxMvuYDEvl83Xniyz5CoAC++9mp
 0M0//+NgwoVqDvaoV0s+TZBYv5arJyeUCY9kkmPCFFVV6cvmXfRFpn9yU95he2Eb
 duY5P+uKNlFU+sYVh1d6QC26mEAIa0y4qZszp3HurVWXe/aG/fLumW2USAOdqDOw
 9HF9vOqGc3FRZTX1l15F+5nPn9gMyMJJGqOeT4oS1mQJT/KdzQCGLmhQ+IR+00Un
 zF6QsfCCtbuO5xLErqoARa7qKzddDxgkEBbdmQmjUwdyzAxSxZxGDBLLcpZ0OQwo
 Kxx7ELz97f55unLbByDrFMoZvEXaCeGcbZeTJWGvDRElw/BhRJU=
 =IJ7D
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-net-next-2019-10-11' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
A few more small things, nothing really stands out:
 * minstrel improvements from Felix
 * a TX aggregation simplification
 * some additional capabilities for hwsim
 * minor cleanups & docs updates
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13 11:29:07 -07:00
Eric Dumazet
ab4e846a82 tcp: annotate sk->sk_wmem_queued lockless reads
For the sake of tcp_poll(), there are few places where we fetch
sk->sk_wmem_queued while this field can change from IRQ or other cpu.

We need to add READ_ONCE() annotations, and also make sure write
sides use corresponding WRITE_ONCE() to avoid store-tearing.

sk_wmem_queued_add() helper is added so that we can in
the future convert to ADD_ONCE() or equivalent if/when
available.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13 10:13:08 -07:00
Eric Dumazet
e292f05e0d tcp: annotate sk->sk_sndbuf lockless reads
For the sake of tcp_poll(), there are few places where we fetch
sk->sk_sndbuf while this field can change from IRQ or other cpu.

We need to add READ_ONCE() annotations, and also make sure write
sides use corresponding WRITE_ONCE() to avoid store-tearing.

Note that other transports probably need similar fixes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13 10:13:08 -07:00
Eric Dumazet
ebb3b78db7 tcp: annotate sk->sk_rcvbuf lockless reads
For the sake of tcp_poll(), there are few places where we fetch
sk->sk_rcvbuf while this field can change from IRQ or other cpu.

We need to add READ_ONCE() annotations, and also make sure write
sides use corresponding WRITE_ONCE() to avoid store-tearing.

Note that other transports probably need similar fixes.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13 10:13:08 -07:00
Eric Dumazet
e0d694d638 tcp: annotate tp->snd_nxt lockless reads
There are few places where we fetch tp->snd_nxt while
this field can change from IRQ or other cpu.

We need to add READ_ONCE() annotations, and also make
sure write sides use corresponding WRITE_ONCE() to avoid
store-tearing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13 10:13:08 -07:00
Eric Dumazet
0f31746452 tcp: annotate tp->write_seq lockless reads
There are few places where we fetch tp->write_seq while
this field can change from IRQ or other cpu.

We need to add READ_ONCE() annotations, and also make
sure write sides use corresponding WRITE_ONCE() to avoid
store-tearing.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-13 10:13:08 -07:00
Jiri Pirko
e7a981050a devlink: propagate extack down to health reporter ops
During health reporter operations, driver might want to fill-up
the extack message, so propagate extack down to the health reporter ops.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-11 21:02:30 -07:00
Eric Dumazet
70c2655849 net: silence KCSAN warnings about sk->sk_backlog.len reads
sk->sk_backlog.len can be written by BH handlers, and read
from process contexts in a lockless way.

Note the write side should also use WRITE_ONCE() or a variant.
We need some agreement about the best way to do this.

syzbot reported :

BUG: KCSAN: data-race in tcp_add_backlog / tcp_grow_window.isra.0

write to 0xffff88812665f32c of 4 bytes by interrupt on cpu 1:
 sk_add_backlog include/net/sock.h:934 [inline]
 tcp_add_backlog+0x4a0/0xcc0 net/ipv4/tcp_ipv4.c:1737
 tcp_v4_rcv+0x1aba/0x1bf0 net/ipv4/tcp_ipv4.c:1925
 ip_protocol_deliver_rcu+0x51/0x470 net/ipv4/ip_input.c:204
 ip_local_deliver_finish+0x110/0x140 net/ipv4/ip_input.c:231
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_local_deliver+0x133/0x210 net/ipv4/ip_input.c:252
 dst_input include/net/dst.h:442 [inline]
 ip_rcv_finish+0x121/0x160 net/ipv4/ip_input.c:413
 NF_HOOK include/linux/netfilter.h:305 [inline]
 NF_HOOK include/linux/netfilter.h:299 [inline]
 ip_rcv+0x18f/0x1a0 net/ipv4/ip_input.c:523
 __netif_receive_skb_one_core+0xa7/0xe0 net/core/dev.c:5004
 __netif_receive_skb+0x37/0xf0 net/core/dev.c:5118
 netif_receive_skb_internal+0x59/0x190 net/core/dev.c:5208
 napi_skb_finish net/core/dev.c:5671 [inline]
 napi_gro_receive+0x28f/0x330 net/core/dev.c:5704
 receive_buf+0x284/0x30b0 drivers/net/virtio_net.c:1061
 virtnet_receive drivers/net/virtio_net.c:1323 [inline]
 virtnet_poll+0x436/0x7d0 drivers/net/virtio_net.c:1428
 napi_poll net/core/dev.c:6352 [inline]
 net_rx_action+0x3ae/0xa50 net/core/dev.c:6418

read to 0xffff88812665f32c of 4 bytes by task 7292 on cpu 0:
 tcp_space include/net/tcp.h:1373 [inline]
 tcp_grow_window.isra.0+0x6b/0x480 net/ipv4/tcp_input.c:413
 tcp_event_data_recv+0x68f/0x990 net/ipv4/tcp_input.c:717
 tcp_rcv_established+0xbfe/0xf50 net/ipv4/tcp_input.c:5618
 tcp_v4_do_rcv+0x381/0x4e0 net/ipv4/tcp_ipv4.c:1542
 sk_backlog_rcv include/net/sock.h:945 [inline]
 __release_sock+0x135/0x1e0 net/core/sock.c:2427
 release_sock+0x61/0x160 net/core/sock.c:2943
 tcp_recvmsg+0x63b/0x1a30 net/ipv4/tcp.c:2181
 inet_recvmsg+0xbb/0x250 net/ipv4/af_inet.c:838
 sock_recvmsg_nosec net/socket.c:871 [inline]
 sock_recvmsg net/socket.c:889 [inline]
 sock_recvmsg+0x92/0xb0 net/socket.c:885
 sock_read_iter+0x15f/0x1e0 net/socket.c:967
 call_read_iter include/linux/fs.h:1864 [inline]
 new_sync_read+0x389/0x4f0 fs/read_write.c:414
 __vfs_read+0xb1/0xc0 fs/read_write.c:427
 vfs_read fs/read_write.c:461 [inline]
 vfs_read+0x143/0x2c0 fs/read_write.c:446

Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 7292 Comm: syz-fuzzer Not tainted 5.3.0+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-09 21:43:00 -07:00
Eric Dumazet
eac66402d1 net: annotate sk->sk_rcvlowat lockless reads
sock_rcvlowat() or int_sk_rcvlowat() might be called without the socket
lock for example from tcp_poll().

Use READ_ONCE() to document the fact that other cpus might change
sk->sk_rcvlowat under us and avoid KCSAN splats.

Use WRITE_ONCE() on write sides too.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-09 21:43:00 -07:00
Eric Dumazet
1f142c17d1 tcp: annotate lockless access to tcp_memory_pressure
tcp_memory_pressure is read without holding any lock,
and its value could be changed on other cpus.

Use READ_ONCE() to annotate these lockless reads.

The write side is already using atomic ops.

Fixes: b8da51ebb1 ("tcp: introduce tcp_under_memory_pressure()")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-09 21:35:00 -07:00
Eric Dumazet
60b173ca3d net: add {READ|WRITE}_ONCE() annotations on ->rskq_accept_head
reqsk_queue_empty() is called from inet_csk_listen_poll() while
other cpus might write ->rskq_accept_head value.

Use {READ|WRITE}_ONCE() to avoid compiler tricks
and potential KCSAN splats.

Fixes: fff1f3001c ("tcp: add a spinlock to protect struct request_sock_queue")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-09 21:34:31 -07:00
Ursula Braun
c3d9494e68 net/smc: no new connections on disappearing devices
Add a "going_away" indication to ISM devices and IB ports and
avoid creation of new connections on such disappearing devices.

And do not handle ISM events if ISM device is disappearing.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-09 19:45:44 -07:00
Ursula Braun
a0a62ee15a net/smc: separate locks for SMCD and SMCR link group lists
This patch introduces separate locks for the split SMCD and SMCR
link group lists.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-09 19:45:44 -07:00
Ursula Braun
a2351c5d86 net/smc: separate SMCD and SMCR link group lists
Currently SMCD and SMCR link groups are maintained in one list.
To facilitate abnormal termination handling they are split into
a separate list for SMCR link groups and separate lists for SMCD
link groups per SMCD device.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-09 19:45:44 -07:00
Xin Long
b6e6b5f1da sctp: add SCTP_SEND_FAILED_EVENT event
This patch is to add a new event SCTP_SEND_FAILED_EVENT described in
rfc6458#section-6.1.11. It's a update of SCTP_SEND_FAILED event:

  struct sctp_sndrcvinfo ssf_info is replaced with
  struct sctp_sndinfo ssfe_info in struct sctp_send_failed_event.

SCTP_SEND_FAILED is being deprecated, but we don't remove it in this
patch. Both are being processed in sctp_datamsg_destroy() when the
corresp event flag is set.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-09 17:06:58 -07:00
Xin Long
4b7740324e sctp: add SCTP_ADDR_ADDED event
A helper sctp_ulpevent_nofity_peer_addr_change() will be extracted
to make peer_addr_change event and enqueue it, and the helper will
be called in sctp_assoc_add_peer() to send SCTP_ADDR_ADDED event.

This event is described in rfc6458#section-6.1.2:

  SCTP_ADDR_ADDED:  The address is now part of the association.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-09 17:06:57 -07:00
Xin Long
819be8108f sctp: add chunks to sk_backlog when the newsk sk_socket is not set
This patch is to fix a NULL-ptr deref in selinux_socket_connect_helper:

  [...] kasan: GPF could be caused by NULL-ptr deref or user memory access
  [...] RIP: 0010:selinux_socket_connect_helper+0x94/0x460
  [...] Call Trace:
  [...]  selinux_sctp_bind_connect+0x16a/0x1d0
  [...]  security_sctp_bind_connect+0x58/0x90
  [...]  sctp_process_asconf+0xa52/0xfd0 [sctp]
  [...]  sctp_sf_do_asconf+0x785/0x980 [sctp]
  [...]  sctp_do_sm+0x175/0x5a0 [sctp]
  [...]  sctp_assoc_bh_rcv+0x285/0x5b0 [sctp]
  [...]  sctp_backlog_rcv+0x482/0x910 [sctp]
  [...]  __release_sock+0x11e/0x310
  [...]  release_sock+0x4f/0x180
  [...]  sctp_accept+0x3f9/0x5a0 [sctp]
  [...]  inet_accept+0xe7/0x720

It was caused by that the 'newsk' sk_socket was not set before going to
security sctp hook when processing asconf chunk with SCTP_PARAM_ADD_IP
or SCTP_PARAM_SET_PRIMARY:

  inet_accept()->
    sctp_accept():
      lock_sock():
          lock listening 'sk'
                                          do_softirq():
                                            sctp_rcv():  <-- [1]
                                                asconf chunk arrives and
                                                enqueued in 'sk' backlog
      sctp_sock_migrate():
          set asoc's sk to 'newsk'
      release_sock():
          sctp_backlog_rcv():
            lock 'newsk'
            sctp_process_asconf()  <-- [2]
            unlock 'newsk'
    sock_graft():
        set sk_socket  <-- [3]

As it shows, at [1] the asconf chunk would be put into the listening 'sk'
backlog, as accept() was holding its sock lock. Then at [2] asconf would
get processed with 'newsk' as asoc's sk had been set to 'newsk'. However,
'newsk' sk_socket is not set until [3], while selinux_sctp_bind_connect()
would deref it, then kernel crashed.

Here to fix it by adding the chunk to sk_backlog until newsk sk_socket is
set when .accept() is done.

Note that sk->sk_socket can be NULL when the sock is closed, so SOCK_DEAD
flag is also needed to check in sctp_newsk_ready().

Thanks to Ondrej for reviewing the code.

Fixes: d452930fd3 ("selinux: Add SCTP support")
Reported-by: Ying Xu <yinxu@redhat.com>
Suggested-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-09 16:27:04 -07:00
Qian Cai
5facae4f35 locking/lockdep: Remove unused @nested argument from lock_release()
Since the following commit:

  b4adfe8e05 ("locking/lockdep: Remove unused argument in __lock_release")

@nested is no longer used in lock_release(), so remove it from all
lock_release() calls and friends.

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Will Deacon <will@kernel.org>
Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: airlied@linux.ie
Cc: akpm@linux-foundation.org
Cc: alexander.levin@microsoft.com
Cc: daniel@iogearbox.net
Cc: davem@davemloft.net
Cc: dri-devel@lists.freedesktop.org
Cc: duyuyang@gmail.com
Cc: gregkh@linuxfoundation.org
Cc: hannes@cmpxchg.org
Cc: intel-gfx@lists.freedesktop.org
Cc: jack@suse.com
Cc: jlbec@evilplan.or
Cc: joonas.lahtinen@linux.intel.com
Cc: joseph.qi@linux.alibaba.com
Cc: jslaby@suse.com
Cc: juri.lelli@redhat.com
Cc: maarten.lankhorst@linux.intel.com
Cc: mark@fasheh.com
Cc: mhocko@kernel.org
Cc: mripard@kernel.org
Cc: ocfs2-devel@oss.oracle.com
Cc: rodrigo.vivi@intel.com
Cc: sean@poorly.run
Cc: st@kernel.org
Cc: tj@kernel.org
Cc: tytso@mit.edu
Cc: vdavydov.dev@gmail.com
Cc: vincent.guittot@linaro.org
Cc: viro@zeniv.linux.org.uk
Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar <mingo@kernel.org>
2019-10-09 12:46:10 +02:00
Alexey Dobriyan
fd1ac07f3f xfrm: ifdef setsockopt(UDP_ENCAP_ESPINUDP/UDP_ENCAP_ESPINUDP_NON_IKE)
If IPsec is not configured, there is no reason to delay the inevitable.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-10-09 06:55:32 +02:00
Jakub Kicinski
a17fd2cf2d A number of fixes:
* allow scanning when operating on radar channels in
    ETSI regdomains
  * accept deauth frames in IBSS - we have code to parse
    and handle them, but were dropping them early
  * fix an allocation failure path in hwsim
  * fix a failure path memory leak in nl80211 FTM code
  * fix RCU handling & locking in multi-BSSID parsing
  * reject malformed SSID in mac80211 (this shouldn't
    really be able to happen, but defense in depth)
  * avoid userspace buffer overrun in ancient wext code
    if SSID was too long
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl2cf4YACgkQB8qZga/f
 l8T+FA/+J4yDjue4Zw1cX8jgBr3+8l2+5sJIs59rGAcR30HYHtQvkaq4UrOy3kJ+
 w4zB2lqMIeMA/tT3+OaTLF2ELz1lBN5VqZ4fnNguFuYDOkNjzkZwQCe51z6gMwrW
 xk3dxLJz+QCDO7GF3nQer0VFBChL7fVdlIYyEoq/obwYylOUqyRGxtXa8yIaxaEe
 xE7OK67g8rVVestOGO+i/7F/ZJ6DX0hQ0ounWRoa1tTHvGjInD7t2qu69n6/66zO
 5AIVbDXGaQ+Eegk54jlJcJKQI367WNpRqOJ2HUonkoAKWID8kxbp8LLWne3DqC9i
 Ffmme2/johrGSlvgPBUCV9SUmxulQnf8vpt8mKMh2Po+1mLInFltPSPckkpoe12S
 5ps47miIZQDnVEfaojG7I2LMlfSYM1olBkAQqcqiizMISn4AYvu8H0NXJBJXV+7j
 ezgcyPCi9x2wWYy2KhKsoXPzHo8cdRUT0HbmQUe4zUXIhrlH7CZHHhkgnzR04Qz5
 Mh1TWnKr1RJEC5sua1cVDeEBXyjJDfUI+zRuq6jIvWIya+QEQ0ZWUkdjTtLFIJkY
 89e932SuHsUFEYqYYebyDnUQ7q8ZNGPVNpfC7/CTnU/T/uJoTFYDsB6ocmK3k9Cn
 4AFC/EBXD4WAl2FXDdUdAMPyLLxQwc9AxJlXMq1LOyCIvQDGopY=
 =k685
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2019-10-08' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
A number of fixes:
 * allow scanning when operating on radar channels in
   ETSI regdomains
 * accept deauth frames in IBSS - we have code to parse
   and handle them, but were dropping them early
 * fix an allocation failure path in hwsim
 * fix a failure path memory leak in nl80211 FTM code
 * fix RCU handling & locking in multi-BSSID parsing
 * reject malformed SSID in mac80211 (this shouldn't
   really be able to happen, but defense in depth)
 * avoid userspace buffer overrun in ancient wext code
   if SSID was too long
====================

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-08 19:31:01 -07:00
Eric Biggers
b74555de21 llc: fix sk_buff leak in llc_conn_service()
syzbot reported:

    BUG: memory leak
    unreferenced object 0xffff88811eb3de00 (size 224):
       comm "syz-executor559", pid 7315, jiffies 4294943019 (age 10.300s)
       hex dump (first 32 bytes):
         00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
         00 a0 38 24 81 88 ff ff 00 c0 f2 15 81 88 ff ff  ..8$............
       backtrace:
         [<000000008d1c66a1>] kmemleak_alloc_recursive  include/linux/kmemleak.h:55 [inline]
         [<000000008d1c66a1>] slab_post_alloc_hook mm/slab.h:439 [inline]
         [<000000008d1c66a1>] slab_alloc_node mm/slab.c:3269 [inline]
         [<000000008d1c66a1>] kmem_cache_alloc_node+0x153/0x2a0 mm/slab.c:3579
         [<00000000447d9496>] __alloc_skb+0x6e/0x210 net/core/skbuff.c:198
         [<000000000cdbf82f>] alloc_skb include/linux/skbuff.h:1058 [inline]
         [<000000000cdbf82f>] llc_alloc_frame+0x66/0x110 net/llc/llc_sap.c:54
         [<000000002418b52e>] llc_conn_ac_send_sabme_cmd_p_set_x+0x2f/0x140  net/llc/llc_c_ac.c:777
         [<000000001372ae17>] llc_exec_conn_trans_actions net/llc/llc_conn.c:475  [inline]
         [<000000001372ae17>] llc_conn_service net/llc/llc_conn.c:400 [inline]
         [<000000001372ae17>] llc_conn_state_process+0x1ac/0x640  net/llc/llc_conn.c:75
         [<00000000f27e53c1>] llc_establish_connection+0x110/0x170  net/llc/llc_if.c:109
         [<00000000291b2ca0>] llc_ui_connect+0x10e/0x370 net/llc/af_llc.c:477
         [<000000000f9c740b>] __sys_connect+0x11d/0x170 net/socket.c:1840
         [...]

The bug is that most callers of llc_conn_send_pdu() assume it consumes a
reference to the skb, when actually due to commit b85ab56c3f ("llc:
properly handle dev_queue_xmit() return value") it doesn't.

Revert most of that commit, and instead make the few places that need
llc_conn_send_pdu() to *not* consume a reference call skb_get() before.

Fixes: b85ab56c3f ("llc: properly handle dev_queue_xmit() return value")
Reported-by: syzbot+6b825a6494a04cc0e3f7@syzkaller.appspotmail.com
Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-10-08 13:23:05 -07:00
Haishuang Yan
5d5a0815f8 ipvs: batch __ip_vs_cleanup
It's better to batch __ip_vs_cleanup to speedup ipvs
connections dismantle.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2019-10-08 11:28:33 +02:00
Aaron Komisar
dc0c18ed22 mac80211: fix scan when operating on DFS channels in ETSI domains
In non-ETSI regulatory domains scan is blocked when operating channel
is a DFS channel. For ETSI, however, once DFS channel is marked as
available after the CAC, this channel will remain available (for some
time) even after leaving this channel.

Therefore a scan can be done without any impact on the availability
of the DFS channel as no new CAC is required after the scan.

Enable scan in mac80211 in these cases.

Signed-off-by: Aaron Komisar <aaron.komisar@tandemg.com>
Link: https://lore.kernel.org/r/1570024728-17284-1-git-send-email-aaron.komisar@tandemg.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-10-07 22:10:50 +02:00
Jakub Kicinski
bc76e5bb12 net/tls: store decrypted on a single bit
Use a single bit instead of boolean to remember if packet
was already decrypted.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-07 09:58:28 -04:00
Jakub Kicinski
5c5458ec9d net/tls: store async_capable on a single bit
Store async_capable on a single bit instead of a full integer
to save space.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-07 09:58:27 -04:00
Jakub Kicinski
4de30a8d58 net/tls: pass context to tls_device_decrypted()
Avoid unnecessary pointer chasing and calculations, callers already
have most of the state tls_device_decrypted() needs.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-07 09:58:27 -04:00
David S. Miller
3289086210 ipv6: Make ipv6_mc_may_pull() return bool.
Consistent with how pskb_may_pull() also now does so.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-07 09:37:27 -04:00
Jiri Pirko
265ecd4fa3 net: genetlink: remove unused genl_family_attrbuf()
genl_family_attrbuf() function is no longer used by anyone, so remove it.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-06 15:44:47 +02:00
Jiri Pirko
bf813b0afe net: genetlink: parse attrs and store in contect info struct during dumpit
Extend the dumpit info struct for attrs. Instead of existing attribute
validation do parse them and save in the info struct. Caller can benefit
from this and does not have to do parse itself. In order to properly
free attrs, genl_family pointer needs to be added to dumpit info struct
as well.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-06 15:44:47 +02:00
Jiri Pirko
1927f41a22 net: genetlink: introduce dump info struct to be available during dumpit op
Currently the cb->data is taken by ops during non-parallel dumping.
Introduce a new structure genl_dumpit_info and store the ops there.
Distribute the info to both non-parallel and parallel dumping. Also add
a helper genl_dumpit_info() to easily get the info structure in the
dumpit callback from cb.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-06 15:44:46 +02:00
Jiri Pirko
8273fd8454 net: devlink: export devlink net setter
For newly allocated devlink instance allow drivers to set net struct

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-05 16:34:15 -07:00
Jakub Kicinski
d26b698dd3 net/tls: add skeleton of MIB statistics
Add a skeleton structure for adding TLS statistics.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-05 16:29:00 -07:00
Jakub Kicinski
8538d29cea net/tls: add tracing for device/offload events
Add tracing of device-related interaction to aid performance
analysis, especially around resync:

 tls:tls_device_offload_set
 tls:tls_device_rx_resync_send
 tls:tls_device_rx_resync_nh_schedule
 tls:tls_device_rx_resync_nh_delay
 tls:tls_device_tx_resync_req
 tls:tls_device_tx_resync_send

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-05 16:29:00 -07:00
Alexey Dobriyan
193d357d08 net: spread "enum sock_flags"
Some ints are "enum sock_flags" in fact.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-04 14:26:03 -07:00
Jakub Kicinski
0eb8745e03 net/tls: rename tls_hw_* functions tls_toe_*
The tls_hw_* functions are quite confusingly named, since they
are related to the TOE-offload, not TLS_HW offload which doesn't
require TOE. Rename them.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-04 14:07:07 -07:00
Jakub Kicinski
08700dab81 net/tls: move TOE-related code to a separate file
Move tls_hw_* functions to a new, separate source file
to avoid confusion with normal, non-TOE offload.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-04 14:07:07 -07:00
Jakub Kicinski
f21912edd1 net/tls: rename tls_device to tls_toe_device
Rename struct tls_device to struct tls_toe_device to avoid
confusion with normal, non-TOE offload.

No functional changes.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-04 14:07:07 -07:00
Jakub Kicinski
25a3cd8189 net/tls: move TOE-related structures to a separate header
Move tls_device structure and register/unregister functions
to a new header to avoid confusion with normal, non-TOE offload.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-04 14:07:06 -07:00
Jiri Pirko
070c63f20f net: devlink: allow to change namespaces during reload
All devlink instances are created in init_net and stay there for a
lifetime. Allow user to be able to move devlink instances into
namespaces during devlink reload operation. That ensures proper
re-instantiation of driver objects, including netdevices.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-04 11:10:56 -07:00
Jiri Pirko
471f894f10 net: devlink: export devlink net getter
Allow drivers to get net struct for devlink instance.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-04 11:10:56 -07:00
Jiri Pirko
b7a595577e net: fib_notifier: propagate extack down to the notifier block callback
Since errors are propagated all the way up to the caller, propagate
possible extack of the caller all the way down to the notifier block
callback.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-04 11:10:56 -07:00
Jiri Pirko
55c894f762 net: fib_notifier: propagate possible error during fib notifier registration
Unlike events for registered notifier, during the registration, the
errors that happened for the block being registered are not propagated
up to the caller. Make sure the error is propagated for FIB rules and
entries.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-04 11:10:56 -07:00
Jiri Pirko
7c550daffe net: fib_notifier: make FIB notifier per-netns
Currently all users of FIB notifier only cares about events in init_net.
Later in this patchset, users get interested in other namespaces too.
However, for every registered block user is interested only about one
namespace. Make the FIB notifier registration per-netns and avoid
unnecessary calls of notifier block for other namespaces.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-04 11:10:55 -07:00
Johannes Berg
2ce113de31 mac80211: simplify TX aggregation start
There really is no need to make drivers call the
ieee80211_start_tx_ba_cb_irqsafe() function and then
schedule the worker if all we want is to set a bit.

Add a new return value (that was previously considered
invalid) to indicate that the driver is immediately
ready for the session, and make drivers use it. The
only drivers that remain different are the Intel ones
as they need to negotiate more with the firmware.

Link: https://lore.kernel.org/r/1570007543-I152912660131cbab2e5d80b4218238c20f8a06e5@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-10-04 13:58:13 +02:00
Vladimir Oltean
37048e94a2 net: dsa: Remove unused __DSA_SKB_CB macro
The struct __dsa_skb_cb is supposed to span the entire 48-byte skb
control block, while the struct dsa_skb_cb only the portion of it which
is used by the DSA core (the rest is available as private data to
drivers).

The DSA_SKB_CB and __DSA_SKB_CB helpers are supposed to help retrieve
this pointer based on a skb, but it turns out there is nobody directly
interested in the struct __dsa_skb_cb in the kernel. So remove it.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-02 14:17:11 -07:00
Jiri Pirko
a30c7b429f net: introduce per-netns netdevice notifiers
Often the code for example in drivers is interested in getting notifier
call only from certain network namespace. In addition to the existing
global netdevice notifier chain introduce per-netns chains and allow
users to register to that. Eventually this would eliminate unnecessary
overhead in case there are many netdevices in many network namespaces.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-02 11:48:44 -04:00
Jiri Pirko
afa0df5998 net: push loops and nb calls into helper functions
Push iterations over net namespaces and netdevices from
register_netdevice_notifier() and unregister_netdevice_notifier()
into helper functions. Along with that introduce continue_reverse macros
to make the code a bit nicer allowing to get rid of "last" marks.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-02 11:48:44 -04:00
Eric Dumazet
be2644aac3 tcp: add ipv6_addr_v4mapped_loopback() helper
tcp_twsk_unique() has a hard coded assumption about ipv4 loopback
being 127/8

Lets instead use the standard ipv4_is_loopback() method,
in a new ipv6_addr_v4mapped_loopback() helper.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-10-01 13:07:53 -07:00
David S. Miller
c5f095baa8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Add NFT_CHAIN_POLICY_UNSET to replace hardcoded -1 to
   specify that the chain policy is unset. The chain policy
   field is actually defined as an 8-bit unsigned integer.

2) Remove always true condition reported by smatch in
   chain policy check.

3) Fix element lookup on dynamic sets, from Florian Westphal.

4) Use __u8 in ebtables uapi header, from Masahiro Yamada.

5) Bogus EBUSY when removing flowtable after chain flush,
   from Laura Garcia Liebana.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 20:15:00 +02:00
Eric Dumazet
f6c0f5d209 tcp: honor SO_PRIORITY in TIME_WAIT state
ctl packets sent on behalf of TIME_WAIT sockets currently
have a zero skb->priority, which can cause various problems.

In this patch we :

- add a tw_priority field in struct inet_timewait_sock.

- populate it from sk->sk_priority when a TIME_WAIT is created.

- For IPv4, change ip_send_unicast_reply() and its two
  callers to propagate tw_priority correctly.
  ip_send_unicast_reply() no longer changes sk->sk_priority.

- For IPv6, make sure TIME_WAIT sockets pass their tw_priority
  field to tcp_v6_send_response() and tcp_v6_send_ack().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 12:05:02 +02:00
Eric Dumazet
4f6570d720 ipv6: add priority parameter to ip6_xmit()
Currently, ip6_xmit() sets skb->priority based on sk->sk_priority

This is not desirable for TCP since TCP shares the same ctl socket
for a given netns. We want to be able to send RST or ACK packets
with a non zero skb->priority.

This patch has no functional change.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 12:05:02 +02:00
Eric Dumazet
159d2c7d81 sch_netem: fix rcu splat in netem_enqueue()
qdisc_root() use from netem_enqueue() triggers a lockdep warning.

__dev_queue_xmit() uses rcu_read_lock_bh() which is
not equivalent to rcu_read_lock() + local_bh_disable_bh as far
as lockdep is concerned.

WARNING: suspicious RCU usage
5.3.0-rc7+ #0 Not tainted
-----------------------------
include/net/sch_generic.h:492 suspicious rcu_dereference_check() usage!

other info that might help us debug this:

rcu_scheduler_active = 2, debug_locks = 1
3 locks held by syz-executor427/8855:
 #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: lwtunnel_xmit_redirect include/net/lwtunnel.h:92 [inline]
 #0: 00000000b5525c01 (rcu_read_lock_bh){....}, at: ip_finish_output2+0x2dc/0x2570 net/ipv4/ip_output.c:214
 #1: 00000000b5525c01 (rcu_read_lock_bh){....}, at: __dev_queue_xmit+0x20a/0x3650 net/core/dev.c:3804
 #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: spin_lock include/linux/spinlock.h:338 [inline]
 #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_xmit_skb net/core/dev.c:3502 [inline]
 #2: 00000000364bae92 (&(&sch->q.lock)->rlock){+.-.}, at: __dev_queue_xmit+0x14b8/0x3650 net/core/dev.c:3838

stack backtrace:
CPU: 0 PID: 8855 Comm: syz-executor427 Not tainted 5.3.0-rc7+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 lockdep_rcu_suspicious+0x153/0x15d kernel/locking/lockdep.c:5357
 qdisc_root include/net/sch_generic.h:492 [inline]
 netem_enqueue+0x1cfb/0x2d80 net/sched/sch_netem.c:479
 __dev_xmit_skb net/core/dev.c:3527 [inline]
 __dev_queue_xmit+0x15d2/0x3650 net/core/dev.c:3838
 dev_queue_xmit+0x18/0x20 net/core/dev.c:3902
 neigh_hh_output include/net/neighbour.h:500 [inline]
 neigh_output include/net/neighbour.h:509 [inline]
 ip_finish_output2+0x1726/0x2570 net/ipv4/ip_output.c:228
 __ip_finish_output net/ipv4/ip_output.c:308 [inline]
 __ip_finish_output+0x5fc/0xb90 net/ipv4/ip_output.c:290
 ip_finish_output+0x38/0x1f0 net/ipv4/ip_output.c:318
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip_mc_output+0x292/0xf40 net/ipv4/ip_output.c:417
 dst_output include/net/dst.h:436 [inline]
 ip_local_out+0xbb/0x190 net/ipv4/ip_output.c:125
 ip_send_skb+0x42/0xf0 net/ipv4/ip_output.c:1555
 udp_send_skb.isra.0+0x6b2/0x1160 net/ipv4/udp.c:887
 udp_sendmsg+0x1e96/0x2820 net/ipv4/udp.c:1174
 inet_sendmsg+0x9e/0xe0 net/ipv4/af_inet.c:807
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:657
 ___sys_sendmsg+0x3e2/0x920 net/socket.c:2311
 __sys_sendmmsg+0x1bf/0x4d0 net/socket.c:2413
 __do_sys_sendmmsg net/socket.c:2442 [inline]
 __se_sys_sendmmsg net/socket.c:2439 [inline]
 __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2439
 do_syscall_64+0xfd/0x6a0 arch/x86/entry/common.c:296
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-27 10:29:11 +02:00
Laura Garcia Liebana
9b05b6e11d netfilter: nf_tables: bogus EBUSY when deleting flowtable after flush
The deletion of a flowtable after a flush in the same transaction
results in EBUSY. This patch adds an activation and deactivation of
flowtables in order to update the _use_ counter.

Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-25 11:01:19 +02:00
David Ahern
77d5bc7e6a ipv4: Revert removal of rt_uses_gateway
Julian noted that rt_uses_gateway has a more subtle use than 'is gateway
set':
    https://lore.kernel.org/netdev/alpine.LFD.2.21.1909151104060.2546@ja.home.ssi.bg/

Revert that part of the commit referenced in the Fixes tag.

Currently, there are no u8 holes in 'struct rtable'. There is a 4-byte hole
in the second cacheline which contains the gateway declaration. So move
rt_gw_family down to the gateway declarations since they are always used
together, and then re-use that u8 for rt_uses_gateway. End result is that
rtable size is unchanged.

Fixes: 1550c17193 ("ipv4: Prepare rtable for IPv6 gateway")
Reported-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-09-20 18:23:33 -07:00
Pablo Neira Ayuso
ad652f3811 netfilter: nf_tables: add NFT_CHAIN_POLICY_UNSET and use it
Default policy is defined as a unsigned 8-bit field, do not use a
negative value to leave it unset, use this new NFT_CHAIN_POLICY_UNSET
instead.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-20 10:20:02 +02:00
Linus Torvalds
81160dda9a Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from David Miller:

 1) Support IPV6 RA Captive Portal Identifier, from Maciej Żenczykowski.

 2) Use bio_vec in the networking instead of custom skb_frag_t, from
    Matthew Wilcox.

 3) Make use of xmit_more in r8169 driver, from Heiner Kallweit.

 4) Add devmap_hash to xdp, from Toke Høiland-Jørgensen.

 5) Support all variants of 5750X bnxt_en chips, from Michael Chan.

 6) More RTNL avoidance work in the core and mlx5 driver, from Vlad
    Buslov.

 7) Add TCP syn cookies bpf helper, from Petar Penkov.

 8) Add 'nettest' to selftests and use it, from David Ahern.

 9) Add extack support to drop_monitor, add packet alert mode and
    support for HW drops, from Ido Schimmel.

10) Add VLAN offload to stmmac, from Jose Abreu.

11) Lots of devm_platform_ioremap_resource() conversions, from
    YueHaibing.

12) Add IONIC driver, from Shannon Nelson.

13) Several kTLS cleanups, from Jakub Kicinski.

* git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1930 commits)
  mlxsw: spectrum_buffers: Add the ability to query the CPU port's shared buffer
  mlxsw: spectrum: Register CPU port with devlink
  mlxsw: spectrum_buffers: Prevent changing CPU port's configuration
  net: ena: fix incorrect update of intr_delay_resolution
  net: ena: fix retrieval of nonadaptive interrupt moderation intervals
  net: ena: fix update of interrupt moderation register
  net: ena: remove all old adaptive rx interrupt moderation code from ena_com
  net: ena: remove ena_restore_ethtool_params() and relevant fields
  net: ena: remove old adaptive interrupt moderation code from ena_netdev
  net: ena: remove code duplication in ena_com_update_nonadaptive_moderation_interval _*()
  net: ena: enable the interrupt_moderation in driver_supported_features
  net: ena: reimplement set/get_coalesce()
  net: ena: switch to dim algorithm for rx adaptive interrupt moderation
  net: ena: add intr_moder_rx_interval to struct ena_com_dev and use it
  net: phy: adin: implement Energy Detect Powerdown mode via phy-tunable
  ethtool: implement Energy Detect Powerdown support via phy-tunable
  xen-netfront: do not assume sk_buff_head list is empty in error handling
  s390/ctcm: Delete unnecessary checks before the macro call “dev_kfree_skb”
  net: ena: don't wake up tx queue when down
  drop_monitor: Better sanitize notified packets
  ...
2019-09-18 12:34:53 -07:00
Linus Torvalds
8b53c76533 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Add the ability to abort a skcipher walk.

  Algorithms:
   - Fix XTS to actually do the stealing.
   - Add library helpers for AES and DES for single-block users.
   - Add library helpers for SHA256.
   - Add new DES key verification helper.
   - Add surrounding bits for ESSIV generator.
   - Add accelerations for aegis128.
   - Add test vectors for lzo-rle.

  Drivers:
   - Add i.MX8MQ support to caam.
   - Add gcm/ccm/cfb/ofb aes support in inside-secure.
   - Add ofb/cfb aes support in media-tek.
   - Add HiSilicon ZIP accelerator support.

  Others:
   - Fix potential race condition in padata.
   - Use unbound workqueues in padata"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (311 commits)
  crypto: caam - Cast to long first before pointer conversion
  crypto: ccree - enable CTS support in AES-XTS
  crypto: inside-secure - Probe transform record cache RAM sizes
  crypto: inside-secure - Base RD fetchcount on actual RD FIFO size
  crypto: inside-secure - Base CD fetchcount on actual CD FIFO size
  crypto: inside-secure - Enable extended algorithms on newer HW
  crypto: inside-secure: Corrected configuration of EIP96_TOKEN_CTRL
  crypto: inside-secure - Add EIP97/EIP197 and endianness detection
  padata: remove cpu_index from the parallel_queue
  padata: unbind parallel jobs from specific CPUs
  padata: use separate workqueues for parallel and serial work
  padata, pcrypt: take CPU hotplug lock internally in padata_alloc_possible
  crypto: pcrypt - remove padata cpumask notifier
  padata: make padata_do_parallel find alternate callback CPU
  workqueue: require CPU hotplug read exclusion for apply_workqueue_attrs
  workqueue: unconfine alloc/apply/free_workqueue_attrs()
  padata: allocate workqueue internally
  arm64: dts: imx8mq: Add CAAM node
  random: Use wait_event_freezable() in add_hwgenerator_randomness()
  crypto: ux500 - Fix COMPILE_TEST warnings
  ...
2019-09-18 12:11:14 -07:00
David S. Miller
1bab8d4c48 Merge ra.kernel.org:/pub/scm/linux/kernel/git/netdev/net
Pull in bug fixes from 'net' tree for the merge window.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-17 23:51:10 +02:00
Vladimir Oltean
47d23af292 net: dsa: Pass ndo_setup_tc slave callback to drivers
DSA currently handles shared block filters (for the classifier-action
qdisc) in the core due to what I believe are simply pragmatic reasons -
hiding the complexity from drivers and offerring a simple API for port
mirroring.

Extend the dsa_slave_setup_tc function by passing all other qdisc
offloads to the driver layer, where the driver may choose what it
implements and how. DSA is simply a pass-through in this case.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Acked-by: Kurt Kanzenbach <kurt@linutronix.de>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-16 21:32:57 +02:00
Vinicius Costa Gomes
9c66d15646 taprio: Add support for hardware offloading
This allows taprio to offload the schedule enforcement to capable
network cards, resulting in more precise windows and less CPU usage.

The gate mask acts on traffic classes (groups of queues of same
priority), as specified in IEEE 802.1Q-2018, and following the existing
taprio and mqprio semantics.
It is up to the driver to perform conversion between tc and individual
netdev queues if for some reason it needs to make that distinction.

Full offload is requested from the network interface by specifying
"flags 2" in the tc qdisc creation command, which in turn corresponds to
the TCA_TAPRIO_ATTR_FLAG_FULL_OFFLOAD bit.

The important detail here is the clockid which is implicitly /dev/ptpN
for full offload, and hence not configurable.

A reference counting API is added to support the use case where Ethernet
drivers need to keep the taprio offload structure locally (i.e. they are
a multi-port switch driver, and configuring a port depends on the
settings of other ports as well). The refcount_t variable is kept in a
private structure (__tc_taprio_qopt_offload) and not exposed to drivers.

In the future, the private structure might also be expanded with a
backpointer to taprio_sched *q, to implement the notification system
described in the patch (of when admin became oper, or an error occurred,
etc, so the offload can be monitored with 'tc qdisc show').

Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
Signed-off-by: Voon Weifeng <weifeng.voon@intel.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-16 21:32:57 +02:00
Vlad Buslov
470d5060e6 net: sched: use get_dev() action API in flow_action infra
When filling in hardware intermediate representation tc_setup_flow_action()
directly obtains, checks and takes reference to dev used by mirred action,
instead of using act->ops->get_dev() API created specifically for this
purpose. In order to remove code duplication, refactor flow_action infra to
use action API when obtaining mirred action target dev. Extend get_dev()
with additional argument that is used to provide dev destructor to the
user.

Fixes: 5a6ff4b13d ("net: sched: take reference to action dev before calling offloads")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-16 09:18:03 +02:00
Vlad Buslov
4a5da47d5c net: sched: take reference to psample group in flow_action infra
With recent patch set that removed rtnl lock dependency from cls hardware
offload API rtnl lock is only taken when reading action data and can be
released after action-specific data is parsed into intermediate
representation. However, sample action psample group is passed by pointer
without obtaining reference to it first, which makes it possible to
concurrently overwrite the action and deallocate object pointed by
psample_group pointer after rtnl lock is released but before driver
finished using the pointer.

To prevent such race condition, obtain reference to psample group while it
is used by flow_action infra. Extend psample API with function
psample_group_take() that increments psample group reference counter.
Extend struct tc_action_ops with new get_psample_group() API. Implement the
API for action sample using psample_group_take() and already existing
psample_group_put() as a destructor. Use it in tc_setup_flow_action() to
take reference to psample group pointed to by entry->sample.psample_group
and release it in tc_cleanup_flow_action().

Disable bh when taking psample_groups_lock. The lock is now taken while
holding action tcf_lock that is used by data path and requires bh to be
disabled, so doing the same for psample_groups_lock is necessary to
preserve SOFTIRQ-irq-safety.

Fixes: 918190f50e ("net: sched: flower: don't take rtnl lock for cls hw offloads API")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-16 09:18:03 +02:00
Vlad Buslov
1158958a21 net: sched: extend flow_action_entry with destructor
Generalize flow_action_entry cleanup by extending the structure with
pointer to destructor function. Set the destructor in
tc_setup_flow_action(). Refactor tc_cleanup_flow_action() to call
entry->destructor() instead of using switch that dispatches by entry->id
and manually executes cleanup.

This refactoring is necessary for following patches in this series that
require destructor to use tc_action->ops callbacks that can't be easily
obtained in tc_cleanup_flow_action().

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-16 09:18:02 +02:00
Willem de Bruijn
acdcecc612 udp: correct reuseport selection with connected sockets
UDP reuseport groups can hold a mix unconnected and connected sockets.
Ensure that connections only receive all traffic to their 4-tuple.

Fast reuseport returns on the first reuseport match on the assumption
that all matches are equal. Only if connections are present, return to
the previous behavior of scoring all sockets.

Record if connections are present and if so (1) treat such connected
sockets as an independent match from the group, (2) only return
2-tuple matches from reuseport and (3) do not return on the first
2-tuple reuseport match to allow for a higher scoring match later.

New field has_conns is set without locks. No other fields in the
bitmap are modified at runtime and the field is only ever set
unconditionally, so an RMW cannot miss a change.

Fixes: e32ea7e747 ("soreuseport: fast reuseport UDP socket selection")
Link: http://lkml.kernel.org/r/CA+FuTSfRP09aJNYRt04SS6qj22ViiOEWaWmLAwX0psk8-PGNxw@mail.gmail.com
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Craig Gallek <kraig@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-16 09:02:18 +02:00
Paolo Abeni
d518d2ed86 net/sched: fix race between deactivation and dequeue for NOLOCK qdisc
The test implemented by some_qdisc_is_busy() is somewhat loosy for
NOLOCK qdisc, as we may hit the following scenario:

CPU1						CPU2
// in net_tx_action()
clear_bit(__QDISC_STATE_SCHED...);
						// in some_qdisc_is_busy()
						val = (qdisc_is_running(q) ||
						       test_bit(__QDISC_STATE_SCHED,
								&q->state));
						// here val is 0 but...
qdisc_run(q)
// ... CPU1 is going to run the qdisc next

As a conseguence qdisc_run() in net_tx_action() can race with qdisc_reset()
in dev_qdisc_reset(). Such race is not possible for !NOLOCK qdisc as
both the above bit operations are under the root qdisc lock().

After commit 021a17ed79 ("pfifo_fast: drop unneeded additional lock on dequeue")
the race can cause use after free and/or null ptr dereference, but the root
cause is likely older.

This patch addresses the issue explicitly checking for deactivation under
the seqlock for NOLOCK qdisc, so that the qdisc_run() in the critical
scenario becomes a no-op.

Note that the enqueue() op can still execute concurrently with dev_qdisc_reset(),
but that is safe due to the skb_array() locking, and we can't avoid that
for NOLOCK qdiscs.

Fixes: 021a17ed79 ("pfifo_fast: drop unneeded additional lock on dequeue")
Reported-by: Li Shuang <shuali@redhat.com>
Reported-and-tested-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-15 20:35:55 +02:00
David S. Miller
aa2eaa8c27 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor overlapping changes in the btusb and ixgbe drivers.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-15 14:17:27 +02:00
Jiri Pirko
2670ac2625 net: devlink: move reload fail indication to devlink core and expose to user
Currently the fact that devlink reload failed is stored in drivers.
Move this flag into devlink core. Also, expose it to the user.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-13 22:11:14 +02:00
Jiri Pirko
97691069dc net: devlink: split reload op into two
In order to properly implement failure indication during reload,
split the reload op into two ops, one for down phase and one for
up phase.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-13 22:11:14 +02:00
Willem de Bruijn
c6af0c227a ip: support SO_MARK cmsg
Enable setting skb->mark for UDP and RAW sockets using cmsg.

This is analogous to existing support for TOS, TTL, txtime, etc.

Packet sockets already support this as of commit c7d39e3263
("packet: support per-packet fwmark for af_packet sendmsg").

Similar to other fields, implement by
1. initialize the sockcm_cookie.mark from socket option sk_mark
2. optionally overwrite this in ip_cmsg_send/ip6_datagram_send_ctl
3. initialize inet_cork.mark from sockcm_cookie.mark
4. initialize each (usually just one) skb->mark from inet_cork.mark

Step 1 is handled in one location for most protocols by ipcm_init_sk
as of commit 351782067b ("ipv4: ipcm_cookie initializers").

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-13 21:44:19 +02:00
David S. Miller
022c10d6c7 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Fix error path of nf_tables_updobj(), from Dan Carpenter.

2) Move large structure away from stack in the nf_tables offload
   infrastructure, from Arnd Bergmann.

3) Move indirect flow_block logic to nf_tables_offload.

4) Support for synproxy objects, from Fernando Fernandez Mancera.

5) Support for fwd and dup offload.

6) Add __nft_offload_get_chain() helper, this implicitly fixes missing
   mutex and check for offload flags in the indirect block support,
   patch from wenxu.

7) Remove rules on device unregistration, from wenxu. This includes
   two preparation patches to reuse nft_flow_offload_chain() and
   nft_flow_offload_rule().

Large batch from Jeremy Sowden to make a second pass to the
CONFIG_HEADER_TEST support and a bit of housekeeping:

8) Missing include guard in conntrack label header, from Jeremy Sowden.

9) A few coding style errors: trailing whitespace, incorrect indent in
   Kconfig, and semicolons at the end of function definitions.

10) Remove unused ipt_init() and ip6t_init() declarations.

11) Inline xt_hashlimit, ebt_802_3 and xt_physdev headers. They are
    only used once.

12) Update include directive in several netfilter files.

13) Remove unused include/net/netfilter/ipv6/nf_conntrack_icmpv6.h.

14) Move nf_ip6_ext_hdr() to include/linux/netfilter_ipv6.h

15) Move several synproxy structure definitions to nf_synproxy.h

16) Move nf_bridge_frag_data structure to include/linux/netfilter_bridge.h

17) Clean up static inline definitions in nf_conntrack_ecache.h.

18) Replace defined(CONFIG...) || defined(CONFIG...MODULE) with IS_ENABLED(CONFIG...).

19) Missing inline function conditional definitions based on Kconfig
    preferences in synproxy and nf_conntrack_timeout.

20) Update br_nf_pre_routing_ipv6() definition.

21) Move conntrack code in linux/skbuff.h to nf_conntrack headers.

22) Several patches to remove superfluous CONFIG_NETFILTER and
    CONFIG_NF_CONNTRACK checks in headers, coming from the initial batch
    support for CONFIG_HEADER_TEST for netfilter.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-13 14:22:43 +01:00
Jeremy Sowden
0d32e7048d netfilter: conntrack: remove two unused functions from nf_conntrack_timestamp.h.
Two inline functions defined in nf_conntrack_timestamp.h,
`nf_ct_tstamp_enabled` and `nf_ct_set_tstamp`, are not called anywhere.
Remove them.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13 12:48:09 +02:00
Jeremy Sowden
1f1475f38b netfilter: conntrack: remove CONFIG_NF_CONNTRACK checks from nf_conntrack_zones.h.
nf_conntrack_zones.h was wrapped in a CONFIG_NF_CONNTRACK check in order
to fix compilation failures:

  37ee3d5b3e ("netfilter: nf_defrag_ipv4: fix compilation error with NF_CONNTRACK=n")

Subsequent changes mean that these failures will no longer occur and the
check is unnecessary.  Remove it.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13 12:47:41 +02:00
Jeremy Sowden
f19438bdd4 netfilter: remove CONFIG_NETFILTER checks from headers.
`struct nf_hook_ops`, `struct nf_hook_state` and the `nf_hookfn`
function typedef appear in function and struct declarations and
definitions in a number of netfilter headers.  The structs and typedef
themselves are defined by linux/netfilter.h but only when
CONFIG_NETFILTER is enabled.  Define them unconditionally and add
forward declarations in order to remove CONFIG_NETFILTER conditionals
from the other headers.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13 12:47:36 +02:00
Jeremy Sowden
51a21be42a netfilter: conntrack: remove CONFIG_NF_CONNTRACK check from nf_conntrack_acct.h.
There is a superfluous `#if IS_ENABLED(CONFIG_NF_CONNTRACK)` check
wrapping some function declarations.  Remove it.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13 12:47:18 +02:00
Jeremy Sowden
261db6c2fb netfilter: conntrack: move code to linux/nf_conntrack_common.h.
Move some `struct nf_conntrack` code from linux/skbuff.h to
linux/nf_conntrack_common.h.  Together with a couple of helpers for
getting and setting skb->_nfct, it allows us to remove
CONFIG_NF_CONNTRACK checks from net/netfilter/nf_conntrack.h.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13 12:47:11 +02:00
Jeremy Sowden
f1815650b5 netfilter: br_netfilter: update stub br_nf_pre_routing_ipv6 parameter to void *priv.
The real br_nf_pre_routing_ipv6 function, defined when CONFIG_IPV6 is
enabled, expects `void *priv`, not `const struct nf_hook_ops *ops`.
Update the stub br_nf_pre_routing_ipv6, defined when CONFIG_IPV6 is
disabled, to match.

Fixes: 06198b34a3 ("netfilter: Pass priv instead of nf_hook_ops to netfilter hooks")
Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13 12:47:11 +02:00
Jeremy Sowden
22e81d7416 netfilter: conntrack: wrap two inline functions in config checks.
nf_conntrack_synproxy.h contains three inline functions.  The contents
of two of them are wrapped in CONFIG_NETFILTER_SYNPROXY checks and just
return NULL if it is not enabled.  The third does nothing if they return
NULL, so wrap its contents as well.

nf_ct_timeout_data is only called if CONFIG_NETFILTER_TIMEOUT is
enabled.  Wrap its contents in a CONFIG_NETFILTER_TIMEOUT check like the
other inline functions in nf_conntrack_timeout.h.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13 12:47:10 +02:00
Jeremy Sowden
25d7cbcd2b netfilter: replace defined(CONFIG...) || defined(CONFIG...MODULE) with IS_ENABLED(CONFIG...).
A few headers contain instances of:

  #if defined(CONFIG_XXX) or defined(CONFIG_XXX_MODULE)

Replace them with:

  #if IS_ENABLED(CONFIG_XXX)

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13 12:47:09 +02:00
Jeremy Sowden
16b26cde6f netfilter: conntrack: use consistent style when defining inline functions
The header contains some inline functions defined as:

  static inline f (...)
  {
  #ifdef CONFIG_NF_CONNTRACK_EVENTS
    ...
  #else
    ...
  #endif
  }

and a few others as:

  #ifdef CONFIG_NF_CONNTRACK_EVENTS
  static inline f (...)
  {
    ...
  }
  #else
  static inline f (...)
  {
    ...
  }
  #endif

Prefer the former style, which is more numerous.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13 12:46:25 +02:00
Jeremy Sowden
46705b070c netfilter: move nf_bridge_frag_data struct definition to a more appropriate header.
There is a struct definition function in nf_conntrack_bridge.h which is
not specific to conntrack and is used elswhere in netfilter.  Move it
into netfilter_bridge.h.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13 12:35:33 +02:00
Jeremy Sowden
e2f1cbb165 netfilter: synproxy: move code between headers.
There is some non-conntrack code in the nf_conntrack_synproxy.h header.
Move it to the nf_synproxy.h header.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13 12:34:43 +02:00
Jeremy Sowden
8bf3cbe32b netfilter: remove nf_conntrack_icmpv6.h header.
nf_conntrack_icmpv6.h contains two object macros which duplicate macros
in linux/icmpv6.h.  The latter definitions are also visible wherever it
is included, so remove it.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13 12:33:06 +02:00
Jeremy Sowden
40d102cde0 netfilter: update include directives.
Include some headers in files which require them, and remove others
which are not required.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13 12:33:06 +02:00
Jeremy Sowden
b0edba2af7 netfilter: fix coding-style errors.
Several header-files, Kconfig files and Makefiles have trailing
white-space.  Remove it.

In netfilter/Kconfig, indent the type of CONFIG_NETFILTER_NETLINK_ACCT
correctly.

There are semicolons at the end of two function definitions in
include/net/netfilter/nf_conntrack_acct.h and
include/net/netfilter/nf_conntrack_ecache.h. Remove them.

Fix indentation in nf_conntrack_l4proto.h.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13 11:39:38 +02:00
Jeremy Sowden
0286fbc624 netfilter: fix include guards.
nf_conntrack_labels.h has no include guard.  Add it.

The comment following the #endif in the nf_flow_table.h include guard
referred to the wrong macro.  Fix it.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13 11:39:38 +02:00
wenxu
06d392cbe3 netfilter: nf_tables_offload: remove rules when the device unregisters
If the net_device unregisters, clean up the offload rules before the
chain is destroy.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-13 10:58:10 +02:00
David S. Miller
c1b3ddf7c3 We have a number of changes, but things are settling down:
* a fix in the new 6 GHz channel support
  * a fix for recent minstrel (rate control) updates
    for an infinite loop
  * handle interface type changes better wrt. management frame
    registrations (for management frames sent to userspace)
  * add in-BSS RX time to survey information
  * handle HW rfkill properly if !CONFIG_RFKILL
  * send deauth on IBSS station expiry, to avoid state mismatches
  * handle deferred crypto tailroom updates in mac80211 better
    when device restart happens
  * fix a spectre-v1 - really a continuation of a previous patch
  * advertise NL80211_CMD_UPDATE_FT_IES as supported if so
  * add some missing parsing in VHT extended NSS support
  * support HE in mac80211_hwsim
  * let mac80211 drivers determine the max MTU themselves
 along with the usual cleanups etc.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl148l4ACgkQB8qZga/f
 l8TafQ//Yjunnxq49ClKMDxyKwxIBvLRAMr4D3QwaWAcWUpar1/V0Ft/5glnHtbZ
 7QptwbxVNUk/N68hYi8wlpMpvfzv/nShZD8QBNS1bk5E3Gng3yow03LhOx5iaYb2
 KJXS1GE4jwkMD7Xn65+eMeb8rt1vEj8LleX91cguilq+y5YbcNFsP1nil2RtQyBU
 cf5i8CBu4I5rTBoFaRvcz2xn+blqPSm2/piA+yXjzFp9vyVmhD+FjR5T482u48pj
 wi/1zersGVUzBNElnZOKg67XPir1fcJqCfLILr7okPItWuYXHdVHGfn+arhimK4W
 dyIpv1EfCe0lwKl4VTdhXt1GwhKvWCc2Ja7lz/RnDisGq9CYPJNfW7jFgR3tw4eg
 SccnUhnxRIgD1V2KDcvsRadPo+2YsBJBmC61JRX1K6L3zpHLbktDyzxvTwCeVQ9A
 TQp5bmQcKqDoq+/60JNHI6IxMbmX/vc2PC7dENGWUkem/JEmSWBB1wcL9gsRkhVi
 c4uHyvFeXJaIV7YFA36hWCfa0fr+UUYxfzxudRSvxq/tTpayqBKu6fr3Pvv0SqTj
 /BKkezoIdJntClhJv4PcDkZMva4uMCPtHCero9eICX5J+4AanD6caRNefX6L0PfF
 DtN1sscVOy9zY2fV4tqvn3IASmOE5yB/dxjtKkMYyNHkZsGiDI0=
 =HAuT
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2019-09-11' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
We have a number of changes, but things are settling down:
 * a fix in the new 6 GHz channel support
 * a fix for recent minstrel (rate control) updates
   for an infinite loop
 * handle interface type changes better wrt. management frame
   registrations (for management frames sent to userspace)
 * add in-BSS RX time to survey information
 * handle HW rfkill properly if !CONFIG_RFKILL
 * send deauth on IBSS station expiry, to avoid state mismatches
 * handle deferred crypto tailroom updates in mac80211 better
   when device restart happens
 * fix a spectre-v1 - really a continuation of a previous patch
 * advertise NL80211_CMD_UPDATE_FT_IES as supported if so
 * add some missing parsing in VHT extended NSS support
 * support HE in mac80211_hwsim
 * let mac80211 drivers determine the max MTU themselves
along with the usual cleanups etc.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-11 14:57:17 +01:00
Wen Gong
06354665f9 mac80211: allow drivers to set max MTU
Make it possibly for drivers to adjust the default max_mtu
by storing it in the hardware struct and using that value
for all interfaces.

Signed-off-by: Wen Gong <wgong@codeaurora.org>
Link: https://lore.kernel.org/r/1567738137-31748-1-git-send-email-wgong@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-09-11 09:33:29 +02:00
Pablo Neira Ayuso
be2861dc36 netfilter: nft_{fwd,dup}_netdev: add offload support
This patch adds support for packet mirroring and redirection. The
nft_fwd_dup_netdev_offload() function configures the flow_action object
for the fwd and the dup actions.

Extend nft_flow_rule_destroy() to release the net_device object when the
flow_rule object is released, since nft_fwd_dup_netdev_offload() bumps
the net_device reference counter.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: wenxu <wenxu@ucloud.cn>
2019-09-10 22:44:29 +02:00
Dirk van der Merwe
5bbd21df5a devlink: add 'reset_dev_on_drv_probe' param
Add the 'reset_dev_on_drv_probe' devlink parameter, controlling the
device reset policy on driver probe.

This parameter is useful in conjunction with the existing
'fw_load_policy' parameter.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-10 17:29:26 +01:00
Pablo Neira Ayuso
3474a2c62f netfilter: nf_tables_offload: move indirect flow_block callback logic to core
Add nft_offload_init() and nft_offload_exit() function to deal with the
init and the exit path of the offload infrastructure.

Rename nft_indr_block_get_and_ing_cmd() to nft_indr_block_cb().

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-08 19:18:04 +02:00
David S. Miller
fcd8c62709 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2019-09-06

Here's the main bluetooth-next pull request for the 5.4 kernel.

 - Cleanups & fixes to btrtl driver
 - Fixes for Realtek devices in btusb, e.g. for suspend handling
 - Firmware loading support for BCM4345C5
 - hidp_send_message() return value handling fixes
 - Added support for utilizing Fast Advertising Interval
 - Various other minor cleanups & fixes

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-07 18:07:27 +02:00
Jiri Pirko
3dd97a0827 net: fib_notifier: move fib_notifier_ops from struct net into per-net struct
No need for fib_notifier_ops to be in struct net. It is used only by
fib_notifier as a private data. Use net_generic to introduce per-net
fib_notifier struct and move fib_notifier_ops there.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-07 17:28:22 +02:00
David S. Miller
b8f6a0eeb9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Add nft_reg_store64() and nft_reg_load64() helpers, from Ander Juaristi.

2) Time matching support, also from Ander Juaristi.

3) VLAN support for nfnetlink_log, from Michael Braun.

4) Support for set element deletions from the packet path, also from Ander.

5) Remove __read_mostly from conntrack spinlock, from Li RongQing.

6) Support for updating stateful objects, this also includes the initial
   client for this infrastructure: the quota extension. A follow up fix
   for the control plane also comes in this batch. Patches from
   Fernando Fernandez Mancera.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-07 16:31:30 +02:00
David S. Miller
1e46c09ec1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add the ability to use unaligned chunks in the AF_XDP umem. By
   relaxing where the chunks can be placed, it allows to use an
   arbitrary buffer size and place whenever there is a free
   address in the umem. Helps more seamless DPDK AF_XDP driver
   integration. Support for i40e, ixgbe and mlx5e, from Kevin and
   Maxim.

2) Addition of a wakeup flag for AF_XDP tx and fill rings so the
   application can wake up the kernel for rx/tx processing which
   avoids busy-spinning of the latter, useful when app and driver
   is located on the same core. Support for i40e, ixgbe and mlx5e,
   from Magnus and Maxim.

3) bpftool fixes for printf()-like functions so compiler can actually
   enforce checks, bpftool build system improvements for custom output
   directories, and addition of 'bpftool map freeze' command, from Quentin.

4) Support attaching/detaching XDP programs from 'bpftool net' command,
   from Daniel.

5) Automatic xskmap cleanup when AF_XDP socket is released, and several
   barrier/{read,write}_once fixes in AF_XDP code, from Björn.

6) Relicense of bpf_helpers.h/bpf_endian.h for future libbpf
   inclusion as well as libbpf versioning improvements, from Andrii.

7) Several new BPF kselftests for verifier precision tracking, from Alexei.

8) Several BPF kselftest fixes wrt endianess to run on s390x, from Ilya.

9) And more BPF kselftest improvements all over the place, from Stanislav.

10) Add simple BPF map op cache for nfp driver to batch dumps, from Jakub.

11) AF_XDP socket umem mapping improvements for 32bit archs, from Ivan.

12) Add BPF-to-BPF call and BTF line info support for s390x JIT, from Yauheni.

13) Small optimization in arm64 JIT to spare 1 insns for BPF_MOD, from Jerin.

14) Fix an error check in bpf_tcp_gen_syncookie() helper, from Petar.

15) Various minor fixes and cleanups, from Nathan, Masahiro, Masanari,
    Peter, Wei, Yue.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-06 16:49:17 +02:00
David S. Miller
2e9550ed67 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2019-09-05

1) Several xfrm interface fixes from Nicolas Dichtel:
   - Avoid an interface ID corruption on changelink.
   - Fix wrong intterface names in the logs.
   - Fix a list corruption when changing network namespaces.
   - Fix unregistation of the underying phydev.

2) Fix a potential warning when merging xfrm_plocy nodes.
   From Florian Westphal.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-06 15:09:16 +02:00
Spoorthi Ravishankar Koppad
ad4a6795e0 Bluetooth: Add support for utilizing Fast Advertising Interval
Changes made to add support for fast advertising interval
as per core 4.1 specification, section 9.3.11.2.

A peripheral device entering any of the following GAP modes and
sending either non-connectable advertising events or scannable
undirected advertising events should use adv_fast_interval2
(100ms - 150ms) for adv_fast_period(30s).

         - Non-Discoverable Mode
         - Non-Connectable Mode
         - Limited Discoverable Mode
         - General Discoverable Mode

Signed-off-by: Spoorthi Ravishankar Koppad <spoorthix.k@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2019-09-05 17:27:21 +02:00
Donald Sharp
7bdf4de126 net: Properly update v4 routes with v6 nexthop
When creating a v4 route that uses a v6 nexthop from a nexthop group.
Allow the kernel to properly send the nexthop as v6 via the RTA_VIA
attribute.

Broken behavior:

$ ip nexthop add via fe80::9 dev eth0
$ ip nexthop show
id 1 via fe80::9 dev eth0 scope link
$ ip route add 4.5.6.7/32 nhid 1
$ ip route show
default via 10.0.2.2 dev eth0
4.5.6.7 nhid 1 via 254.128.0.0 dev eth0
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
$

Fixed behavior:

$ ip nexthop add via fe80::9 dev eth0
$ ip nexthop show
id 1 via fe80::9 dev eth0 scope link
$ ip route add 4.5.6.7/32 nhid 1
$ ip route show
default via 10.0.2.2 dev eth0
4.5.6.7 nhid 1 via inet6 fe80::9 dev eth0
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15
$

v2, v3: Addresses code review comments from David Ahern

Fixes: dcb1ecb50e (“ipv4: Prepare for fib6_nh from a nexthop object”)
Signed-off-by: Donald Sharp <sharpd@cumulusnetworks.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-05 12:35:58 +02:00
David S. Miller
44c40910b6 linux-can-next-for-5.4-20190904
-----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCgAxFiEEmvEkXzgOfc881GuFWsYho5HknSAFAl1vrJITHG1rbEBwZW5n
 dXRyb25peC5kZQAKCRBaxiGjkeSdIC8BB/98XcWiaInD+SM6UjD2dVd1r0zhPKJS
 WBK58G81+op3YP4DY8Iy+C24uZBlSlutVGoD/PIrZF39xXsnOtJuMVHC4LvtdADC
 30uI/61JQNEjuX2AiTFudqDvYjZZKZ28HLqEnO2pWk3dMVL3+fkS3i7VQR7KJ/Gr
 BYM6EzCdkbuWW/zsAVbKLJ8NswVmcdjP7eSK+exKppoWMtgCglZw1X6iP5YXDnbK
 h3dGs687u8RfUra7j7vgnJzyQU4draMPsabaLDT5qw1PgYQ3k8MTVMBlULR0+HHO
 qkBqumRwfOxay0z0XOgRuWrICKTH/b0SRLp3H53ZyfDo6+4TC9KGHRgX
 =gwfZ
 -----END PGP SIGNATURE-----

Merge tag 'linux-can-next-for-5.4-20190904' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can-next

Marc Kleine-Budde says:

====================
pull-request: can-next 2019-09-04 j1939

this is a pull request for net-next/master consisting of 21 patches.

the first 12 patches are by me and target the CAN core infrastructure.
They clean up the names of variables , structs and struct members,
convert can_rx_register() to use max() instead of open coding it and
remove unneeded code from the can_pernet_exit() callback.

The next three patches are also by me and they introduce and make use of
the CAN midlayer private structure. It is used to hold protocol specific
per device data structures.

The next patch is by Oleksij Rempel, switches the
&net->can.rcvlists_lock from a spin_lock() to a spin_lock_bh(), so that
it can be used from NAPI (soft IRQ) context.

The next 4 patches are by Kurt Van Dijck, he first updates his email
address via mailmap and then extends sockaddr_can to include j1939
members.

The final patch is the collective effort of many entities (The j1939
authors: Oliver Hartkopp, Bastian Stender, Elenita Hinds, kbuild test
robot, Kurt Van Dijck, Maxime Jayat, Robin van der Gracht, Oleksij
Rempel, Marc Kleine-Budde). It adds support of SAE J1939 protocol to the
CAN networking stack.

SAE J1939 is the vehicle bus recommended practice used for communication
and diagnostics among vehicle components. Originating in the car and
heavy-duty truck industry in the United States, it is now widely used in
other parts of the world.

P.S.: This pull request doesn't invalidate my last pull request:
      "pull-request: can-next 2019-09-03".
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-05 12:17:50 +02:00
Jakub Kicinski
be2fbc155f net/tls: clean up the number of #ifdefs for CONFIG_TLS_DEVICE
TLS code has a number of #ifdefs which make the code a little
harder to follow. Recent fixes removed the ifdef around the
TLS_HW define, so we can switch to the often used pattern
of defining tls_device functions as empty static inlines
in the header when CONFIG_TLS_DEVICE=n.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-05 09:49:49 +02:00
Jakub Kicinski
be7bbea114 net/tls: use the full sk_proto pointer
Since we already have the pointer to the full original sk_proto
stored use that instead of storing all individual callback
pointers as well.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-05 09:49:49 +02:00
Dave Taht
842841ece5 Convert usage of IN_MULTICAST to ipv4_is_multicast
IN_MULTICAST's primary intent is as a uapi macro.

Elsewhere in the kernel we use ipv4_is_multicast consistently.

This patch unifies linux's multicast checks to use that function
rather than this macro.

Signed-off-by: Dave Taht <dave.taht@gmail.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-05 09:38:32 +02:00
Shannon Nelson
7d5aa9a524 devlink: Add new info version tags for ASIC and FW
The current tag set is still rather small and needs a couple
more tags to help with ASIC identification and to have a
more generic FW version.

Cc: Jiri Pirko <jiri@resnulli.us>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Shannon Nelson <snelson@pensando.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-05 09:24:43 +02:00
Marc Kleine-Budde
564577dfee can: netns: remove "can_" prefix from members struct netns_can
This patch improves the code reability by removing the redundant "can_"
prefix from the members of struct netns_can (as the struct netns_can itself
is the member "can" of the struct net.)

The conversion is done with:

	sed -i \
		-e "s/struct can_dev_rcv_lists \*can_rx_alldev_list;/struct can_dev_rcv_lists *rx_alldev_list;/" \
		-e "s/spinlock_t can_rcvlists_lock;/spinlock_t rcvlists_lock;/" \
		-e "s/struct timer_list can_stattimer;/struct timer_list stattimer; /" \
		-e "s/can\.can_rx_alldev_list/can.rx_alldev_list/g" \
		-e "s/can\.can_rcvlists_lock/can.rcvlists_lock/g" \
		-e "s/can\.can_stattimer/can.stattimer/g" \
		include/net/netns/can.h \
		net/can/*.[ch]

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-09-04 13:29:14 +02:00
Marc Kleine-Budde
2341086df4 can: netns: give members of struct netns_can holding the statistics a sensible name
This patch gives the members of the struct netns_can that are holding
the statistics a sensible name, by renaming struct netns_can::can_stats
into struct netns_can::pkg_stats and struct netns_can::can_pstats into
struct netns_can::rcv_lists_stats.

The conversion is done with:

	sed -i \
		-e "s:\(struct[^*]*\*\)can_stats;.*:\1pkg_stats;:" \
		-e "s:\(struct[^*]*\*\)can_pstats;.*:\1rcv_lists_stats;:" \
		-e "s/can\.can_stats/can.pkg_stats/g" \
		-e "s/can\.can_pstats/can.rcv_lists_stats/g" \
		net/can/*.[ch] \
		include/net/netns/can.h

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-09-04 13:29:13 +02:00
Marc Kleine-Budde
6c43bb3a41 can: netns: give structs holding the CAN statistics a sensible name
This patch renames both "struct s_stats" and "struct s_pstats", to
"struct can_pkg_stats" and "struct can_rcv_lists_stats" to better
reflect their meaning and improve code readability.

The conversion is done with:

	sed -i \
		-e "s/struct s_stats/struct can_pkg_stats/g" \
		-e "s/struct s_pstats/struct can_rcv_lists_stats/g" \
		net/can/*.[ch] \
		include/net/netns/can.h

Signed-off-by: Oleksij Rempel <o.rempel@pengutronix.de>
Acked-by: Oliver Hartkopp <socketcan@hartkopp.net>
Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
2019-09-04 13:29:13 +02:00
Fernando Fernandez Mancera
d62d0ba97b netfilter: nf_tables: Introduce stateful object update operation
This patch adds the infrastructure needed for the stateful object update
support.

Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-09-03 19:01:25 +02:00
David S. Miller
765b7590c9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
r8152 conflicts are the NAPI fixes in 'net' overlapping with
some tasklet stuff in net-next

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-09-02 11:20:17 -07:00
Parav Pandit
c7282b501f devlink: Make port index data type as unsigned int
Devlink port index attribute is returned to users as u32 through
netlink response.
Change index data type from 'unsigned' to 'unsigned int' to avoid
below checkpatch.pl warning.

WARNING: Prefer 'unsigned int' to bare use of 'unsigned'
81: FILE: include/net/devlink.h:81:
+       unsigned index;

Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-31 23:46:13 -07:00
Davide Caratti
26811cc9f5 net: tls: export protocol version, cipher, tx_conf/rx_conf to socket diag
When an application configures kernel TLS on top of a TCP socket, it's
now possible for inet_diag_handler() to collect information regarding the
protocol version, the cipher type and TX / RX configuration, in case
INET_DIAG_INFO is requested.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-31 23:44:28 -07:00
Davide Caratti
61723b3932 tcp: ulp: add functions to dump ulp-specific information
currently, only getsockopt(TCP_ULP) can be invoked to know if a ULP is on
top of a TCP socket. Extend idiag_get_aux() and idiag_get_aux_size(),
introduced by commit b37e88407c ("inet_diag: allow protocols to provide
additional data"), to report the ULP name and other information that can
be made available by the ULP through optional functions.

Users having CAP_NET_ADMIN privileges will then be able to retrieve this
information through inet_diag_handler, if they specify INET_DIAG_INFO in
the request.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-31 23:44:28 -07:00
Jakub Kicinski
15a7dea750 net/tls: use RCU protection on icsk->icsk_ulp_data
We need to make sure context does not get freed while diag
code is interrogating it. Free struct tls_context with
kfree_rcu().

We add the __rcu annotation directly in icsk, and cast it
away in the datapath accessor. Presumably all ULPs will
do a similar thing.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-31 23:44:28 -07:00
Denis Efremov
974ceb21fc udp: Remove unlikely() from IS_ERR*() condition
"unlikely(IS_ERR_OR_NULL(x))" is excessive. IS_ERR_OR_NULL() already uses
unlikely() internally.

Signed-off-by: Denis Efremov <efremov@linux.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Joe Perches <joe@perches.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-30 19:49:37 -07:00
Kevin Laatz
c05cd36458 xsk: add support to allow unaligned chunk placement
Currently, addresses are chunk size aligned. This means, we are very
restricted in terms of where we can place chunk within the umem. For
example, if we have a chunk size of 2k, then our chunks can only be placed
at 0,2k,4k,6k,8k... and so on (ie. every 2k starting from 0).

This patch introduces the ability to use unaligned chunks. With these
changes, we are no longer bound to having to place chunks at a 2k (or
whatever your chunk size is) interval. Since we are no longer dealing with
aligned chunks, they can now cross page boundaries. Checks for page
contiguity have been added in order to keep track of which pages are
followed by a physically contiguous page.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-31 01:08:26 +02:00
Felix Fietkau
c8cd6e7f15 cfg80211: add local BSS receive time to survey information
This is useful for checking how much airtime is being used up by other
transmissions on the channel, e.g. by calculating (time_rx - time_bss_rx)
or (time_busy - time_bss_rx - time_tx)

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://lore.kernel.org/r/20190828102042.58016-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-08-30 12:28:44 +02:00
Vlad Buslov
dbf47a2a09 net: sched: act_sample: fix psample group handling on overwrite
Action sample doesn't properly handle psample_group pointer in overwrite
case. Following issues need to be fixed:

- In tcf_sample_init() function RCU_INIT_POINTER() is used to set
  s->psample_group, even though we neither setting the pointer to NULL, nor
  preventing concurrent readers from accessing the pointer in some way.
  Use rcu_swap_protected() instead to safely reset the pointer.

- Old value of s->psample_group is not released or deallocated in any way,
  which results resource leak. Use psample_group_put() on non-NULL value
  obtained with rcu_swap_protected().

- The function psample_group_put() that released reference to struct
  psample_group pointed by rcu-pointer s->psample_group doesn't respect rcu
  grace period when deallocating it. Extend struct psample_group with rcu
  head and use kfree_rcu when freeing it.

Fixes: 5c5670fae4 ("net/sched: Introduce sample tc action")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-28 15:53:51 -07:00
Eric Dumazet
14105c191e ipv6: shrink struct ipv6_mc_socklist
Remove two holes on 64bit arches, to bring the size
to one cache line exactly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-28 14:43:03 -07:00
Xin Long
1b0b8114b9 sctp: make ecn flag per netns and endpoint
This patch is to add ecn flag for both netns_sctp and sctp_endpoint,
net->sctp.ecn_enable is set 1 by default, and ep->ecn_enable will
be initialized with net->sctp.ecn_enable.

asoc->peer.ecn_capable will be set during negotiation only when
ep->ecn_enable is set on both sides.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-27 20:54:14 -07:00
Vivien Didelot
e65d45cc35 net: dsa: remove bitmap operations
The bitmap operations were introduced to simplify the switch drivers
in the future, since most of them could implement the common VLAN and
MDB operations (add, del, dump) with simple functions taking all target
ports at once, and thus limiting the number of hardware accesses.

Programming an MDB or VLAN this way in a single operation would clearly
simplify the drivers a lot but would require a new get-set interface
in DSA. The usage of such bitmap from the stack also raised concerned
in the past, leading to the dynamic allocation of a new ds->_bitmap
member in the dsa_switch structure. So let's get rid of them for now.

This commit nicely wraps the ds->ops->port_{mdb,vlan}_{prepare,add}
switch operations into new dsa_switch_{mdb,vlan}_{prepare,add}
variants not using any bitmap argument anymore.

New dsa_switch_{mdb,vlan}_match helpers have been introduced to make
clear which local port of a switch must be programmed with the target
object. While the targeted user port is an obvious candidate, the
DSA links must also be programmed, as well as the CPU port for VLANs.

While at it, also remove local variables that are only used once.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-27 20:17:27 -07:00
Cong Wang
981471bd3a net_sched: fix a NULL pointer deref in ipt action
The net pointer in struct xt_tgdtor_param is not explicitly
initialized therefore is still NULL when dereferencing it.
So we have to find a way to pass the correct net pointer to
ipt_destroy_target().

The best way I find is just saving the net pointer inside the per
netns struct tcf_idrinfo, which could make this patch smaller.

Fixes: 0c66dc1ea3 ("netfilter: conntrack: register hooks in netns when needed by ruleset")
Reported-and-tested-by: itugrok@yahoo.com
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-27 15:05:58 -07:00
David S. Miller
68aaf44595 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Minor conflict in r8169, bug fix had two versions in net
and net-next, take the net-next hunks.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-27 14:23:31 -07:00
Ander Juaristi
d0a8d877da netfilter: nft_dynset: support for element deletion
This patch implements the delete operation from the ruleset.

It implements a new delete() function in nft_set_rhash. It is simpler
to use than the already existing remove(), because it only takes the set
and the key as arguments, whereas remove() expects a full
nft_set_elem structure.

Signed-off-by: Ander Juaristi <a@juaristi.eus>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-27 17:27:08 +02:00
Vlad Buslov
1444c175a3 net: sched: copy tunnel info when setting flow_action entry->tunnel
In order to remove dependency on rtnl lock, modify tc_setup_flow_action()
to copy tunnel info, instead of just saving pointer to tunnel_key action
tunnel info. This is necessary to prevent concurrent action overwrite from
releasing tunnel info while it is being used by rtnl-unlocked driver.

Implement helper tcf_tunnel_info_copy() that is used to copy tunnel info
with all its options to dynamically allocated memory block. Modify
tc_cleanup_flow_action() to free dynamically allocated tunnel info.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26 14:17:43 -07:00
Vlad Buslov
5a6ff4b13d net: sched: take reference to action dev before calling offloads
In order to remove dependency on rtnl lock when calling hardware offload
API, take reference to action mirred dev when initializing flow_action
structure in tc_setup_flow_action(). Implement function
tc_cleanup_flow_action(), use it to release the device after hardware
offload API is done using it.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26 14:17:43 -07:00
Vlad Buslov
9838b20a7f net: sched: take rtnl lock in tc_setup_flow_action()
In order to allow using new flow_action infrastructure from unlocked
classifiers, modify tc_setup_flow_action() to accept new 'rtnl_held'
argument. Take rtnl lock before accessing tc_action data. This is necessary
to protect from concurrent action replace.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26 14:17:43 -07:00
Vlad Buslov
c9f14470d0 net: sched: add API for registering unlocked offload block callbacks
Extend struct flow_block_offload with "unlocked_driver_cb" flag to allow
registering and unregistering block hardware offload callbacks that do not
require caller to hold rtnl lock. Extend tcf_block with additional
lockeddevcnt counter that is incremented for each non-unlocked driver
callback attached to device. This counter is necessary to conditionally
obtain rtnl lock before calling hardware callbacks in following patches.

Register mlx5 tc block offload callbacks as "unlocked".

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26 14:17:43 -07:00
Vlad Buslov
a449a3e77a net: sched: notify classifier on successful offload add/delete
To remove dependency on rtnl lock, extend classifier ops with new
ops->hw_add() and ops->hw_del() callbacks. Call them from cls API while
holding cb_lock every time filter if successfully added to or deleted from
hardware.

Implement the new API in flower classifier. Use it to manage hw_filters
list under cb_lock protection, instead of relying on rtnl lock to
synchronize with concurrent fl_reoffload() call.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26 14:17:43 -07:00
Vlad Buslov
4011921137 net: sched: refactor block offloads counter usage
Without rtnl lock protection filters can no longer safely manage block
offloads counter themselves. Refactor cls API to protect block offloadcnt
with tcf_block->cb_lock that is already used to protect driver callback
list and nooffloaddevcnt counter. The counter can be modified by concurrent
tasks by new functions that execute block callbacks (which is safe with
previous patch that changed its type to atomic_t), however, block
bind/unbind code that checks the counter value takes cb_lock in write mode
to exclude any concurrent modifications. This approach prevents race
conditions between bind/unbind and callback execution code but allows for
concurrency for tc rule update path.

Move block offload counter, filter in hardware counter and filter flags
management from classifiers into cls hardware offloads API. Make functions
tcf_block_offload_{inc|dec}() and tc_cls_offload_cnt_update() to be cls API
private. Implement following new cls API to be used instead:

  tc_setup_cb_add() - non-destructive filter add. If filter that wasn't
  already in hardware is successfully offloaded, increment block offloads
  counter, set filter in hardware counter and flag. On failure, previously
  offloaded filter is considered to be intact and offloads counter is not
  decremented.

  tc_setup_cb_replace() - destructive filter replace. Release existing
  filter block offload counter and reset its in hardware counter and flag.
  Set new filter in hardware counter and flag. On failure, previously
  offloaded filter is considered to be destroyed and offload counter is
  decremented.

  tc_setup_cb_destroy() - filter destroy. Unconditionally decrement block
  offloads counter.

  tc_setup_cb_reoffload() - reoffload filter to single cb. Execute cb() and
  call tc_cls_offload_cnt_update() if cb() didn't return an error.

Refactor all offload-capable classifiers to atomically offload filters to
hardware, change block offload counter, and set filter in hardware counter
and flag by means of the new cls API functions.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26 14:17:43 -07:00
Vlad Buslov
97394bef56 net: sched: change tcf block offload counter type to atomic_t
As a preparation for running proto ops functions without rtnl lock, change
offload counter type to atomic. This is necessary to allow updating the
counter by multiple concurrent users when offloading filters to hardware
from unlocked classifiers.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26 14:17:43 -07:00
Vlad Buslov
4f8116c850 net: sched: protect block offload-related fields with rw_semaphore
In order to remove dependency on rtnl lock, extend tcf_block with 'cb_lock'
rwsem and use it to protect flow_block->cb_list and related counters from
concurrent modification. The lock is taken in read mode for read-only
traversal of cb_list in tc_setup_cb_call() and write mode in all other
cases. This approach ensures that:

- cb_list is not changed concurrently while filters is being offloaded on
  block.

- block->nooffloaddevcnt is checked while holding the lock in read mode,
  but is only changed by bind/unbind code when holding the cb_lock in write
  mode to prevent concurrent modification.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-26 14:17:43 -07:00
Ander Juaristi
a1b840adaf netfilter: nf_tables: Introduce new 64-bit helper register functions
Introduce new helper functions to load/store 64-bit values onto/from
registers:

 - nft_reg_store64
 - nft_reg_load64

This commit also re-orders all these helpers from smallest to largest
target bit size.

Signed-off-by: Ander Juaristi <a@juaristi.eus>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-26 11:01:00 +02:00
David Ahern
9b5f684182 nexthop: Fix nexthop_num_path for blackhole nexthops
Donald reported this sequence:
  ip next add id 1 blackhole
  ip next add id 2 blackhole
  ip ro add 1.1.1.1/32 nhid 1
  ip ro add 1.1.1.2/32 nhid 2

would cause a crash. Backtrace is:

[  151.302790] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  151.304043] CPU: 1 PID: 277 Comm: ip Not tainted 5.3.0-rc5+ #37
[  151.305078] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1 04/01/2014
[  151.306526] RIP: 0010:fib_add_nexthop+0x8b/0x2aa
[  151.307343] Code: 35 f7 81 48 8d 14 01 c7 02 f1 f1 f1 f1 c7 42 04 01 f4 f4 f4 48 89 f2 48 c1 ea 03 65 48 8b 0c 25 28 00 00 00 48 89 4d d0 31 c9 <80> 3c 02 00 74 08 48 89 f7 e8 1a e8 53 ff be 08 00 00 00 4c 89 e7
[  151.310549] RSP: 0018:ffff888116c27340 EFLAGS: 00010246
[  151.311469] RAX: dffffc0000000000 RBX: ffff8881154ece00 RCX: 0000000000000000
[  151.312713] RDX: 0000000000000004 RSI: 0000000000000020 RDI: ffff888115649b40
[  151.313968] RBP: ffff888116c273d8 R08: ffffed10221e3757 R09: ffff888110f1bab8
[  151.315212] R10: 0000000000000001 R11: ffff888110f1bab3 R12: ffff888115649b40
[  151.316456] R13: 0000000000000020 R14: ffff888116c273b0 R15: ffff888115649b40
[  151.317707] FS:  00007f60b4d8d800(0000) GS:ffff88811ac00000(0000) knlGS:0000000000000000
[  151.319113] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  151.320119] CR2: 0000555671ffdc00 CR3: 00000001136ba005 CR4: 0000000000020ee0
[  151.321367] Call Trace:
[  151.321820]  ? fib_nexthop_info+0x635/0x635
[  151.322572]  fib_dump_info+0xaa4/0xde0
[  151.323247]  ? fib_create_info+0x2431/0x2431
[  151.324008]  ? napi_alloc_frag+0x2a/0x2a
[  151.324711]  rtmsg_fib+0x2c4/0x3be
[  151.325339]  fib_table_insert+0xe2f/0xeee
...

fib_dump_info incorrectly has nhs = 0 for blackhole nexthops, so it
believes the nexthop object is a multipath group (nhs != 1) and ends
up down the nexthop_mpath_fill_node() path which is wrong for a
blackhole.

The blackhole check in nexthop_num_path is leftover from early days
of the blackhole implementation which did not initialize the device.
In the end the design was simpler (fewer special case checks) to set
the device to loopback in nh_info, so the check in nexthop_num_path
should have been removed.

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Reported-by: Donald Sharp <sharpd@cumulusnetworks.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-25 14:29:10 -07:00
John Fastabend
e93fb3e952 net: route dump netlink NLM_F_MULTI flag missing
An excerpt from netlink(7) man page,

  In multipart messages (multiple nlmsghdr headers with associated payload
  in one byte stream) the first and all following headers have the
  NLM_F_MULTI flag set, except for the last  header  which  has the type
  NLMSG_DONE.

but, after (ee28906) there is a missing NLM_F_MULTI flag in the middle of a
FIB dump. The result is user space applications following above man page
excerpt may get confused and may stop parsing msg believing something went
wrong.

In the golang netlink lib [0] the library logic stops parsing believing the
message is not a multipart message. Found this running Cilium[1] against
net-next while adding a feature to auto-detect routes. I noticed with
multiple route tables we no longer could detect the default routes on net
tree kernels because the library logic was not returning them.

Fix this by handling the fib_dump_info_fnhe() case the same way the
fib_dump_info() handles it by passing the flags argument through the
call chain and adding a flags argument to rt_fill_info().

Tested with Cilium stack and auto-detection of routes works again. Also
annotated libs to dump netlink msgs and inspected NLM_F_MULTI and
NLMSG_DONE flags look correct after this.

Note: In inet_rtm_getroute() pass rt_fill_info() '0' for flags the same
as is done for fib_dump_info() so this looks correct to me.

[0] https://github.com/vishvananda/netlink/
[1] https://github.com/cilium/

Fixes: ee28906fd7 ("ipv4: Dump route exceptions if requested")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-24 16:49:48 -07:00
Dmytro Linkin
7a978759b4 net/mlx5e: Add tc flower tracepoints
Implemented following tracepoints:
1. Configure flower (mlx5e_configure_flower)
2. Delete flower (mlx5e_delete_flower)
3. Stats flower (mlx5e_stats_flower)

Usage example:
 ># cd /sys/kernel/debug/tracing
 ># echo mlx5:mlx5e_configure_flower >> set_event
 ># cat trace
    ...
    tc-6535  [019] ...1  2672.404466: mlx5e_configure_flower: cookie=0000000067874a55 actions= REDIRECT

Added corresponding documentation in
    Documentation/networking/device-driver/mellanox/mlx5.rst

Signed-off-by: Dmytro Linkin <dmitrolin@mellanox.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-08-21 15:55:17 -07:00
Mike Rapoport
aad12c2394 trivial: netns: fix typo in 'struct net.passive' description
Replace 'decided' with 'decide' so that comment would be

/* To decide when the network namespace should be freed. */

Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-21 13:07:29 -07:00
Alexei Avshalom Lazar
2a38075cd0 nl80211: Add support for EDMG channels
802.11ay specification defines Enhanced Directional Multi-Gigabit
(EDMG) STA and AP which allow channel bonding of 2 channels and more.

Introduce new NL attributes that are needed for enabling and
configuring EDMG support.

Two new attributes are used by kernel to publish driver's EDMG
capabilities to the userspace:
NL80211_BAND_ATTR_EDMG_CHANNELS - bitmap field that indicates the 2.16
GHz channel(s) that are supported by the driver.
When this attribute is not set it means driver does not support EDMG.
NL80211_BAND_ATTR_EDMG_BW_CONFIG - represent the channel bandwidth
configurations supported by the driver.

Additional two new attributes are used by the userspace for connect
command and for AP configuration:
NL80211_ATTR_WIPHY_EDMG_CHANNELS
NL80211_ATTR_WIPHY_EDMG_BW_CONFIG

New rate info flag - RATE_INFO_FLAGS_EDMG, can be reported from driver
and used for bitrate calculation that will take into account EDMG
according to the 802.11ay specification.

Signed-off-by: Alexei Avshalom Lazar <ailizaro@codeaurora.org>
Link: https://lore.kernel.org/r/1566138918-3823-2-git-send-email-ailizaro@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-08-21 11:07:35 +02:00
Ben Greear
6c7a00339e cfg80211: Support assoc-at timer in sta-info
Report timestamp of when sta became associated.

This is the boottime clock, units are nano-seconds.

Signed-off-by: Ben Greear <greearb@candelatech.com>
Link: https://lore.kernel.org/r/20190809180001.26393-1-greearb@candelatech.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-08-21 10:56:42 +02:00
Xin Long
03f961270f sctp: add sctp_auth_init and sctp_auth_free
This patch is to factor out sctp_auth_init and sctp_auth_free
functions, and sctp_auth_init will also be used in the next
patch for SCTP_AUTH_SUPPORTED sockopt.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-19 18:27:29 -07:00
Xin Long
4e27428fb5 sctp: add asconf_enable in struct sctp_endpoint
This patch is to make addip/asconf flag per endpoint,
and its value is initialized by the per netns flag,
net->sctp.addip_enable.

It also replaces the checks of net->sctp.addip_enable
with ep->asconf_enable in some places.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-19 18:27:28 -07:00
Stefano Brivio
3a7ef457e8 ipv6: Fix return value of ipv6_mc_may_pull() for malformed packets
Commit ba5ea61462 ("bridge: simplify ip_mc_check_igmp() and
ipv6_mc_check_mld() calls") replaces direct calls to pskb_may_pull()
in br_ipv6_multicast_mld2_report() with calls to ipv6_mc_may_pull(),
that returns -EINVAL on buffers too short to be valid IPv6 packets,
while maintaining the previous handling of the return code.

This leads to the direct opposite of the intended effect: if the
packet is malformed, -EINVAL evaluates as true, and we'll happily
proceed with the processing.

Return 0 if the packet is too short, in the same way as this was
fixed for IPv4 by commit 083b78a9ed ("ip: fix ip_mc_may_pull()
return value").

I don't have a reproducer for this, unlike the one referred to by
the IPv4 commit, but this is clearly broken.

Fixes: ba5ea61462 ("bridge: simplify ip_mc_check_igmp() and ipv6_mc_check_mld() calls")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Acked-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-19 17:19:46 -07:00
David S. Miller
446bf64b61 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Merge conflict of mlx5 resolved using instructions in merge
commit 9566e650bf.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-19 11:54:03 -07:00
Pablo Neira Ayuso
3bc158f8d0 netfilter: nf_tables: map basechain priority to hardware priority
This patch adds initial support for offloading basechains using the
priority range from 1 to 65535. This is restricting the netfilter
priority range to 16-bit integer since this is what most drivers assume
so far from tc. It should be possible to extend this range of supported
priorities later on once drivers are updated to support for 32-bit
integer priorities.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18 14:13:23 -07:00
Pablo Neira Ayuso
ef01adae0e net: sched: use major priority number as hardware priority
tc transparently maps the software priority number to hardware. Update
it to pass the major priority which is what most drivers expect. Update
drivers too so they do not need to lshift the priority field of the
flow_cls_common_offload object. The stmmac driver is an exception, since
this code assumes the tc software priority is fine, therefore, lshift it
just to be conservative.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-18 14:13:23 -07:00
Björn Töpel
0402acd683 xsk: remove AF_XDP socket from map when the socket is released
When an AF_XDP socket is released/closed the XSKMAP still holds a
reference to the socket in a "released" state. The socket will still
use the netdev queue resource, and block newly created sockets from
attaching to that queue, but no user application can access the
fill/complete/rx/tx queues. This results in that all applications need
to explicitly clear the map entry from the old "zombie state"
socket. This should be done automatically.

In this patch, the sockets tracks, and have a reference to, which maps
it resides in. When the socket is released, it will remove itself from
all maps.

Suggested-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17 23:24:45 +02:00
Stanislav Fomichev
8f51dfc73b bpf: support cloning sk storage on accept()
Add new helper bpf_sk_storage_clone which optionally clones sk storage
and call it from sk_clone_lock.

Cc: Martin KaFai Lau <kafai@fb.com>
Cc: Yonghong Song <yhs@fb.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Yonghong Song <yhs@fb.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17 23:18:54 +02:00
Magnus Karlsson
77cd0d7b3f xsk: add support for need_wakeup flag in AF_XDP rings
This commit adds support for a new flag called need_wakeup in the
AF_XDP Tx and fill rings. When this flag is set, it means that the
application has to explicitly wake up the kernel Rx (for the bit in
the fill ring) or kernel Tx (for bit in the Tx ring) processing by
issuing a syscall. Poll() can wake up both depending on the flags
submitted and sendto() will wake up tx processing only.

The main reason for introducing this new flag is to be able to
efficiently support the case when application and driver is executing
on the same core. Previously, the driver was just busy-spinning on the
fill ring if it ran out of buffers in the HW and there were none on
the fill ring. This approach works when the application is running on
another core as it can replenish the fill ring while the driver is
busy-spinning. Though, this is a lousy approach if both of them are
running on the same core as the probability of the fill ring getting
more entries when the driver is busy-spinning is zero. With this new
feature the driver now sets the need_wakeup flag and returns to the
application. The application can then replenish the fill queue and
then explicitly wake up the Rx processing in the kernel using the
syscall poll(). For Tx, the flag is only set to one if the driver has
no outstanding Tx completion interrupts. If it has some, the flag is
zero as it will be woken up by a completion interrupt anyway.

As a nice side effect, this new flag also improves the performance of
the case where application and driver are running on two different
cores as it reduces the number of syscalls to the kernel. The kernel
tells user space if it needs to be woken up by a syscall, and this
eliminates many of the syscalls.

This flag needs some simple driver support. If the driver does not
support this, the Rx flag is always zero and the Tx flag is always
one. This makes any application relying on this feature default to the
old behaviour of not requiring any syscalls in the Rx path and always
having to call sendto() in the Tx path.

For backwards compatibility reasons, this feature has to be explicitly
turned on using a new bind flag (XDP_USE_NEED_WAKEUP). I recommend
that you always turn it on as it so far always have had a positive
performance impact.

The name and inspiration of the flag has been taken from io_uring by
Jens Axboe. Details about this feature in io_uring can be found in
http://kernel.dk/io_uring.pdf, section 8.3.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-08-17 23:07:32 +02:00
Ido Schimmel
f3047ca01f Documentation: Add devlink-trap documentation
Add initial documentation of the devlink-trap mechanism, explaining the
background, motivation and the semantics of the interface.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-17 12:40:08 -07:00
Ido Schimmel
391203ab11 devlink: Add generic packet traps and groups
Add generic packet traps and groups that can report dropped packets as
well as exceptions such as TTL error.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-17 12:40:08 -07:00
Ido Schimmel
0f420b6c52 devlink: Add packet trap infrastructure
Add the basic packet trap infrastructure that allows device drivers to
register their supported packet traps and trap groups with devlink.

Each driver is expected to provide basic information about each
supported trap, such as name and ID, but also the supported metadata
types that will accompany each packet trapped via the trap. The
currently supported metadata type is just the input port, but more will
be added in the future. For example, output port and traffic class.

Trap groups allow users to set the action of all member traps. In
addition, users can retrieve per-group statistics in case per-trap
statistics are too narrow. In the future, the trap group object can be
extended with more attributes, such as policer settings which will limit
the amount of traffic generated by member traps towards the CPU.

Beside registering their packet traps with devlink, drivers are also
expected to report trapped packets to devlink along with relevant
metadata. devlink will maintain packets and bytes statistics for each
packet trap and will potentially report the trapped packet with its
metadata to user space via drop monitor netlink channel.

The interface towards the drivers is simple and allows devlink to set
the action of the trap. Currently, only two actions are supported:
'trap' and 'drop'. When set to 'trap', the device is expected to provide
the sole copy of the packet to the driver which will pass it to devlink.
When set to 'drop', the device is expected to drop the packet and not
send a copy to the driver. In the future, more actions can be added,
such as 'mirror'.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-17 12:40:08 -07:00
Ido Schimmel
edd3d0074c drop_monitor: Add basic infrastructure for hardware drops
Export a function that can be invoked in order to report packets that
were dropped by the underlying hardware along with metadata.

Subsequent patches will add support for the different alert modes.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-17 12:40:08 -07:00
David S. Miller
42eb455470 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:

====================
pull request: bluetooth 2019-08-17

Here's a set of Bluetooth fixes for the 5.3-rc series:

 - Multiple fixes for Qualcomm (btqca & hci_qca) drivers
 - Minimum encryption key size debugfs setting (this is required for
   Bluetooth Qualification)
 - Fix hidp_send_message() to have a meaningful return value
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-17 12:37:47 -07:00
Marcel Holtmann
58a96fc353 Bluetooth: Add debug setting for changing minimum encryption key size
For testing and qualification purposes it is useful to allow changing
the minimum encryption key size value that the host stack is going to
enforce. This adds a new debugfs setting min_encrypt_key_size to achieve
this functionality.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2019-08-17 13:54:40 +03:00
David S. Miller
12ed601513 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

This patchset contains Netfilter fixes for net:

1) Extend selftest to cover flowtable with ipsec, from Florian Westphal.

2) Fix interaction of ipsec with flowtable, also from Florian.

3) User-after-free with bound set to rule that fails to load.

4) Adjust state and timeout for flows that expire.

5) Timeout update race with flows in teardown state.

6) Ensure conntrack id hash calculation use invariants as input,
   from Dirk Morris.

7) Do not push flows into flowtable for TCP fin/rst packets.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-15 14:01:14 -07:00
David Ahern
d00ee64e1d netlink: Fix nlmsg_parse as a wrapper for strict message parsing
Eric reported a syzbot warning:

BUG: KMSAN: uninit-value in nh_valid_get_del_req+0x6f1/0x8c0 net/ipv4/nexthop.c:1510
CPU: 0 PID: 11812 Comm: syz-executor444 Not tainted 5.3.0-rc3+ #17
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x191/0x1f0 lib/dump_stack.c:113
 kmsan_report+0x162/0x2d0 mm/kmsan/kmsan_report.c:109
 __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:294
 nh_valid_get_del_req+0x6f1/0x8c0 net/ipv4/nexthop.c:1510
 rtm_del_nexthop+0x1b1/0x610 net/ipv4/nexthop.c:1543
 rtnetlink_rcv_msg+0x115a/0x1580 net/core/rtnetlink.c:5223
 netlink_rcv_skb+0x431/0x620 net/netlink/af_netlink.c:2477
 rtnetlink_rcv+0x50/0x60 net/core/rtnetlink.c:5241
 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline]
 netlink_unicast+0xf6c/0x1050 net/netlink/af_netlink.c:1328
 netlink_sendmsg+0x110f/0x1330 net/netlink/af_netlink.c:1917
 sock_sendmsg_nosec net/socket.c:637 [inline]
 sock_sendmsg net/socket.c:657 [inline]
 ___sys_sendmsg+0x14ff/0x1590 net/socket.c:2311
 __sys_sendmmsg+0x53a/0xae0 net/socket.c:2413
 __do_sys_sendmmsg net/socket.c:2442 [inline]
 __se_sys_sendmmsg+0xbd/0xe0 net/socket.c:2439
 __x64_sys_sendmmsg+0x56/0x70 net/socket.c:2439
 do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:297
 entry_SYSCALL_64_after_hwframe+0x63/0xe7

The root cause is nlmsg_parse calling __nla_parse which means the
header struct size is not checked.

nlmsg_parse should be a wrapper around __nlmsg_parse with
NL_VALIDATE_STRICT for the validate argument very much like
nlmsg_parse_deprecated is for NL_VALIDATE_LIBERAL.

Fixes: 3de6440354 ("netlink: re-add parse/validate functions in strict mode")
Reported-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-13 20:37:16 -07:00
Jakub Kicinski
c162610c7d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next:

1) Rename mss field to mss_option field in synproxy, from Fernando Mancera.

2) Use SYSCTL_{ZERO,ONE} definitions in conntrack, from Matteo Croce.

3) More strict validation of IPVS sysctl values, from Junwei Hu.

4) Remove unnecessary spaces after on the right hand side of assignments,
   from yangxingwu.

5) Add offload support for bitwise operation.

6) Extend the nft_offload_reg structure to store immediate date.

7) Collapse several ip_set header files into ip_set.h, from
   Jeremy Sowden.

8) Make netfilter headers compile with CONFIG_KERNEL_HEADER_TEST=y,
   from Jeremy Sowden.

9) Fix several sparse warnings due to missing prototypes, from
   Valdis Kletnieks.

10) Use static lock initialiser to ensure connlabel spinlock is
    initialized on boot time to fix sched/act_ct.c, patch
    from Florian Westphal.
====================

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-13 18:22:57 -07:00
Jakub Kicinski
708852dcac Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
The following pull-request contains BPF updates for your *net-next* tree.

There is a small merge conflict in libbpf (Cc Andrii so he's in the loop
as well):

        for (i = 1; i <= btf__get_nr_types(btf); i++) {
                t = (struct btf_type *)btf__type_by_id(btf, i);

                if (!has_datasec && btf_is_var(t)) {
                        /* replace VAR with INT */
                        t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
  <<<<<<< HEAD
                        /*
                         * using size = 1 is the safest choice, 4 will be too
                         * big and cause kernel BTF validation failure if
                         * original variable took less than 4 bytes
                         */
                        t->size = 1;
                        *(int *)(t+1) = BTF_INT_ENC(0, 0, 8);
                } else if (!has_datasec && kind == BTF_KIND_DATASEC) {
  =======
                        t->size = sizeof(int);
                        *(int *)(t + 1) = BTF_INT_ENC(0, 0, 32);
                } else if (!has_datasec && btf_is_datasec(t)) {
  >>>>>>> 72ef80b5ee
                        /* replace DATASEC with STRUCT */

Conflict is between the two commits 1d4126c4e1 ("libbpf: sanitize VAR to
conservative 1-byte INT") and b03bc6853c ("libbpf: convert libbpf code to
use new btf helpers"), so we need to pick the sanitation fixup as well as
use the new btf_is_datasec() helper and the whitespace cleanup. Looks like
the following:

  [...]
                if (!has_datasec && btf_is_var(t)) {
                        /* replace VAR with INT */
                        t->info = BTF_INFO_ENC(BTF_KIND_INT, 0, 0);
                        /*
                         * using size = 1 is the safest choice, 4 will be too
                         * big and cause kernel BTF validation failure if
                         * original variable took less than 4 bytes
                         */
                        t->size = 1;
                        *(int *)(t + 1) = BTF_INT_ENC(0, 0, 8);
                } else if (!has_datasec && btf_is_datasec(t)) {
                        /* replace DATASEC with STRUCT */
  [...]

The main changes are:

1) Addition of core parts of compile once - run everywhere (co-re) effort,
   that is, relocation of fields offsets in libbpf as well as exposure of
   kernel's own BTF via sysfs and loading through libbpf, from Andrii.

   More info on co-re: http://vger.kernel.org/bpfconf2019.html#session-2
   and http://vger.kernel.org/lpc-bpf2018.html#session-2

2) Enable passing input flags to the BPF flow dissector to customize parsing
   and allowing it to stop early similar to the C based one, from Stanislav.

3) Add a BPF helper function that allows generating SYN cookies from XDP and
   tc BPF, from Petar.

4) Add devmap hash-based map type for more flexibility in device lookup for
   redirects, from Toke.

5) Improvements to XDP forwarding sample code now utilizing recently enabled
   devmap lookups, from Jesper.

6) Add support for reporting the effective cgroup progs in bpftool, from Jakub
   and Takshak.

7) Fix reading kernel config from bpftool via /proc/config.gz, from Peter.

8) Fix AF_XDP umem pages mapping for 32 bit architectures, from Ivan.

9) Follow-up to add two more BPF loop tests for the selftest suite, from Alexei.

10) Add perf event output helper also for other skb-based program types, from Allan.

11) Fix a co-re related compilation error in selftests, from Yonghong.
====================

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
2019-08-13 16:24:57 -07:00
Jeremy Sowden
78458e3e08 netfilter: add missing IS_ENABLED(CONFIG_NETFILTER) checks to some header-files.
linux/netfilter.h defines a number of struct and inline function
definitions which are only available is CONFIG_NETFILTER is enabled.
These structs and functions are used in declarations and definitions in
other header-files.  Added preprocessor checks to make sure these
headers will compile if CONFIG_NETFILTER is disabled.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-13 12:15:18 +02:00
Jeremy Sowden
0abc8bf4f2 netfilter: add missing IS_ENABLED(CONFIG_NF_CONNTRACK) checks to some header-files.
struct nf_conn contains a "struct nf_conntrack ct_general" member and
struct net contains a "struct netns_ct ct" member which are both only
defined in CONFIG_NF_CONNTRACK is enabled.  These members are used in a
number of inline functions defined in other header-files.  Added
preprocessor checks to make sure the headers will compile if
CONFIG_NF_CONNTRACK is disabled.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-13 12:15:08 +02:00
Jeremy Sowden
47e640af2e netfilter: add missing IS_ENABLED(CONFIG_NF_TABLES) check to header-file.
nf_tables.h defines an API comprising several inline functions and
macros that depend on the nft member of struct net.  However, this is
only defined is CONFIG_NF_TABLES is enabled.  Added preprocessor checks
to ensure that nf_tables.h will compile if CONFIG_NF_TABLES is disabled.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-13 12:14:58 +02:00
Jeremy Sowden
9211bfbff8 netfilter: add missing IS_ENABLED(CONFIG_BRIDGE_NETFILTER) checks to header-file.
br_netfilter.h defines inline functions that use an enum constant and
struct member that are only defined if CONFIG_BRIDGE_NETFILTER is
enabled.  Added preprocessor checks to ensure br_netfilter.h will
compile if CONFIG_BRIDGE_NETFILTER is disabled.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-13 12:14:49 +02:00
Jeremy Sowden
a1b2f04ea5 netfilter: add missing includes to a number of header-files.
A number of netfilter header-files used declarations and definitions
from other headers without including them.  Added include directives to
make those declarations and definitions available.

Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-13 12:14:39 +02:00
Pablo Neira Ayuso
43dd16efc7 netfilter: nf_tables: store data in offload context registers
Store immediate data into offload context register. This allows follow
up instructions to take it from the corresponding source register.

This patch is required to support for payload mangling, although other
instructions that take data from source register will benefit from this
too.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-13 12:10:01 +02:00
Daniel Borkmann
cd48bdda4f sock: make cookie generation global instead of per netns
Generating and retrieving socket cookies are a useful feature that is
exposed to BPF for various program types through bpf_get_socket_cookie()
helper.

The fact that the cookie counter is per netns is quite a limitation
for BPF in practice in particular for programs in host namespace that
use socket cookies as part of a map lookup key since they will be
causing socket cookie collisions e.g. when attached to BPF cgroup hooks
or cls_bpf on tc egress in host namespace handling container traffic
from veth or ipvlan devices with peer in different netns. Change the
counter to be global instead.

Socket cookie consumers must assume the value as opqaue in any case.
Not every socket must have a cookie generated and knowledge of the
counter value itself does not provide much value either way hence
conversion to global is fine.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Martynas Pumputis <m@lambda.lt>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-09 13:14:46 -07:00
Josh Hunt
1555e6fdf0 tcp: Update TCP_BASE_MSS comment
TCP_BASE_MSS is used as the default initial MSS value when MTU probing is
enabled. Update the comment to reflect this.

Suggested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-09 13:03:30 -07:00
Josh Hunt
c04b79b6cf tcp: add new tcp_mtu_probe_floor sysctl
The current implementation of TCP MTU probing can considerably
underestimate the MTU on lossy connections allowing the MSS to get down to
48. We have found that in almost all of these cases on our networks these
paths can handle much larger MTUs meaning the connections are being
artificially limited. Even though TCP MTU probing can raise the MSS back up
we have seen this not to be the case causing connections to be "stuck" with
an MSS of 48 when heavy loss is present.

Prior to pushing out this change we could not keep TCP MTU probing enabled
b/c of the above reasons. Now with a reasonble floor set we've had it
enabled for the past 6 months.

The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
administrators the ability to control the floor of MSS probing.

Signed-off-by: Josh Hunt <johunt@akamai.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-09 13:03:30 -07:00
Jiri Pirko
3a5e523479 devlink: remove pointless data_len arg from region snapshot create
The size of the snapshot has to be the same as the size of the region,
therefore no need to pass it again during snapshot creation. Remove the
arg and use region->size instead.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-09 11:21:46 -07:00
Pablo Neira Ayuso
6a0a8d10a3 netfilter: nf_tables: use-after-free in failing rule with bound set
If a rule that has already a bound anonymous set fails to be added, the
preparation phase releases the rule and the bound set. However, the
transaction object from the abort path still has a reference to the set
object that is stale, leading to a use-after-free when checking for the
set->bound field. Add a new field to the transaction that specifies if
the set is bound, so the abort path can skip releasing it since the rule
command owns it and it takes care of releasing it. After this update,
the set->bound field is removed.

[   24.649883] Unable to handle kernel paging request at virtual address 0000000000040434
[   24.657858] Mem abort info:
[   24.660686]   ESR = 0x96000004
[   24.663769]   Exception class = DABT (current EL), IL = 32 bits
[   24.669725]   SET = 0, FnV = 0
[   24.672804]   EA = 0, S1PTW = 0
[   24.675975] Data abort info:
[   24.678880]   ISV = 0, ISS = 0x00000004
[   24.682743]   CM = 0, WnR = 0
[   24.685723] user pgtable: 4k pages, 48-bit VAs, pgdp=0000000428952000
[   24.692207] [0000000000040434] pgd=0000000000000000
[   24.697119] Internal error: Oops: 96000004 [#1] SMP
[...]
[   24.889414] Call trace:
[   24.891870]  __nf_tables_abort+0x3f0/0x7a0
[   24.895984]  nf_tables_abort+0x20/0x40
[   24.899750]  nfnetlink_rcv_batch+0x17c/0x588
[   24.904037]  nfnetlink_rcv+0x13c/0x190
[   24.907803]  netlink_unicast+0x18c/0x208
[   24.911742]  netlink_sendmsg+0x1b0/0x350
[   24.915682]  sock_sendmsg+0x4c/0x68
[   24.919185]  ___sys_sendmsg+0x288/0x2c8
[   24.923037]  __sys_sendmsg+0x7c/0xd0
[   24.926628]  __arm64_sys_sendmsg+0x2c/0x38
[   24.930744]  el0_svc_common.constprop.0+0x94/0x158
[   24.935556]  el0_svc_handler+0x34/0x90
[   24.939322]  el0_svc+0x8/0xc
[   24.942216] Code: 37280300 f9404023 91014262 aa1703e0 (f9401863)
[   24.948336] ---[ end trace cebbb9dcbed3b56f ]---

Fixes: f6ac858589 ("netfilter: nf_tables: unbind set in rule from commit path")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-09 14:41:13 +02:00
Jakub Kicinski
414776621d net/tls: prevent skb_orphan() from leaking TLS plain text with offload
sk_validate_xmit_skb() and drivers depend on the sk member of
struct sk_buff to identify segments requiring encryption.
Any operation which removes or does not preserve the original TLS
socket such as skb_orphan() or skb_clone() will cause clear text
leaks.

Make the TCP socket underlying an offloaded TLS connection
mark all skbs as decrypted, if TLS TX is in offload mode.
Then in sk_validate_xmit_skb() catch skbs which have no socket
(or a socket with no validation) and decrypted flag set.

Note that CONFIG_SOCK_VALIDATE_XMIT, CONFIG_TLS_DEVICE and
sk->sk_validate_xmit_skb are slightly interchangeable right now,
they all imply TLS offload. The new checks are guarded by
CONFIG_TLS_DEVICE because that's the option guarding the
sk_buff->decrypted member.

Second, smaller issue with orphaning is that it breaks
the guarantee that packets will be delivered to device
queues in-order. All TLS offload drivers depend on that
scheduling property. This means skb_orphan_partial()'s
trick of preserving partial socket references will cause
issues in the drivers. We need a full orphan, and as a
result netem delay/throttling will cause all TLS offload
skbs to be dropped.

Reusing the sk_buff->decrypted flag also protects from
leaking clear text when incoming, decrypted skb is redirected
(e.g. by TC).

See commit 0608c69c9a ("bpf: sk_msg, sock{map|hash} redirect
through ULP") for justification why the internal flag is safe.
The only location which could leak the flag in is tcp_bpf_sendmsg(),
which is taken care of by clearing the previously unused bit.

v2:
 - remove superfluous decrypted mark copy (Willem);
 - remove the stale doc entry (Boris);
 - rely entirely on EOR marking to prevent coalescing (Boris);
 - use an internal sendpages flag instead of marking the socket
   (Boris).
v3 (Willem):
 - reorganize the can_skb_orphan_partial() condition;
 - fix the flag leak-in through tcp_bpf_sendmsg.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08 22:39:35 -07:00
wenxu
9a32669fec netfilter: nf_tables_offload: support indr block call
nftable support indr-block call. It makes nftable an offload vlan
and tunnel device.

nft add table netdev firewall
nft add chain netdev firewall aclout { type filter hook ingress offload device mlx_pf0vf0 priority - 300 \; }
nft add rule netdev firewall aclout ip daddr 10.0.0.1 fwd to vlan0
nft add chain netdev firewall aclin { type filter hook ingress device vlan0 priority - 300 \; }
nft add rule netdev firewall aclin ip daddr 10.0.0.7 fwd to mlx_pf0vf0

Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08 18:44:30 -07:00
wenxu
1150ab0f1b flow_offload: support get multi-subsystem block
It provide a callback list to find the blocks of tc
and nft subsystems

Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08 18:44:30 -07:00
wenxu
4e481908c5 flow_offload: move tc indirect block to flow offload
move tc indirect block to flow_offload and rename
it to flow indirect block.The nf_tables can use the
indr block architecture.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08 18:44:30 -07:00
Guillaume Nault
891584f48a inet: frags: re-introduce skb coalescing for local delivery
Before commit d4289fcc9b ("net: IP6 defrag: use rbtrees for IPv6
defrag"), a netperf UDP_STREAM test[0] using big IPv6 datagrams (thus
generating many fragments) and running over an IPsec tunnel, reported
more than 6Gbps throughput. After that patch, the same test gets only
9Mbps when receiving on a be2net nic (driver can make a big difference
here, for example, ixgbe doesn't seem to be affected).

By reusing the IPv4 defragmentation code, IPv6 lost fragment coalescing
(IPv4 fragment coalescing was dropped by commit 14fe22e334 ("Revert
"ipv4: use skb coalescing in defragmentation"")).

Without fragment coalescing, be2net runs out of Rx ring entries and
starts to drop frames (ethtool reports rx_drops_no_frags errors). Since
the netperf traffic is only composed of UDP fragments, any lost packet
prevents reassembly of the full datagram. Therefore, fragments which
have no possibility to ever get reassembled pile up in the reassembly
queue, until the memory accounting exeeds the threshold. At that point
no fragment is accepted anymore, which effectively discards all
netperf traffic.

When reassembly timeout expires, some stale fragments are removed from
the reassembly queue, so a few packets can be received, reassembled
and delivered to the netperf receiver. But the nic still drops frames
and soon the reassembly queue gets filled again with stale fragments.
These long time frames where no datagram can be received explain why
the performance drop is so significant.

Re-introducing fragment coalescing is enough to get the initial
performances again (6.6Gbps with be2net): driver doesn't drop frames
anymore (no more rx_drops_no_frags errors) and the reassembly engine
works at full speed.

This patch is quite conservative and only coalesces skbs for local
IPv4 and IPv6 delivery (in order to avoid changing skb geometry when
forwarding). Coalescing could be extended in the future if need be, as
more scenarios would probably benefit from it.

[0]: Test configuration
Sender:
ip xfrm policy flush
ip xfrm state flush
ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir in tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir out tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
netserver -D -L fc00:2::1

Receiver:
ip xfrm policy flush
ip xfrm state flush
ip xfrm state add src fc00:2::1 dst fc00:1::1 proto esp spi 0x1001 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:2::1 dst fc00:1::1
ip xfrm policy add src fc00:2::1 dst fc00:1::1 dir in tmpl src fc00:2::1 dst fc00:1::1 proto esp mode transport action allow
ip xfrm state add src fc00:1::1 dst fc00:2::1 proto esp spi 0x1000 aead 'rfc4106(gcm(aes))' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b 96 mode transport sel src fc00:1::1 dst fc00:2::1
ip xfrm policy add src fc00:1::1 dst fc00:2::1 dir out tmpl src fc00:1::1 dst fc00:2::1 proto esp mode transport action allow
netperf -H fc00:2::1 -f k -P 0 -L fc00:1::1 -l 60 -t UDP_STREAM -I 99,5 -i 5,5 -T5,5 -6

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-08 15:55:10 -07:00
David S. Miller
13dfb3fa49 Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Just minor overlapping changes in the conflicts here.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06 18:44:57 -07:00
John Hurley
48e584ac58 net: sched: add ingress mirred action to hardware IR
TC mirred actions (redirect and mirred) can send to egress or ingress of a
device. Currently only egress is used for hw offload rules.

Modify the intermediate representation for hw offload to include mirred
actions that go to ingress. This gives drivers access to such rules and
can decide whether or not to offload them.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06 14:24:21 -07:00
John Hurley
d7609c96c6 net: tc_act: add helpers to detect ingress mirred actions
TC mirred actions can send to egress or ingress on a given netdev. Helpers
exist to detect actions that are mirred to egress. Extend the header file
to include helpers to detect ingress mirred actions.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06 14:24:21 -07:00
John Hurley
fb1b775a24 net: sched: add skbedit of ptype action to hardware IR
TC rules can impliment skbedit actions. Currently actions that modify the
skb mark are passed to offloading drivers via the hardware intermediate
representation in the flow_offload API.

Extend this to include skbedit actions that modify the packet type of the
skb. Such actions may be used to set the ptype to HOST when redirecting a
packet to ingress.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06 14:24:21 -07:00
John Hurley
77feb4eed7 net: tc_act: add skbedit_ptype helper functions
The tc_act header file contains an inline function that checks if an
action is changing the skb mark of a packet and a further function to
extract the mark.

Add similar functions to check for and get skbedit actions that modify
the packet type of the skb.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06 14:24:21 -07:00
Vlad Buslov
67cbf7dedd net: sched: sample: allow accessing psample_group with rtnl
Recently implemented support for sample action in flow_offload infra leads
to following rcu usage warning:

[ 1938.234856] =============================
[ 1938.234858] WARNING: suspicious RCU usage
[ 1938.234863] 5.3.0-rc1+ #574 Not tainted
[ 1938.234866] -----------------------------
[ 1938.234869] include/net/tc_act/tc_sample.h:47 suspicious rcu_dereference_check() usage!
[ 1938.234872]
               other info that might help us debug this:

[ 1938.234875]
               rcu_scheduler_active = 2, debug_locks = 1
[ 1938.234879] 1 lock held by tc/19540:
[ 1938.234881]  #0: 00000000b03cb918 (rtnl_mutex){+.+.}, at: tc_new_tfilter+0x47c/0x970
[ 1938.234900]
               stack backtrace:
[ 1938.234905] CPU: 2 PID: 19540 Comm: tc Not tainted 5.3.0-rc1+ #574
[ 1938.234908] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[ 1938.234911] Call Trace:
[ 1938.234922]  dump_stack+0x85/0xc0
[ 1938.234930]  tc_setup_flow_action+0xed5/0x2040
[ 1938.234944]  fl_hw_replace_filter+0x11f/0x2e0 [cls_flower]
[ 1938.234965]  fl_change+0xd24/0x1b30 [cls_flower]
[ 1938.234990]  tc_new_tfilter+0x3e0/0x970
[ 1938.235021]  ? tc_del_tfilter+0x720/0x720
[ 1938.235028]  rtnetlink_rcv_msg+0x389/0x4b0
[ 1938.235038]  ? netlink_deliver_tap+0x95/0x400
[ 1938.235044]  ? rtnl_dellink+0x2d0/0x2d0
[ 1938.235053]  netlink_rcv_skb+0x49/0x110
[ 1938.235063]  netlink_unicast+0x171/0x200
[ 1938.235073]  netlink_sendmsg+0x224/0x3f0
[ 1938.235091]  sock_sendmsg+0x5e/0x60
[ 1938.235097]  ___sys_sendmsg+0x2ae/0x330
[ 1938.235111]  ? __handle_mm_fault+0x12cd/0x19e0
[ 1938.235125]  ? __handle_mm_fault+0x12cd/0x19e0
[ 1938.235138]  ? find_held_lock+0x2b/0x80
[ 1938.235147]  ? do_user_addr_fault+0x22d/0x490
[ 1938.235160]  __sys_sendmsg+0x59/0xa0
[ 1938.235178]  do_syscall_64+0x5c/0xb0
[ 1938.235187]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1938.235192] RIP: 0033:0x7ff9a4d597b8
[ 1938.235197] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83
 ec 28 89 54
[ 1938.235200] RSP: 002b:00007ffcfe381c48 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 1938.235205] RAX: ffffffffffffffda RBX: 000000005d4497f9 RCX: 00007ff9a4d597b8
[ 1938.235208] RDX: 0000000000000000 RSI: 00007ffcfe381cb0 RDI: 0000000000000003
[ 1938.235211] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006
[ 1938.235214] R10: 0000000000404ec2 R11: 0000000000000246 R12: 0000000000000001
[ 1938.235217] R13: 0000000000480640 R14: 0000000000000012 R15: 0000000000000001

Change tcf_sample_psample_group() helper to allow using it from both rtnl
and rcu protected contexts.

Fixes: a7a7be6087 ("net/sched: add sample action to the hardware intermediate representation")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06 14:15:39 -07:00
Vlad Buslov
c4bd48699b net: sched: police: allow accessing police->params with rtnl
Recently implemented support for police action in flow_offload infra leads
to following rcu usage warning:

[ 1925.881092] =============================
[ 1925.881094] WARNING: suspicious RCU usage
[ 1925.881098] 5.3.0-rc1+ #574 Not tainted
[ 1925.881100] -----------------------------
[ 1925.881104] include/net/tc_act/tc_police.h:57 suspicious rcu_dereference_check() usage!
[ 1925.881106]
               other info that might help us debug this:

[ 1925.881109]
               rcu_scheduler_active = 2, debug_locks = 1
[ 1925.881112] 1 lock held by tc/18591:
[ 1925.881115]  #0: 00000000b03cb918 (rtnl_mutex){+.+.}, at: tc_new_tfilter+0x47c/0x970
[ 1925.881124]
               stack backtrace:
[ 1925.881127] CPU: 2 PID: 18591 Comm: tc Not tainted 5.3.0-rc1+ #574
[ 1925.881130] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[ 1925.881132] Call Trace:
[ 1925.881138]  dump_stack+0x85/0xc0
[ 1925.881145]  tc_setup_flow_action+0x1771/0x2040
[ 1925.881155]  fl_hw_replace_filter+0x11f/0x2e0 [cls_flower]
[ 1925.881175]  fl_change+0xd24/0x1b30 [cls_flower]
[ 1925.881200]  tc_new_tfilter+0x3e0/0x970
[ 1925.881231]  ? tc_del_tfilter+0x720/0x720
[ 1925.881243]  rtnetlink_rcv_msg+0x389/0x4b0
[ 1925.881250]  ? netlink_deliver_tap+0x95/0x400
[ 1925.881257]  ? rtnl_dellink+0x2d0/0x2d0
[ 1925.881264]  netlink_rcv_skb+0x49/0x110
[ 1925.881275]  netlink_unicast+0x171/0x200
[ 1925.881284]  netlink_sendmsg+0x224/0x3f0
[ 1925.881299]  sock_sendmsg+0x5e/0x60
[ 1925.881305]  ___sys_sendmsg+0x2ae/0x330
[ 1925.881309]  ? task_work_add+0x43/0x50
[ 1925.881314]  ? fput_many+0x45/0x80
[ 1925.881329]  ? __lock_acquire+0x248/0x1930
[ 1925.881342]  ? find_held_lock+0x2b/0x80
[ 1925.881347]  ? task_work_run+0x7b/0xd0
[ 1925.881359]  __sys_sendmsg+0x59/0xa0
[ 1925.881375]  do_syscall_64+0x5c/0xb0
[ 1925.881381]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1925.881384] RIP: 0033:0x7feb245047b8
[ 1925.881388] Code: 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 8d 05 65 8f 0c 00 8b 00 85 c0 75 17 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 58 c3 0f 1f 80 00 00 00 00 48 83
 ec 28 89 54
[ 1925.881391] RSP: 002b:00007ffc2d2a5788 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 1925.881395] RAX: ffffffffffffffda RBX: 000000005d4497ed RCX: 00007feb245047b8
[ 1925.881398] RDX: 0000000000000000 RSI: 00007ffc2d2a57f0 RDI: 0000000000000003
[ 1925.881400] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006
[ 1925.881403] R10: 0000000000404ec2 R11: 0000000000000246 R12: 0000000000000001
[ 1925.881406] R13: 0000000000480640 R14: 0000000000000012 R15: 0000000000000001

Change tcf_police_rate_bytes_ps() and tcf_police_tcfp_burst() helpers to
allow using them from both rtnl and rcu protected contexts.

Fixes: 8c8cfc6ed2 ("net/sched: add police action to the hardware intermediate representation")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Reviewed-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-06 14:15:39 -07:00
Jakub Kicinski
5d92e631b8 net/tls: partially revert fix transition through disconnect with close
Looks like we were slightly overzealous with the shutdown()
cleanup. Even though the sock->sk_state can reach CLOSED again,
socket->state will not got back to SS_UNCONNECTED once
connections is ESTABLISHED. Meaning we will see EISCONN if
we try to reconnect, and EINVAL if we try to listen.

Only listen sockets can be shutdown() and reused, but since
ESTABLISHED sockets can never be re-connected() or used for
listen() we don't need to try to clean up the ULP state early.

Fixes: 32857cf57f ("net/tls: fix transition through disconnect with close")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-08-05 13:15:30 -07:00
Fernando Fernandez Mancera
8c0bb78738 netfilter: synproxy: rename mss synproxy_options field
After introduce "mss_encode" field in the synproxy_options struct the field
"mss" is a little confusing. It has been renamed to "mss_option".

Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-08-03 18:39:08 +02:00
David S. Miller
ac5fe22636 We have a reasonably large number of changes:
* lots more HE (802.11ax) support, particularly things
    relevant for the the AP side, but also mesh support
  * debugfs cleanups from Greg
  * some more work on extended key ID
  * start using genl parallel_ops, as preparation for
    weaning ourselves off RTNL and getting parallelism
  * various other changes all over
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl1BtsAACgkQB8qZga/f
 l8QTdQ//bD3NLitiIW4rBmC/w9str2TRpUbRN8Hf9xTnRh1smLGQop8jI8xpYdAE
 hXFlWn4glCkTkZrqtT8IDMcERb2YPHI94L7lMR1L6dXH7e1tiBZ7Qum5+rojLUCQ
 /XXZJnVa9AbdzBDtkhzCjkN8dCldMnkId0m6vkGdFre/b70FLDg2GlolqIchbDCz
 GxeKaw/uTLwu9nxEJhspDmiXDQtQd1ZsGNZRrovQ2nUuUfvX9sU54OmaoDHhqrUM
 iSwFZJAhrCV7nidOLs2X1rVs8hRymbrYnGu6NdUlfTQMqaOVgWhqniM1RMnJnnVi
 Ec/1OQg4KIj2rjJIXW/3QXAzAO6TDwV8xNk9al3m+yAkkSXR7tXP01plNUoMU4UW
 YxcaPD341H919GIqa71lieeEMyi7XIKKFfaZvGFLiDzidKKi0JkDuC+1w5UOJrbj
 PM/dzvKqvwOMEa5XahjrUQLsiGMgxBkDo36+F4rfaelzDKPuJBOokDS23vg/lYf5
 2v74WbeJVc45cAhSwp5/0RTsG0xhmFImfrE66VNB1vcJUVLvHTuATfXsBbRahOz/
 SglGBAGqigxJwIeYfgec3lPNfdV+0LwYnPAEjBIjCO8qlGvst4jLroHg7ayApr2E
 JfUbaWtpEb9cgAz7X6tEjv/BAnsRWPzpxb8Nm6JoZttTtA2VakE=
 =eUnh
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2019-07-31' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
We have a reasonably large number of changes:
 * lots more HE (802.11ax) support, particularly things
   relevant for the the AP side, but also mesh support
 * debugfs cleanups from Greg
 * some more work on extended key ID
 * start using genl parallel_ops, as preparation for
   weaning ourselves off RTNL and getting parallelism
 * various other changes all over
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-31 08:59:41 -07:00
David S. Miller
f86a677e57 Just a few fixes:
* revert NETIF_F_LLTX usage as it caused problems
  * avoid warning on WMM parameters from AP that are too short
  * fix possible null-ptr dereference in hwsim
  * fix interface combinations with 4-addr and crypto control
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl1BfGcACgkQB8qZga/f
 l8TE/A/9G98DIi7UpuseyrW1y+3Bab5G+N4JhNDYMF0fC+f+6VZPsQRQi0aH2GVE
 fBDnFQogiqIY8WwgQ5RoyUjOux43INpNp3uPjOCzcdjj9+D+lBX/27yhNVBVYK5H
 Uxq08W2Mv+GWcswa5NMGUF9FlARbIMSu1BsDML031LzSckNvSLwMuANSx/1/d/gr
 rUa1waaGeC2yT/baefNaHDiqYBj/6Xm6U0Adl1PphGJZ587UjlylCwozNNIrUIwW
 PeysRXIUakePEqNs0Kl6U6lUJoH3g1KB+fLl5ea94L2aBHHRRaBAG1yROidg1/JG
 fmz8jostoasG+wuycNSdETVgY3TT37Hv0R7rRB0Wj5QvrdXqgyrAZSR97dMWR9wR
 fZLZCXtYnxOVZwKUlDusSjDFm9CKdhM+KqINHXBa6RRWpQjFYn4AvbQlJEZrbyuJ
 lBOvXAvmF1fsX5x9hmDLRM7Irtb9Awb7eHbT+T5c+BtWKvnR2GsujMy2f6TY+swN
 WPmER6MtQ80VPk9xpLVxU02vLs0EIdoPxE7I348ozDmW6sln0kRI1Pp7uOqDO1RC
 sX6uA8bSTp9u2tKsgevkHqiz3YiJOqmy475tcOg9F/CefsULZzVfBE0bsU4v0jnJ
 LTXJS2GtmPhG4AJFn+qiqvvOzJGhAJAEsvmPvUgH0AmB4m0+Cqg=
 =rE1e
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2019-07-31' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just a few fixes:
 * revert NETIF_F_LLTX usage as it caused problems
 * avoid warning on WMM parameters from AP that are too short
 * fix possible null-ptr dereference in hwsim
 * fix interface combinations with 4-addr and crypto control
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-31 08:51:34 -07:00
John Crispin
1ced169cc1 mac80211: allow setting spatial reuse parameters from bss_conf
Store the OBSS PD parameters inside bss_conf when bringing up an AP and/or
when a station connects to an AP. This allows the driver to configure the
HW accordingly.

Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20190730163701.18836-3-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-31 11:00:52 +02:00
John Crispin
796e90f42b cfg80211: add support for parsing OBBS_PD attributes
Add the data structure, policy and parsing code allowing userland to send
the OBSS PD information into the kernel.

Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20190730163701.18836-2-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-31 11:00:52 +02:00
Petar Penkov
9349d600fb tcp: add skb-less helpers to retrieve SYN cookie
This patch allows generation of a SYN cookie before an SKB has been
allocated, as is the case at XDP.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Reviewed-by: Lorenz Bauer <lmb@cloudflare.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-07-30 21:03:05 -07:00
Tristram Ha
016e43a26b net: dsa: ksz: Add KSZ8795 tag code
Add DSA tag code for Microchip KSZ8795 switch. The switch is simpler
and the tag is only 1 byte, instead of 2 as is the case with KSZ9477.

Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>
Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Andrew Lunn <andrew@lunn.ch>
Cc: David S. Miller <davem@davemloft.net>
Cc: Florian Fainelli <f.fainelli@gmail.com>
Cc: Tristram Ha <Tristram.Ha@microchip.com>
Cc: Vivien Didelot <vivien.didelot@gmail.com>
Cc: Woojung Huh <woojung.huh@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-30 15:12:50 -07:00
John Crispin
697f6c507c mac80211: propagate HE operation info into bss_conf
Upon a successful assoc a station shall store the content of the HE
operation element inside bss_conf so that the driver can setup the
hardware accordingly.

Signed-off-by: Shashidhar Lakkavalli <slakkavalli@datto.com>
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: Kalle Valo <kvalo@codeaurora.org>
Link: https://lore.kernel.org/r/20190729102342.8659-2-john@phrozen.org
[use struct copy]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-29 16:39:37 +02:00
Lorenzo Bianconi
a0b4496a43 mac80211: add IEEE80211_KEY_FLAG_GENERATE_MMIE to ieee80211_key_flags
Add IEEE80211_KEY_FLAG_GENERATE_MMIE flag to ieee80211_key_flags in order
to allow the driver to notify mac80211 to generate MMIE and that it
requires sequence number generation only.
This is a preliminary patch to add BIP_CMAC_128 hw support to mt7615
driver

Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org>
Link: https://lore.kernel.org/r/dfe275f9aa0f1cc6b33085f9efd5d8447f68ad13.1563228405.git.lorenzo@kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-26 16:14:12 +02:00
Ondrej Mosnacek
91b05a7e7d crypto: user - make NETLINK_CRYPTO work inside netns
Currently, NETLINK_CRYPTO works only in the init network namespace. It
doesn't make much sense to cut it out of the other network namespaces,
so do the minor plumbing work necessary to make it work in any network
namespace. Code inspired by net/core/sock_diag.c.

Tested using kcapi-dgst from libkcapi [1]:
Before:
    # unshare -n kcapi-dgst -c sha256 </dev/null | wc -c
    libkcapi - Error: Netlink error: sendmsg failed
    libkcapi - Error: Netlink error: sendmsg failed
    libkcapi - Error: NETLINK_CRYPTO: cannot obtain cipher information for hmac(sha512) (is required crypto_user.c patch missing? see documentation)
    0

After:
    # unshare -n kcapi-dgst -c sha256 </dev/null | wc -c
    32

[1] https://github.com/smuellerDD/libkcapi

Signed-off-by: Ondrej Mosnacek <omosnace@redhat.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2019-07-26 22:08:02 +10:00
Manikanta Pubbisetty
e6f4051123 {nl,mac}80211: fix interface combinations on crypto controlled devices
Commit 33d915d9e8 ("{nl,mac}80211: allow 4addr AP operation on
crypto controlled devices") has introduced a change which allows
4addr operation on crypto controlled devices (ex: ath10k). This
change has inadvertently impacted the interface combinations logic
on such devices.

General rule is that software interfaces like AP/VLAN should not be
listed under supported interface combinations and should not be
considered during validation of these combinations; because of the
aforementioned change, AP/VLAN interfaces(if present) will be checked
against interfaces supported by the device and blocks valid interface
combinations.

Consider a case where an AP and AP/VLAN are up and running; when a
second AP device is brought up on the same physical device, this AP
will be checked against the AP/VLAN interface (which will not be
part of supported interface combinations of the device) and blocks
second AP to come up.

Add a new API cfg80211_iftype_allowed() to fix the problem, this
API works for all devices with/without SW crypto control.

Signed-off-by: Manikanta Pubbisetty <mpubbise@codeaurora.org>
Fixes: 33d915d9e8 ("{nl,mac}80211: allow 4addr AP operation on crypto controlled devices")
Link: https://lore.kernel.org/r/1563779690-9716-1-git-send-email-mpubbise@codeaurora.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-26 13:50:43 +02:00
John Crispin
cbe77dde47 mac80211: add xmit rate to struct ieee80211_tx_status
Right now struct ieee80211_tx_rate cannot hold HE rates. Lets use
struct ieee80211_tx_status instead. This will also make the code
future-proof for when we have EHT.

Signed-off-by: John Crispin <john@phrozen.org>
Link: https://lore.kernel.org/r/20190714154419.11854-2-john@phrozen.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-26 13:41:45 +02:00
Alexander Wetzel
dc3998ec5c mac80211: AMPDU handling for rekeys with Extended Key ID
Extended Key ID allows A-MPDU sessions while rekeying as long as each
A-MPDU aggregates only MPDUs with one keyid together.

Drivers able to segregate MPDUs accordingly can tell mac80211 to not
stop A-MPDU sessions when rekeying by setting the new flag
IEEE80211_HW_AMPDU_KEYBORDER_SUPPORT.

Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
Link: https://lore.kernel.org/r/20190629195015.19680-3-alexander@wetzel-home.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-26 13:29:10 +02:00
Alexander Wetzel
3e47bf1ca4 mac80211: Simplify Extended Key ID API
1) Drop IEEE80211_HW_EXT_KEY_ID_NATIVE and let drivers directly set
   the NL80211_EXT_FEATURE_EXT_KEY_ID flag.

2) Drop IEEE80211_HW_NO_AMPDU_KEYBORDER_SUPPORT and simply assume all
   drivers are unable to handle A-MPDU key borders.

The new Extended Key ID API now requires all mac80211 drivers to set
NL80211_EXT_FEATURE_EXT_KEY_ID when they implement set_key() and can
handle Extended Key ID. For drivers not providing set_key() mac80211
itself enables Extended Key ID support, using the internal SW crypto
services.

Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
Link: https://lore.kernel.org/r/20190629195015.19680-2-alexander@wetzel-home.de
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-26 13:28:59 +02:00
Erik Stromdahl
fb0e76abe3 mac80211: add tx dequeue function for process context
Since ieee80211_tx_dequeue() must not be called with softirqs enabled
(i.e. from process context without proper disable of bottom halves),
we add a wrapper that disables bottom halves before calling
ieee80211_tx_dequeue()

The new function is named ieee80211_tx_dequeue_ni() just as all other
from-process-context versions found in mac80211.

The documentation of ieee80211_tx_dequeue() is also updated so it
mentions that the function should not be called from process context.

Signed-off-by: Erik Stromdahl <erik.stromdahl@gmail.com>
Link: https://lore.kernel.org/r/20190617200140.6189-1-erik.stromdahl@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-26 13:23:19 +02:00
Greg Kroah-Hartman
612fcfd9b3 mac80211: remove unused and unneeded remove_sta_debugfs callback
The remove_sta_debugfs callback in struct rate_control_ops is no longer
used by any driver, as there is no need for it (the debugfs directory is
already removed recursivly by the mac80211 core.)  Because no one needs
it, just remove it to keep anyone else from accidentally using it in the
future.

Cc: Johannes Berg <johannes@sipsolutions.net>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-wireless@vger.kernel.org
Cc: netdev@vger.kernel.org
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Link: https://lore.kernel.org/r/20190612142658.12792-5-gregkh@linuxfoundation.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-26 13:21:12 +02:00
Emmanuel Grumbach
5db4c4b955 mac80211: pass the vif to cancel_remain_on_channel
This low level driver can find it useful to get the vif
when a remain on channel session is cancelled.

iwlwifi will need this soon.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Link: https://lore.kernel.org/r/20190723180001.5828-1-emmanuel.grumbach@intel.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-26 13:08:28 +02:00
David S. Miller
28ba934d28 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Alexei Starovoitov says:

====================
pull-request: bpf 2019-07-25

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) fix segfault in libbpf, from Andrii.

2) fix gso_segs access, from Eric.

3) tls/sockmap fixes, from Jakub and John.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-25 17:35:03 -07:00
John Hurley
6749d59016 net: sched: include mpls actions in hardware intermediate representation
A recent addition to TC actions is the ability to manipulate the MPLS
headers on packets.

In preparation to offload such actions to hardware, update the IR code to
accept and prepare the new actions.

Note that no driver currently impliments the MPLS dec_ttl action so this
is not included.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-23 13:52:50 -07:00
Maciej Żenczykowski
66b5f1c439 net-ipv6-ndisc: add support for RFC7710 RA Captive Portal Identifier
This is trivial since we already have support for the entirely
identical (from the kernel's point of view) RDNSS and DNSSL that
also contain opaque data that needs to be passed down to userspace.

As specified in RFC7710, Captive Portal option contains a URL.
8-bit identifier of the option type as assigned by the IANA is 37.
This option should also be treated as userland.

Hence, treat ND option 37 as userland (Captive Portal support)

See:
  https://tools.ietf.org/html/rfc7710
  https://www.iana.org/assignments/icmpv6-parameters/icmpv6-parameters.xhtml

Fixes: e35f30c131
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Cc: Lorenzo Colitti <lorenzo@google.com>
Cc: Remin Nguyen Van <reminv@google.com>
Cc: Alexey I. Froloff <raorn@raorn.name>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-22 12:10:54 -07:00
John Fastabend
95fa145479 bpf: sockmap/tls, close can race with map free
When a map free is called and in parallel a socket is closed we
have two paths that can potentially reset the socket prot ops, the
bpf close() path and the map free path. This creates a problem
with which prot ops should be used from the socket closed side.

If the map_free side completes first then we want to call the
original lowest level ops. However, if the tls path runs first
we want to call the sockmap ops. Additionally there was no locking
around prot updates in TLS code paths so the prot ops could
be changed multiple times once from TLS path and again from sockmap
side potentially leaving ops pointed at either TLS or sockmap
when psock and/or tls context have already been destroyed.

To fix this race first only update ops inside callback lock
so that TLS, sockmap and lowest level all agree on prot state.
Second and a ULP callback update() so that lower layers can
inform the upper layer when they are being removed allowing the
upper layer to reset prot ops.

This gets us close to allowing sockmap and tls to be stacked
in arbitrary order but will save that patch for *next trees.

v4:
 - make sure we don't free things for device;
 - remove the checks which swap the callbacks back
   only if TLS is at the top.

Reported-by: syzbot+06537213db7ba2745c4a@syzkaller.appspotmail.com
Fixes: 02c558b2d5 ("bpf: sockmap, support for msg_peek in sk_msg with redirect ingress")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-22 16:04:17 +02:00
John Fastabend
32857cf57f net/tls: fix transition through disconnect with close
It is possible (via shutdown()) for TCP socks to go through TCP_CLOSE
state via tcp_disconnect() without actually calling tcp_close which
would then call the tls close callback. Because of this a user could
disconnect a socket then put it in a LISTEN state which would break
our assumptions about sockets always being ESTABLISHED state.

More directly because close() can call unhash() and unhash is
implemented by sockmap if a sockmap socket has TLS enabled we can
incorrectly destroy the psock from unhash() and then call its close
handler again. But because the psock (sockmap socket representation)
is already destroyed we call close handler in sk->prot. However,
in some cases (TLS BASE/BASE case) this will still point at the
sockmap close handler resulting in a circular call and crash reported
by syzbot.

To fix both above issues implement the unhash() routine for TLS.

v4:
 - add note about tls offload still needing the fix;
 - move sk_proto to the cold cache line;
 - split TX context free into "release" and "free",
   otherwise the GC work itself is in already freed
   memory;
 - more TX before RX for consistency;
 - reuse tls_ctx_free();
 - schedule the GC work after we're done with context
   to avoid UAF;
 - don't set the unhash in all modes, all modes "inherit"
   TLS_BASE's callbacks anyway;
 - disable the unhash hook for TLS_HW.

Fixes: 3c4d755915 ("tls: kernel TLS support")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-22 16:04:17 +02:00
John Fastabend
313ab00480 net/tls: remove sock unlock/lock around strp_done()
The tls close() callback currently drops the sock lock to call
strp_done(). Split up the RX cleanup into stopping the strparser
and releasing most resources, syncing strparser and finally
freeing the context.

To avoid the need for a strp_done() call on the cleanup path
of device offload make sure we don't arm the strparser until
we are sure init will be successful.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-22 16:04:16 +02:00
John Fastabend
f87e62d45e net/tls: remove close callback sock unlock/lock around TX work flush
The tls close() callback currently drops the sock lock, makes a
cancel_delayed_work_sync() call, and then relocks the sock.

By restructuring the code we can avoid droping lock and then
reclaiming it. To simplify this we do the following,

 tls_sk_proto_close
 set_bit(CLOSING)
 set_bit(SCHEDULE)
 cancel_delay_work_sync() <- cancel workqueue
 lock_sock(sk)
 ...
 release_sock(sk)
 strp_done()

Setting the CLOSING bit prevents the SCHEDULE bit from being
cleared by any workqueue items e.g. if one happens to be
scheduled and run between when we set SCHEDULE bit and cancel
work. Then because SCHEDULE bit is set now no new work will
be scheduled.

Tested with net selftests and bpf selftests.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-22 16:04:16 +02:00
Jakub Kicinski
318892ac06 net/tls: don't arm strparser immediately in tls_set_sw_offload()
In tls_set_device_offload_rx() we prepare the software context
for RX fallback and proceed to add the connection to the device.
Unfortunately, software context prep includes arming strparser
so in case of a later error we have to release the socket lock
to call strp_done().

In preparation for not releasing the socket lock half way through
callbacks move arming strparser into a separate function.
Following patches will make use of that.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-22 16:04:16 +02:00
Eric Dumazet
b617158dc0 tcp: be more careful in tcp_fragment()
Some applications set tiny SO_SNDBUF values and expect
TCP to just work. Recent patches to address CVE-2019-11478
broke them in case of losses, since retransmits might
be prevented.

We should allow these flows to make progress.

This patch allows the first and last skb in retransmit queue
to be split even if memory limits are hit.

It also adds the some room due to the fact that tcp_sendmsg()
and tcp_sendpage() might overshoot sk_wmem_queued by about one full
TSO skb (64KB size). Note this allowance was already present
in stable backports for kernels < 4.15

Note for < 4.15 backports :
 tcp_rtx_queue_tail() will probably look like :

static inline struct sk_buff *tcp_rtx_queue_tail(const struct sock *sk)
{
	struct sk_buff *skb = tcp_send_head(sk);

	return skb ? tcp_write_queue_prev(sk, skb) : tcp_write_queue_tail(sk);
}

Fixes: f070ef2ac6 ("tcp: tcp_fragment() should apply sane memory limits")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Andrew Prout <aprout@ll.mit.edu>
Tested-by: Andrew Prout <aprout@ll.mit.edu>
Tested-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Tested-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Christoph Paasch <cpaasch@apple.com>
Cc: Jonathan Looney <jtl@netflix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-21 20:41:24 -07:00
Johannes Berg
91046d6364 nl80211: fix VENDOR_CMD_RAW_DATA
Since ERR_PTR() is an inline, not a macro, just open-code it
here so it's usable as an initializer, fixing the build in
brcmfmac.

Reported-by: Arend Van Spriel <arend.vanspriel@broadcom.com>
Fixes: 901bb98918 ("nl80211: require and validate vendor command policy")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-07-20 21:37:32 +02:00
Pablo Neira Ayuso
14bfb13f0e net: flow_offload: add flow_block structure and use it
This object stores the flow block callbacks that are attached to this
block. Update flow_block_cb_lookup() to take this new object.

This patch restores the block sharing feature.

Fixes: da3eeb904f ("net: flow_offload: add list handling functions")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-19 21:27:45 -07:00
Pablo Neira Ayuso
a732331151 net: flow_offload: rename tc_setup_cb_t to flow_setup_cb_t
Rename this type definition and adapt users.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-19 21:27:45 -07:00
Pablo Neira Ayuso
0c7294ddae net: flow_offload: remove netns parameter from flow_block_cb_alloc()
No need to annotate the netns on the flow block callback object,
flow_block_cb_is_busy() already checks for used blocks.

Fixes: d63db30c85 ("net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-19 21:27:45 -07:00
David S. Miller
9a2f97bb8d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Fix a deadlock when module is requested via netlink_bind()
   in nfnetlink, from Florian Westphal.

2) Fix ipt_rpfilter and ip6t_rpfilter with VRF, from Miaohe Lin.

3) Skip master comparison in SIP helper to fix expectation clash
   under two valid scenarios, from xiao ruizhu.

4) Remove obsolete comments in nf_conntrack codebase, from
   Yonatan Goldschmidt.

5) Fix redirect extension module autoload, from Christian Hesse.

6) Fix incorrect mssg option sent to client in synproxy,
   from Fernando Fernandez.

7) Fix incorrect window calculations in TCP conntrack, from
   Florian Westphal.

8) Don't bail out when updating basechain policy due to recent
   offload works, also from Florian.

9) Allow symhash to use modulus 1 as other hash extensions do,
   from Laura.Garcia.

10) Missing NAT chain module autoload for the inet family,
    from Phil Sutter.

11) Fix missing adjustment of TCP RST packet in synproxy,
    from Fernando Fernandez.

12) Skip EAGAIN path when nft_meta_bridge is built-in or
    not selected.

13) Conntrack bridge does not depend on nf_tables_bridge.

14) Turn NF_TABLES_BRIDGE into tristate to fix possible
    link break of nft_meta_bridge, from Arnd Bergmann.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-19 21:25:10 -07:00
Eric Dumazet
8d650cdeda tcp: fix tcp_set_congestion_control() use from bpf hook
Neal reported incorrect use of ns_capable() from bpf hook.

bpf_setsockopt(...TCP_CONGESTION...)
  -> tcp_set_congestion_control()
   -> ns_capable(sock_net(sk)->user_ns, CAP_NET_ADMIN)
    -> ns_capable_common()
     -> current_cred()
      -> rcu_dereference_protected(current->cred, 1)

Accessing 'current' in bpf context makes no sense, since packets
are processed from softirq context.

As Neal stated : The capability check in tcp_set_congestion_control()
was written assuming a system call context, and then was reused from
a BPF call site.

The fix is to add a new parameter to tcp_set_congestion_control(),
so that the ns_capable() call is only performed under the right
context.

Fixes: 91b5b21c7c ("bpf: Add support for changing congestion control")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Lawrence Brakmo <brakmo@fb.com>
Reported-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Lawrence Brakmo <brakmo@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-18 20:33:48 -07:00
Nicolas Dichtel
22d6552f82 xfrm interface: fix management of phydev
With the current implementation, phydev cannot be removed:

$ ip link add dummy type dummy
$ ip link add xfrm1 type xfrm dev dummy if_id 1
$ ip l d dummy
 kernel:[77938.465445] unregister_netdevice: waiting for dummy to become free. Usage count = 1

Manage it like in ip tunnels, ie just keep the ifindex. Not that the side
effect, is that the phydev is now optional.

Fixes: f203b76d78 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Tested-by: Julien Floret <julien.floret@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-07-17 10:03:54 +02:00
Nicolas Dichtel
e0aaa332e6 xfrm interface: ifname may be wrong in logs
The ifname is copied when the interface is created, but is never updated
later. In fact, this property is used only in one error message, where the
netdevice pointer is available, thus let's use it.

Fixes: f203b76d78 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-07-17 10:03:54 +02:00
Fernando Fernandez Mancera
b83329fb47 netfilter: synproxy: fix erroneous tcp mss option
Now synproxy sends the mss value set by the user on client syn-ack packet
instead of the mss value that client announced.

Fixes: 48b1de4c11 ("netfilter: add SYNPROXY core/target")
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-16 13:17:01 +02:00
xiao ruizhu
3c00fb0bf0 netfilter: nf_conntrack_sip: fix expectation clash
When conntracks change during a dialog, SDP messages may be sent from
different conntracks to establish expects with identical tuples. In this
case expects conflict may be detected for the 2nd SDP message and end up
with a process failure.

The fixing here is to reuse an existing expect who has the same tuple for a
different conntrack if any.

Here are two scenarios for the case.

1)
         SERVER                   CPE

           |      INVITE SDP       |
      5060 |<----------------------|5060
           |      100 Trying       |
      5060 |---------------------->|5060
           |      183 SDP          |
      5060 |---------------------->|5060    ===> Conntrack 1
           |       PRACK           |
     50601 |<----------------------|5060
           |    200 OK (PRACK)     |
     50601 |---------------------->|5060
           |    200 OK (INVITE)    |
      5060 |---------------------->|5060
           |        ACK            |
     50601 |<----------------------|5060
           |                       |
           |<--- RTP stream ------>|
           |                       |
           |    INVITE SDP (t38)   |
     50601 |---------------------->|5060    ===> Conntrack 2

With a certain configuration in the CPE, SIP messages "183 with SDP" and
"re-INVITE with SDP t38" will go through the sip helper to create
expects for RTP and RTCP.

It is okay to create RTP and RTCP expects for "183", whose master
connection source port is 5060, and destination port is 5060.

In the "183" message, port in Contact header changes to 50601 (from the
original 5060). So the following requests e.g. PRACK and ACK are sent to
port 50601. It is a different conntrack (let call Conntrack 2) from the
original INVITE (let call Conntrack 1) due to the port difference.

In this example, after the call is established, there is RTP stream but no
RTCP stream for Conntrack 1, so the RTP expect created upon "183" is
cleared, and RTCP expect created for Conntrack 1 retains.

When "re-INVITE with SDP t38" arrives to create RTP&RTCP expects, current
ALG implementation will call nf_ct_expect_related() for RTP and RTCP. The
expects tuples are identical to those for Conntrack 1. RTP expect for
Conntrack 2 succeeds in creation as the one for Conntrack 1 has been
removed. RTCP expect for Conntrack 2 fails in creation because it has
idential tuples and 'conflict' with the one retained for Conntrack 1. And
then result in a failure in processing of the re-INVITE.

2)

    SERVER A                 CPE

       |      REGISTER     |
  5060 |<------------------| 5060  ==> CT1
       |       200         |
  5060 |------------------>| 5060
       |                   |
       |   INVITE SDP(1)   |
  5060 |<------------------| 5060
       | 300(multi choice) |
  5060 |------------------>| 5060                    SERVER B
       |       ACK         |
  5060 |<------------------| 5060
                                  |    INVITE SDP(2)    |
                             5060 |-------------------->| 5060  ==> CT2
                                  |       100           |
                             5060 |<--------------------| 5060
                                  | 200(contact changes)|
                             5060 |<--------------------| 5060
                                  |       ACK           |
                             5060 |-------------------->| 50601 ==> CT3
                                  |                     |
                                  |<--- RTP stream ---->|
                                  |                     |
                                  |       BYE           |
                             5060 |<--------------------| 50601
                                  |       200           |
                             5060 |-------------------->| 50601
       |   INVITE SDP(3)   |
  5060 |<------------------| 5060  ==> CT1

CPE sends an INVITE request(1) to Server A, and creates a RTP&RTCP expect
pair for this Conntrack 1 (CT1). Server A responds 300 to redirect to
Server B. The RTP&RTCP expect pairs created on CT1 are removed upon 300
response.

CPE sends the INVITE request(2) to Server B, and creates an expect pair
for the new conntrack (due to destination address difference), let call
CT2. Server B changes the port to 50601 in 200 OK response, and the
following requests ACK and BYE from CPE are sent to 50601. The call is
established. There is RTP stream and no RTCP stream. So RTP expect is
removed and RTCP expect for CT2 retains.

As BYE request is sent from port 50601, it is another conntrack, let call
CT3, different from CT2 due to the port difference. So the BYE request will
not remove the RTCP expect for CT2.

Then another outgoing call is made, with the same RTP port being used (not
definitely but possibly). CPE firstly sends the INVITE request(3) to Server
A, and tries to create a RTP&RTCP expect pairs for this CT1. In current ALG
implementation, the RTCP expect for CT1 fails in creation because it
'conflicts' with the residual one for CT2. As a result the INVITE request
fails to send.

Signed-off-by: xiao ruizhu <katrina.xiaorz@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-16 13:16:59 +02:00
Vlad Buslov
c1a970d06f net: sched: Fix NULL-pointer dereference in tc_indr_block_ing_cmd()
After recent refactoring of block offlads infrastructure, indr_dev->block
pointer is dereferenced before it is verified to be non-NULL. Example stack
trace where this behavior leads to NULL-pointer dereference error when
creating vxlan dev on system with mlx5 NIC with offloads enabled:

[ 1157.852938] ==================================================================
[ 1157.866877] BUG: KASAN: null-ptr-deref in tc_indr_block_ing_cmd.isra.41+0x9c/0x160
[ 1157.880877] Read of size 4 at addr 0000000000000090 by task ip/3829
[ 1157.901637] CPU: 22 PID: 3829 Comm: ip Not tainted 5.2.0-rc6+ #488
[ 1157.914438] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[ 1157.929031] Call Trace:
[ 1157.938318]  dump_stack+0x9a/0xeb
[ 1157.948362]  ? tc_indr_block_ing_cmd.isra.41+0x9c/0x160
[ 1157.960262]  ? tc_indr_block_ing_cmd.isra.41+0x9c/0x160
[ 1157.972082]  __kasan_report+0x176/0x192
[ 1157.982513]  ? tc_indr_block_ing_cmd.isra.41+0x9c/0x160
[ 1157.994348]  kasan_report+0xe/0x20
[ 1158.004324]  tc_indr_block_ing_cmd.isra.41+0x9c/0x160
[ 1158.015950]  ? tcf_block_setup+0x430/0x430
[ 1158.026558]  ? kasan_unpoison_shadow+0x30/0x40
[ 1158.037464]  __tc_indr_block_cb_register+0x5f5/0xf20
[ 1158.049288]  ? mlx5e_rep_indr_tc_block_unbind+0xa0/0xa0 [mlx5_core]
[ 1158.062344]  ? tc_indr_block_dev_put.part.47+0x5c0/0x5c0
[ 1158.074498]  ? rdma_roce_rescan_device+0x20/0x20 [ib_core]
[ 1158.086580]  ? br_device_event+0x98/0x480 [bridge]
[ 1158.097870]  ? strcmp+0x30/0x50
[ 1158.107578]  mlx5e_nic_rep_netdevice_event+0xdd/0x180 [mlx5_core]
[ 1158.120212]  notifier_call_chain+0x6d/0xa0
[ 1158.130753]  register_netdevice+0x6fc/0x7e0
[ 1158.141322]  ? netdev_change_features+0xa0/0xa0
[ 1158.152218]  ? vxlan_config_apply+0x210/0x310 [vxlan]
[ 1158.163593]  __vxlan_dev_create+0x2ad/0x520 [vxlan]
[ 1158.174770]  ? vxlan_changelink+0x490/0x490 [vxlan]
[ 1158.185870]  ? rcu_read_unlock+0x60/0x60 [vxlan]
[ 1158.196798]  vxlan_newlink+0x99/0xf0 [vxlan]
[ 1158.207303]  ? __vxlan_dev_create+0x520/0x520 [vxlan]
[ 1158.218601]  ? rtnl_create_link+0x3d0/0x450
[ 1158.228900]  __rtnl_newlink+0x8a7/0xb00
[ 1158.238701]  ? stack_access_ok+0x35/0x80
[ 1158.248450]  ? rtnl_link_unregister+0x1a0/0x1a0
[ 1158.258735]  ? find_held_lock+0x6d/0xd0
[ 1158.268379]  ? is_bpf_text_address+0x67/0xf0
[ 1158.278330]  ? lock_acquire+0xc1/0x1f0
[ 1158.287686]  ? is_bpf_text_address+0x5/0xf0
[ 1158.297449]  ? is_bpf_text_address+0x86/0xf0
[ 1158.307310]  ? kernel_text_address+0xec/0x100
[ 1158.317155]  ? arch_stack_walk+0x92/0xe0
[ 1158.326497]  ? __kernel_text_address+0xe/0x30
[ 1158.336213]  ? unwind_get_return_address+0x2f/0x50
[ 1158.346267]  ? create_prof_cpu_mask+0x20/0x20
[ 1158.355936]  ? arch_stack_walk+0x92/0xe0
[ 1158.365117]  ? stack_trace_save+0x8a/0xb0
[ 1158.374272]  ? stack_trace_consume_entry+0x80/0x80
[ 1158.384226]  ? match_held_lock+0x33/0x210
[ 1158.393216]  ? kasan_unpoison_shadow+0x30/0x40
[ 1158.402593]  rtnl_newlink+0x53/0x80
[ 1158.410925]  rtnetlink_rcv_msg+0x3a5/0x600
[ 1158.419777]  ? validate_linkmsg+0x400/0x400
[ 1158.428620]  ? find_held_lock+0x6d/0xd0
[ 1158.437117]  ? match_held_lock+0x1b/0x210
[ 1158.445760]  ? validate_linkmsg+0x400/0x400
[ 1158.454642]  netlink_rcv_skb+0xc7/0x1f0
[ 1158.463150]  ? netlink_ack+0x470/0x470
[ 1158.471538]  ? netlink_deliver_tap+0x1f3/0x5a0
[ 1158.480607]  netlink_unicast+0x2ae/0x350
[ 1158.489099]  ? netlink_attachskb+0x340/0x340
[ 1158.497935]  ? _copy_from_iter_full+0xde/0x3b0
[ 1158.506945]  ? __virt_addr_valid+0xb6/0xf0
[ 1158.515578]  ? __check_object_size+0x159/0x240
[ 1158.524515]  netlink_sendmsg+0x4d3/0x630
[ 1158.532879]  ? netlink_unicast+0x350/0x350
[ 1158.541400]  ? netlink_unicast+0x350/0x350
[ 1158.549805]  sock_sendmsg+0x94/0xa0
[ 1158.557561]  ___sys_sendmsg+0x49d/0x570
[ 1158.565625]  ? copy_msghdr_from_user+0x210/0x210
[ 1158.574457]  ? __fput+0x1e2/0x330
[ 1158.581948]  ? __kasan_slab_free+0x130/0x180
[ 1158.590407]  ? kmem_cache_free+0xb6/0x2d0
[ 1158.598574]  ? mark_lock+0xc7/0x790
[ 1158.606177]  ? task_work_run+0xcf/0x100
[ 1158.614165]  ? exit_to_usermode_loop+0x102/0x110
[ 1158.622954]  ? __lock_acquire+0x963/0x1ee0
[ 1158.631199]  ? lockdep_hardirqs_on+0x260/0x260
[ 1158.639777]  ? match_held_lock+0x1b/0x210
[ 1158.647918]  ? lockdep_hardirqs_on+0x260/0x260
[ 1158.656501]  ? match_held_lock+0x1b/0x210
[ 1158.664643]  ? __fget_light+0xa6/0xe0
[ 1158.672423]  ? __sys_sendmsg+0xd2/0x150
[ 1158.680334]  __sys_sendmsg+0xd2/0x150
[ 1158.688063]  ? __ia32_sys_shutdown+0x30/0x30
[ 1158.696435]  ? lock_downgrade+0x2e0/0x2e0
[ 1158.704541]  ? mark_held_locks+0x1a/0x90
[ 1158.712611]  ? mark_held_locks+0x1a/0x90
[ 1158.720619]  ? do_syscall_64+0x1e/0x2c0
[ 1158.728530]  do_syscall_64+0x78/0x2c0
[ 1158.736254]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1158.745414] RIP: 0033:0x7f62d505cb87
[ 1158.753070] Code: 64 89 02 48 c7 c0 ff ff ff ff eb b9 0f 1f 80 00 00 00 00 8b 05 6a 2b 2c 00 48 63 d2 48 63 ff 85 c0 75 18 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 59 f3 c3 0f 1f 80 00 00[87/1817]
 48 89 f3 48
[ 1158.780924] RSP: 002b:00007fffd9832268 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 1158.793204] RAX: ffffffffffffffda RBX: 000000005d26048f RCX: 00007f62d505cb87
[ 1158.805111] RDX: 0000000000000000 RSI: 00007fffd98322d0 RDI: 0000000000000003
[ 1158.817055] RBP: 0000000000000000 R08: 0000000000000001 R09: 0000000000000006
[ 1158.828987] R10: 00007f62d50ce260 R11: 0000000000000246 R12: 0000000000000001
[ 1158.840909] R13: 000000000067e540 R14: 0000000000000000 R15: 000000000067ed20
[ 1158.852873] ==================================================================

Introduce new function tcf_block_non_null_shared() that verifies block
pointer before dereferencing it to obtain index. Use the function in
tc_indr_block_ing_cmd() to prevent NULL pointer dereference.

Fixes: 955bcb6ea0 ("drivers: net: use flow block API")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-12 15:21:53 -07:00
Petar Penkov
63f9ba1bf8 net: fib_rules: do not flow dissect local packets
Rules matching on loopback iif do not need early flow dissection as the
packet originates from the host. Stop counting such rules in
fib_rule_requires_fldissect

Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-11 14:22:53 -07:00
Linus Torvalds
237f83dfbe Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Some highlights from this development cycle:

   1) Big refactoring of ipv6 route and neigh handling to support
      nexthop objects configurable as units from userspace. From David
      Ahern.

   2) Convert explored_states in BPF verifier into a hash table,
      significantly decreased state held for programs with bpf2bpf
      calls, from Alexei Starovoitov.

   3) Implement bpf_send_signal() helper, from Yonghong Song.

   4) Various classifier enhancements to mvpp2 driver, from Maxime
      Chevallier.

   5) Add aRFS support to hns3 driver, from Jian Shen.

   6) Fix use after free in inet frags by allocating fqdirs dynamically
      and reworking how rhashtable dismantle occurs, from Eric Dumazet.

   7) Add act_ctinfo packet classifier action, from Kevin
      Darbyshire-Bryant.

   8) Add TFO key backup infrastructure, from Jason Baron.

   9) Remove several old and unused ISDN drivers, from Arnd Bergmann.

  10) Add devlink notifications for flash update status to mlxsw driver,
      from Jiri Pirko.

  11) Lots of kTLS offload infrastructure fixes, from Jakub Kicinski.

  12) Add support for mv88e6250 DSA chips, from Rasmus Villemoes.

  13) Various enhancements to ipv6 flow label handling, from Eric
      Dumazet and Willem de Bruijn.

  14) Support TLS offload in nfp driver, from Jakub Kicinski, Dirk van
      der Merwe, and others.

  15) Various improvements to axienet driver including converting it to
      phylink, from Robert Hancock.

  16) Add PTP support to sja1105 DSA driver, from Vladimir Oltean.

  17) Add mqprio qdisc offload support to dpaa2-eth, from Ioana
      Radulescu.

  18) Add devlink health reporting to mlx5, from Moshe Shemesh.

  19) Convert stmmac over to phylink, from Jose Abreu.

  20) Add PTP PHC (Physical Hardware Clock) support to mlxsw, from
      Shalom Toledo.

  21) Add nftables SYNPROXY support, from Fernando Fernandez Mancera.

  22) Convert tcp_fastopen over to use SipHash, from Ard Biesheuvel.

  23) Track spill/fill of constants in BPF verifier, from Alexei
      Starovoitov.

  24) Support bounded loops in BPF, from Alexei Starovoitov.

  25) Various page_pool API fixes and improvements, from Jesper Dangaard
      Brouer.

  26) Just like ipv4, support ref-countless ipv6 route handling. From
      Wei Wang.

  27) Support VLAN offloading in aquantia driver, from Igor Russkikh.

  28) Add AF_XDP zero-copy support to mlx5, from Maxim Mikityanskiy.

  29) Add flower GRE encap/decap support to nfp driver, from Pieter
      Jansen van Vuuren.

  30) Protect against stack overflow when using act_mirred, from John
      Hurley.

  31) Allow devmap map lookups from eBPF, from Toke Høiland-Jørgensen.

  32) Use page_pool API in netsec driver, Ilias Apalodimas.

  33) Add Google gve network driver, from Catherine Sullivan.

  34) More indirect call avoidance, from Paolo Abeni.

  35) Add kTLS TX HW offload support to mlx5, from Tariq Toukan.

  36) Add XDP_REDIRECT support to bnxt_en, from Andy Gospodarek.

  37) Add MPLS manipulation actions to TC, from John Hurley.

  38) Add sending a packet to connection tracking from TC actions, and
      then allow flower classifier matching on conntrack state. From
      Paul Blakey.

  39) Netfilter hw offload support, from Pablo Neira Ayuso"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (2080 commits)
  net/mlx5e: Return in default case statement in tx_post_resync_params
  mlx5: Return -EINVAL when WARN_ON_ONCE triggers in mlx5e_tls_resync().
  net: dsa: add support for BRIDGE_MROUTER attribute
  pkt_sched: Include const.h
  net: netsec: remove static declaration for netsec_set_tx_de()
  net: netsec: remove superfluous if statement
  netfilter: nf_tables: add hardware offload support
  net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload
  net: flow_offload: add flow_block_cb_is_busy() and use it
  net: sched: remove tcf block API
  drivers: net: use flow block API
  net: sched: use flow block API
  net: flow_offload: add flow_block_cb_{priv, incref, decref}()
  net: flow_offload: add list handling functions
  net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()
  net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*
  net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND
  net: flow_offload: add flow_block_cb_setup_simple()
  net: hisilicon: Add an tx_desc to adapt HI13X1_GMAC
  net: hisilicon: Add an rx_desc to adapt HI13X1_GMAC
  ...
2019-07-11 10:55:49 -07:00
Pablo Neira Ayuso
c9626a2cbd netfilter: nf_tables: add hardware offload support
This patch adds hardware offload support for nftables through the
existing netdev_ops->ndo_setup_tc() interface, the TC_SETUP_CLSFLOWER
classifier and the flow rule API. This hardware offload support is
available for the NFPROTO_NETDEV family and the ingress hook.

Each nftables expression has a new ->offload interface, that is used to
populate the flow rule object that is attached to the transaction
object.

There is a new per-table NFT_TABLE_F_HW flag, that is set on to offload
an entire table, including all of its chains.

This patch supports for basic metadata (layer 3 and 4 protocol numbers),
5-tuple payload matching and the accept/drop actions; this also includes
basechain hardware offload only.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09 14:38:51 -07:00
Pablo Neira Ayuso
f9e30088d2 net: flow_offload: rename tc_cls_flower_offload to flow_cls_offload
And any other existing fields in this structure that refer to tc.
Specifically:

* tc_cls_flower_offload_flow_rule() to flow_cls_offload_flow_rule().
* TC_CLSFLOWER_* to FLOW_CLS_*.
* tc_cls_common_offload to tc_cls_common_offload.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09 14:38:51 -07:00
Pablo Neira Ayuso
0d4fd02e71 net: flow_offload: add flow_block_cb_is_busy() and use it
This patch adds a function to check if flow block callback is already in
use.  Call this new function from flow_block_cb_setup_simple() and from
drivers.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09 14:38:50 -07:00
Pablo Neira Ayuso
722d36e6e2 net: sched: remove tcf block API
Unused, now replaced by flow block API.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09 14:38:50 -07:00
Pablo Neira Ayuso
955bcb6ea0 drivers: net: use flow block API
This patch updates flow_block_cb_setup_simple() to use the flow block API.
Several drivers are also adjusted to use it.

This patch introduces the per-driver list of flow blocks to account for
blocks that are already in use.

Remove tc_block_offload alias.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09 14:38:50 -07:00
Pablo Neira Ayuso
67bd0d5ea7 net: flow_offload: add flow_block_cb_{priv, incref, decref}()
This patch completes the flow block API to introduce:

* flow_block_cb_priv() to access callback private data.
* flow_block_cb_incref() to bump reference counter on this flow block.
* flow_block_cb_decref() to decrement the reference counter.

These functions are taken from the existing tcf_block_cb_priv(),
tcf_block_cb_incref() and tcf_block_cb_decref().

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09 14:38:50 -07:00
Pablo Neira Ayuso
da3eeb904f net: flow_offload: add list handling functions
This patch adds the list handling functions for the flow block API:

* flow_block_cb_lookup() allows drivers to look up for existing flow blocks.
* flow_block_cb_add() adds a flow block to the per driver list to be registered
  by the core.
* flow_block_cb_remove() to remove a flow block from the list of existing
  flow blocks per driver and to request the core to unregister this.

The flow block API also annotates the netns this flow block belongs to.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09 14:38:50 -07:00
Pablo Neira Ayuso
d63db30c85 net: flow_offload: add flow_block_cb_alloc() and flow_block_cb_free()
Add a new helper function to allocate flow_block_cb objects.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09 14:38:50 -07:00
Pablo Neira Ayuso
32f8c4093a net: flow_offload: rename TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_*
Rename from TCF_BLOCK_BINDER_TYPE_* to FLOW_BLOCK_BINDER_TYPE_* and
remove temporary tcf_block_binder_type alias.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09 14:38:50 -07:00
Pablo Neira Ayuso
9c0e189ec9 net: flow_offload: rename TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND
Rename from TC_BLOCK_{UN}BIND to FLOW_BLOCK_{UN}BIND and remove
temporary tc_block_command alias.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09 14:38:50 -07:00
Pablo Neira Ayuso
4e95bc268b net: flow_offload: add flow_block_cb_setup_simple()
Most drivers do the same thing to set up the flow block callbacks, this
patch adds a helper function to do this.

This preparation patch reduces the number of changes to adapt the
existing drivers to use the flow block callback API.

This new helper function takes a flow block list per-driver, which is
set to NULL until this driver list is used.

This patch also introduces the flow_block_command and
flow_block_binder_type enumerations, which are renamed to use
FLOW_BLOCK_* in follow up patches.

There are three definitions (aliases) in order to reduce the number of
updates in this patch, which go away once drivers are fully adapted to
use this flow block API.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09 14:38:50 -07:00
Paul Blakey
75a56758d6 net/flow_dissector: add connection tracking dissection
Retreives connection tracking zone, mark, label, and state from
a SKB.

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09 12:11:59 -07:00
Paul Blakey
b57dc7c13e net/sched: Introduce action ct
Allow sending a packet to conntrack module for connection tracking.

The packet will be marked with conntrack connection's state, and
any metadata such as conntrack mark and label. This state metadata
can later be matched against with tc classifers, for example with the
flower classifier as below.

In addition to committing new connections the user can optionally
specific a zone to track within, set a mark/label and configure nat
with an address range and port range.

Usage is as follows:
$ tc qdisc add dev ens1f0_0 ingress
$ tc qdisc add dev ens1f0_1 ingress

$ tc filter add dev ens1f0_0 ingress \
  prio 1 chain 0 proto ip \
  flower ip_proto tcp ct_state -trk \
  action ct zone 2 pipe \
  action goto chain 2
$ tc filter add dev ens1f0_0 ingress \
  prio 1 chain 2 proto ip \
  flower ct_state +trk+new \
  action ct zone 2 commit mark 0xbb nat src addr 5.5.5.7 pipe \
  action mirred egress redirect dev ens1f0_1
$ tc filter add dev ens1f0_0 ingress \
  prio 1 chain 2 proto ip \
  flower ct_zone 2 ct_mark 0xbb ct_state +trk+est \
  action ct nat pipe \
  action mirred egress redirect dev ens1f0_1

$ tc filter add dev ens1f0_1 ingress \
  prio 1 chain 0 proto ip \
  flower ip_proto tcp ct_state -trk \
  action ct zone 2 pipe \
  action goto chain 1
$ tc filter add dev ens1f0_1 ingress \
  prio 1 chain 1 proto ip \
  flower ct_zone 2 ct_mark 0xbb ct_state +trk+est \
  action ct nat pipe \
  action mirred egress redirect dev ens1f0_0

Signed-off-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Yossi Kuperman <yossiku@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>

Changelog:
V5->V6:
	Added CONFIG_NF_DEFRAG_IPV6 in handle fragments ipv6 case
V4->V5:
	Reordered nf_conntrack_put() in tcf_ct_skb_nfct_cached()
V3->V4:
	Added strict_start_type for act_ct policy
V2->V3:
	Fixed david's comments: Removed extra newline after rcu in tcf_ct_params , and indent of break in act_ct.c
V1->V2:
	Fixed parsing of ranges TCA_CT_NAT_IPV6_MAX as 'else' case overwritten ipv4 max
	Refactored NAT_PORT_MIN_MAX range handling as well
	Added ipv4/ipv6 defragmentation
	Removed extra skb pull push of nw offset in exectute nat
	Refactored tcf_ct_skb_network_trim after pull
	Removed TCA_ACT_CT define

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09 12:11:59 -07:00
Parav Pandit
e41b6bf3cd devlink: Introduce PCI VF port flavour and port attribute
In an eswitch, PCI VF may have port which is normally represented using
a representor netdevice.
To have better visibility of eswitch port, its association with VF,
and its representor netdevice, introduce a PCI VF port flavour.

When devlink port flavour is PCI VF, fill up PCI VF attributes of
the port.

Extend port name creation using PCI PF and VF number scheme on best
effort basis, so that vendor drivers can skip defining their own scheme.

$ devlink port show
pci/0000:05:00.0/0: type eth netdev eth0 flavour pcipf pfnum 0
pci/0000:05:00.0/1: type eth netdev eth1 flavour pcivf pfnum 0 vfnum 0
pci/0000:05:00.0/2: type eth netdev eth2 flavour pcivf pfnum 0 vfnum 1

Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09 12:02:13 -07:00
Parav Pandit
98fd2d6563 devlink: Introduce PCI PF port flavour and port attribute
In an eswitch, PCI PF may have port which is normally represented
using a representor netdevice.
To have better visibility of eswitch port, its association with
PF and a representor netdevice, introduce a PCI PF port
flavour and port attriute.

When devlink port flavour is PCI PF, fill up PCI PF attributes of the
port.

Extend port name creation using PCI PF number on best effort basis.
So that vendor drivers can skip defining their own scheme.

$ devlink port show
pci/0000:05:00.0/0: type eth netdev eth0 flavour pcipf pfnum 0

Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09 12:02:13 -07:00
Parav Pandit
378ef01b5f devlink: Refactor physical port attributes
To support additional devlink port flavours and to support few common
and few different port attributes, move physical port attributes to a
different structure.

Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-09 12:02:13 -07:00
Dirk van der Merwe
b5d9a834f4 net/tls: don't clear TX resync flag on error
Introduce a return code for the tls_dev_resync callback.

When the driver TX resync fails, kernel can retry the resync again
until it succeeds.  This prevents drivers from attempting to offload
TLS packets if the connection is known to be out of sync.

We don't worry about the RX resync since they will be retried naturally
as more encrypted records get received.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 20:21:09 -07:00
Xin Long
e55f4b8bf4 sctp: rename sp strm_interleave to ep intl_enable
Like other endpoint features, strm_interleave should be moved to
sctp_endpoint and renamed to intl_enable.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 20:16:25 -07:00
Xin Long
da1f6d4de7 sctp: rename asoc intl_enable to asoc peer.intl_capable
To keep consistent with other asoc features, we move intl_enable
to peer.intl_capable in asoc.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 20:16:25 -07:00
Xin Long
1c13475368 sctp: remove prsctp_enable from asoc
Like reconf_enable, prsctp_enable should also be removed from asoc,
as asoc->peer.prsctp_capable has taken its job.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 20:16:24 -07:00
Xin Long
a96701fb35 sctp: remove reconf_enable from asoc
asoc's reconf support is actually decided by the 4-shakehand negotiation,
not something that users can set by sockopt. asoc->peer.reconf_capable is
working for this. So remove it from asoc.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 20:16:24 -07:00
John Hurley
2a2ea50870 net: sched: add mpls manipulation actions to TC
Currently, TC offers the ability to match on the MPLS fields of a packet
through the use of the flow_dissector_key_mpls struct. However, as yet, TC
actions do not allow the modification or manipulation of such fields.

Add a new module that registers TC action ops to allow manipulation of
MPLS. This includes the ability to push and pop headers as well as modify
the contents of new or existing headers. A further action to decrement the
TTL field of an MPLS header is also provided with a new helper added to
support this.

Examples of the usage of the new action with flower rules to push and pop
MPLS labels are:

tc filter add dev eth0 protocol ip parent ffff: flower \
    action mpls push protocol mpls_uc label 123  \
    action mirred egress redirect dev eth1

tc filter add dev eth0 protocol mpls_uc parent ffff: flower \
    action mpls pop protocol ipv4  \
    action mirred egress redirect dev eth1

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 19:50:13 -07:00
David S. Miller
af144a9834 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two cases of overlapping changes, nothing fancy.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 19:48:57 -07:00
Willem de Bruijn
59c820b231 ipv6: elide flowlabel check if no exclusive leases exist
Processes can request ipv6 flowlabels with cmsg IPV6_FLOWINFO.
If not set, by default an autogenerated flowlabel is selected.

Explicit flowlabels require a control operation per label plus a
datapath check on every connection (every datagram if unconnected).
This is particularly expensive on unconnected sockets multiplexing
many flows, such as QUIC.

In the common case, where no lease is exclusive, the check can be
safely elided, as both lease request and check trivially succeed.
Indeed, autoflowlabel does the same even with exclusive leases.

Elide the check if no process has requested an exclusive lease.

fl6_sock_lookup previously returns either a reference to a lease or
NULL to denote failure. Modify to return a real error and update
all callers. On return NULL, they can use the label and will elide
the atomic_dec in fl6_sock_release.

This is an optimization. Robust applications still have to revert to
requesting leases if the fast path fails due to an exclusive lease.

Changes RFC->v1:
  - use static_key_false_deferred to rate limit jump label operations
    - call static_key_deferred_flush to stop timers on exit
  - move decrement out of RCU context
  - defer optimization also if opt data is associated with a lease
  - updated all fp6_sock_lookup callers, not just udp

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 19:38:03 -07:00
Linus Torvalds
c84ca912b0 Keyrings namespacing
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXRU89Pu3V2unywtrAQIdBBAAmMBsrfv+LUN4Vru/D6KdUO4zdYGcNK6m
 S56bcNfP6oIDEj6HrNNnzKkWIZpdZ61Odv1zle96+v4WZ/6rnLCTpcsdaFNTzaoO
 YT2jk7jplss0ImrMv1DSoykGqO3f0ThMIpGCxHKZADGSu0HMbjSEh+zLPV4BaMtT
 BVuF7P3eZtDRLdDtMtYcgvf5UlbdoBEY8w1FUjReQx8hKGxVopGmCo5vAeiY8W9S
 ybFSZhPS5ka33ynVrLJH2dqDo5A8pDhY8I4bdlcxmNtRhnPCYZnuvTqeAzyUKKdI
 YN9zJeDu1yHs9mi8dp45NPJiKy6xLzWmUwqH8AvR8MWEkrwzqbzNZCEHZ41j74hO
 YZWI0JXi72cboszFvOwqJERvITKxrQQyVQLPRQE2vVbG0bIZPl8i7oslFVhitsl+
 evWqHb4lXY91rI9cC6JIXR1OiUjp68zXPv7DAnxv08O+PGcioU1IeOvPivx8QSx4
 5aUeCkYIIAti/GISzv7xvcYh8mfO76kBjZSB35fX+R9DkeQpxsHmmpWe+UCykzWn
 EwhHQn86+VeBFP6RAXp8CgNCLbrwkEhjzXQl/70s1eYbwvK81VcpDAQ6+cjpf4Hb
 QUmrUJ9iE0wCNl7oqvJZoJvWVGlArvPmzpkTJk3N070X2R0T7x1WCsMlPDMJGhQ2
 fVHvA3QdgWs=
 =Push
 -----END PGP SIGNATURE-----

Merge tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull keyring namespacing from David Howells:
 "These patches help make keys and keyrings more namespace aware.

  Firstly some miscellaneous patches to make the process easier:

   - Simplify key index_key handling so that the word-sized chunks
     assoc_array requires don't have to be shifted about, making it
     easier to add more bits into the key.

   - Cache the hash value in the key so that we don't have to calculate
     on every key we examine during a search (it involves a bunch of
     multiplications).

   - Allow keying_search() to search non-recursively.

  Then the main patches:

   - Make it so that keyring names are per-user_namespace from the point
     of view of KEYCTL_JOIN_SESSION_KEYRING so that they're not
     accessible cross-user_namespace.

     keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEYRING_NAME for this.

   - Move the user and user-session keyrings to the user_namespace
     rather than the user_struct. This prevents them propagating
     directly across user_namespaces boundaries (ie. the KEY_SPEC_*
     flags will only pick from the current user_namespace).

   - Make it possible to include the target namespace in which the key
     shall operate in the index_key. This will allow the possibility of
     multiple keys with the same description, but different target
     domains to be held in the same keyring.

     keyctl_capabilities() shows KEYCTL_CAPS1_NS_KEY_TAG for this.

   - Make it so that keys are implicitly invalidated by removal of a
     domain tag, causing them to be garbage collected.

   - Institute a network namespace domain tag that allows keys to be
     differentiated by the network namespace in which they operate. New
     keys that are of a type marked 'KEY_TYPE_NET_DOMAIN' are assigned
     the network domain in force when they are created.

   - Make it so that the desired network namespace can be handed down
     into the request_key() mechanism. This allows AFS, NFS, etc. to
     request keys specific to the network namespace of the superblock.

     This also means that the keys in the DNS record cache are
     thenceforth namespaced, provided network filesystems pass the
     appropriate network namespace down into dns_query().

     For DNS, AFS and NFS are good, whilst CIFS and Ceph are not. Other
     cache keyrings, such as idmapper keyrings, also need to set the
     domain tag - for which they need access to the network namespace of
     the superblock"

* tag 'keys-namespace-20190627' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  keys: Pass the network namespace into request_key mechanism
  keys: Network namespace domain tag
  keys: Garbage collect keys for which the domain has been removed
  keys: Include target namespace in match criteria
  keys: Move the user and user-session keyrings to the user_namespace
  keys: Namespace keyring names
  keys: Add a 'recurse' flag for keyring searches
  keys: Cache the hash value to avoid lots of recalculation
  keys: Simplify key description management
2019-07-08 19:36:47 -07:00
Al Viro
333f7909a8 coallocate socket_wq with socket itself
socket->wq is assign-once, set when we are initializing both
struct socket it's in and struct socket_wq it points to.  As the
matter of fact, the only reason for separate allocation was the
ability to RCU-delay freeing of socket_wq.  RCU-delaying the
freeing of socket itself gets rid of that need, so we can just
fold struct socket_wq into the end of struct socket and simplify
the life both for sock_alloc_inode() (one allocation instead of
two) and for tun/tap oddballs, where we used to embed struct socket
and struct socket_wq into the same structure (now - embedding just
the struct socket).

Note that reference to struct socket_wq in struct sock does remain
a reference - that's unchanged.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 19:25:19 -07:00
David S. Miller
17ccf9e31e Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-07-09

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Lots of libbpf improvements: i) addition of new APIs to attach BPF
   programs to tracing entities such as {k,u}probes or tracepoints,
   ii) improve specification of BTF-defined maps by eliminating the
   need for data initialization for some of the members, iii) addition
   of a high-level API for setting up and polling perf buffers for
   BPF event output helpers, all from Andrii.

2) Add "prog run" subcommand to bpftool in order to test-run programs
   through the kernel testing infrastructure of BPF, from Quentin.

3) Improve verifier for BPF sockaddr programs to support 8-byte stores
   for user_ip6 and msg_src_ip6 members given clang tends to generate
   such stores, from Stanislav.

4) Enable the new BPF JIT zero-extension optimization for further
   riscv64 ALU ops, from Luke.

5) Fix a bpftool json JIT dump crash on powerpc, from Jiri.

6) Fix an AF_XDP race in generic XDP's receive path, from Ilya.

7) Various smaller fixes from Ilya, Yue and Arnd.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 19:14:38 -07:00
Ilya Maximets
bf0bdd1343 xdp: fix race on generic receive path
Unlike driver mode, generic xdp receive could be triggered
by different threads on different CPU cores at the same time
leading to the fill and rx queue breakage. For example, this
could happen while sending packets from two processes to the
first interface of veth pair while the second part of it is
open with AF_XDP socket.

Need to take a lock for each generic receive to avoid race.

Fixes: c497176cb2 ("xsk: add Rx receive functions and poll support")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Tested-by: William Tu <u9012063@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-09 01:43:26 +02:00
Ivan Khoronzhuk
1da4bbeffe net: core: page_pool: add user refcnt and reintroduce page_pool_destroy
Jesper recently removed page_pool_destroy() (from driver invocation)
and moved shutdown and free of page_pool into xdp_rxq_info_unreg(),
in-order to handle in-flight packets/pages. This created an asymmetry
in drivers create/destroy pairs.

This patch reintroduce page_pool_destroy and add page_pool user
refcnt. This serves the purpose to simplify drivers error handling as
driver now drivers always calls page_pool_destroy() and don't need to
track if xdp_rxq_info_reg_mem_model() was unsuccessful.

This could be used for a special cases where a single RX-queue (with a
single page_pool) provides packets for two net_device'es, and thus
needs to register the same page_pool twice with two xdp_rxq_info
structures.

This patch is primarily to ease API usage for drivers. The recently
merged netsec driver, actually have a bug in this area, which is
solved by this API change.

This patch is a modified version of Ivan Khoronzhuk's original patch.

Link: https://lore.kernel.org/netdev/20190625175948.24771-2-ivan.khoronzhuk@linaro.org/
Fixes: 5c67bf0ec4 ("net: netsec: Use page_pool API")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Saeed Mahameed <saeedm@mellanox.com>
Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 14:58:04 -07:00
David S. Miller
47cfb90406 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next:

1) Move bridge keys in nft_meta to nft_meta_bridge, from wenxu.

2) Support for bridge pvid matching, from wenxu.

3) Support for bridge vlan protocol matching, also from wenxu.

4) Add br_vlan_get_pvid_rcu(), to fetch the bridge port pvid
   from packet path.

5) Prefer specific family extension in nf_tables.

6) Autoload specific family extension in case it is missing.

7) Add synproxy support to nf_tables, from Fernando Fernandez Mancera.

8) Support for GRE encapsulation in IPVS, from Vadim Fedorenko.

9) ICMP handling for GRE encapsulation, from Julian Anastasov.

10) Remove unused parameter in nf_queue, from Florian Westphal.

11) Replace seq_printf() by seq_puts() in nf_log, from Markus Elfring.

12) Rename nf_SYNPROXY.h => nf_synproxy.h before this header becomes
    public.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-08 12:13:38 -07:00
Linus Torvalds
927ba67a63 Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Thomas Gleixner:
 "The timer and timekeeping departement delivers:

  Core:

   - The consolidation of the VDSO code into a generic library including
     the conversion of x86 and ARM64. Conversion of ARM and MIPS are en
     route through the relevant maintainer trees and should end up in
     5.4.

     This gets rid of the unnecessary different copies of the same code
     and brings all architectures on the same level of VDSO
     functionality.

   - Make the NTP user space interface more robust by restricting the
     TAI offset to prevent undefined behaviour. Includes a selftest.

   - Validate user input in the compat settimeofday() syscall to catch
     invalid values which would be turned into valid values by a
     multiplication overflow

   - Consolidate the time accessors

   - Small fixes, improvements and cleanups all over the place

  Drivers:

   - Support for the NXP system counter, TI davinci timer

   - Move the Microsoft HyperV clocksource/events code into the
     drivers/clocksource directory so it can be shared between x86 and
     ARM64.

   - Overhaul of the Tegra driver

   - Delay timer support for IXP4xx

   - Small fixes, improvements and cleanups as usual"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (71 commits)
  time: Validate user input in compat_settimeofday()
  timer: Document TIMER_PINNED
  clocksource/drivers: Continue making Hyper-V clocksource ISA agnostic
  clocksource/drivers: Make Hyper-V clocksource ISA agnostic
  MAINTAINERS: Fix Andy's surname and the directory entries of VDSO
  hrtimer: Use a bullet for the returns bullet list
  arm64: vdso: Fix compilation with clang older than 8
  arm64: compat: Fix __arch_get_hw_counter() implementation
  arm64: Fix __arch_get_hw_counter() implementation
  lib/vdso: Make delta calculation work correctly
  MAINTAINERS: Add entry for the generic VDSO library
  arm64: compat: No need for pre-ARMv7 barriers on an ARMv8 system
  arm64: vdso: Remove unnecessary asm-offsets.c definitions
  vdso: Remove superfluous #ifdef __KERNEL__ in vdso/datapage.h
  clocksource/drivers/davinci: Add support for clocksource
  clocksource/drivers/davinci: Add support for clockevents
  clocksource/drivers/tegra: Set up maximum-ticks limit properly
  clocksource/drivers/tegra: Cycles can't be 0
  clocksource/drivers/tegra: Restore base address before cleanup
  clocksource/drivers/tegra: Add verbose definition for 1MHz constant
  ...
2019-07-08 11:06:29 -07:00
Arnd Bergmann
bef8e26392 bpf: avoid unused variable warning in tcp_bpf_rtt()
When CONFIG_BPF is disabled, we get a warning for an unused
variable:

In file included from drivers/target/target_core_device.c:26:
include/net/tcp.h:2226:19: error: unused variable 'tp' [-Werror,-Wunused-variable]
        struct tcp_sock *tp = tcp_sk(sk);

The variable is only used in one place, so it can be
replaced with its value there to avoid the warning.

Fixes: 23729ff231 ("bpf: add BPF_CGROUP_SOCK_OPS callback that is executed on every RTT")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-08 17:18:03 +02:00
Spoorthi Ravishankar Koppad
302975cba1 Bluetooth: Add support for LE ping feature
Changes made to add HCI Write Authenticated Payload timeout
command for LE Ping feature.

As per the Core Specification 5.0 Volume 2 Part E Section 7.3.94,
the following code changes implements
HCI Write Authenticated Payload timeout command for LE Ping feature.

Signed-off-by: Spoorthi Ravishankar Koppad <spoorthix.k@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2019-07-06 15:29:12 +02:00
David S. Miller
e3b60ffbc1 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2019-07-05

1) A lot of work to remove indirections from the xfrm code.
   From Florian Westphal.

2) Fix a WARN_ON with ipv6 that triggered because of a
   forgotten break statement. From Florian Westphal.

3)  Remove xfrmi_init_net, it is not needed.
    From Li RongQing.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-05 15:01:15 -07:00
wenxu
30e103fe24 netfilter: nft_meta: move bridge meta keys into nft_meta_bridge
Separate bridge meta key from nft_meta to meta_bridge to avoid a
dependency between the bridge module and nft_meta when using the bridge
API available through include/linux/if_bridge.h

Signed-off-by: wenxu <wenxu@ucloud.cn>
Reviewed-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-05 21:34:47 +02:00
Fernando Fernandez Mancera
ad49d86e07 netfilter: nf_tables: Add synproxy support
Add synproxy support for nf_tables. This behaves like the iptables
synproxy target but it is structured in a way that allows us to propose
improvements in the future.

Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-05 21:34:23 +02:00
David S. Miller
c4cde5804d Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-07-03

The following pull-request contains BPF updates for your *net-next* tree.

There is a minor merge conflict in mlx5 due to 8960b38932 ("linux/dim:
Rename externally used net_dim members") which has been pulled into your
tree in the meantime, but resolution seems not that bad ... getting current
bpf-next out now before there's coming more on mlx5. ;) I'm Cc'ing Saeed
just so he's aware of the resolution below:

** First conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:

  <<<<<<< HEAD
  static int mlx5e_open_cq(struct mlx5e_channel *c,
                           struct dim_cq_moder moder,
                           struct mlx5e_cq_param *param,
                           struct mlx5e_cq *cq)
  =======
  int mlx5e_open_cq(struct mlx5e_channel *c, struct net_dim_cq_moder moder,
                    struct mlx5e_cq_param *param, struct mlx5e_cq *cq)
  >>>>>>> e5a3e259ef

Resolution is to take the second chunk and rename net_dim_cq_moder into
dim_cq_moder. Also the signature for mlx5e_open_cq() in ...

  drivers/net/ethernet/mellanox/mlx5/core/en.h +977

... and in mlx5e_open_xsk() ...

  drivers/net/ethernet/mellanox/mlx5/core/en/xsk/setup.c +64

... needs the same rename from net_dim_cq_moder into dim_cq_moder.

** Second conflict in drivers/net/ethernet/mellanox/mlx5/core/en_main.c:

  <<<<<<< HEAD
          int cpu = cpumask_first(mlx5_comp_irq_get_affinity_mask(priv->mdev, ix));
          struct dim_cq_moder icocq_moder = {0, 0};
          struct net_device *netdev = priv->netdev;
          struct mlx5e_channel *c;
          unsigned int irq;
  =======
          struct net_dim_cq_moder icocq_moder = {0, 0};
  >>>>>>> e5a3e259ef

Take the second chunk and rename net_dim_cq_moder into dim_cq_moder
as well.

Let me know if you run into any issues. Anyway, the main changes are:

1) Long-awaited AF_XDP support for mlx5e driver, from Maxim.

2) Addition of two new per-cgroup BPF hooks for getsockopt and
   setsockopt along with a new sockopt program type which allows more
   fine-grained pass/reject settings for containers. Also add a sock_ops
   callback that can be selectively enabled on a per-socket basis and is
   executed for every RTT to help tracking TCP statistics, both features
   from Stanislav.

3) Follow-up fix from loops in precision tracking which was not propagating
   precision marks and as a result verifier assumed that some branches were
   not taken and therefore wrongly removed as dead code, from Alexei.

4) Fix BPF cgroup release synchronization race which could lead to a
   double-free if a leaf's cgroup_bpf object is released and a new BPF
   program is attached to the one of ancestor cgroups in parallel, from Roman.

5) Support for bulking XDP_TX on veth devices which improves performance
   in some cases by around 9%, from Toshiaki.

6) Allow for lookups into BPF devmap and improve feedback when calling into
   bpf_redirect_map() as lookup is now performed right away in the helper
   itself, from Toke.

7) Add support for fq's Earliest Departure Time to the Host Bandwidth
   Manager (HBM) sample BPF program, from Lawrence.

8) Various cleanups and minor fixes all over the place from many others.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-04 12:48:21 -07:00
Vincent Bernat
07a4ddec3c bonding: add an option to specify a delay between peer notifications
Currently, gratuitous ARP/ND packets are sent every `miimon'
milliseconds. This commit allows a user to specify a custom delay
through a new option, `peer_notif_delay'.

Like for `updelay' and `downdelay', this delay should be a multiple of
`miimon' to avoid managing an additional work queue. The configuration
logic is copied from `updelay' and `downdelay'. However, the default
value cannot be set using a module parameter: Netlink or sysfs should
be used to configure this feature.

When setting `miimon' to 100 and `peer_notif_delay' to 500, we can
observe the 500 ms delay is respected:

    20:30:19.354693 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
    20:30:19.874892 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
    20:30:20.394919 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28
    20:30:20.914963 ARP, Request who-has 203.0.113.10 tell 203.0.113.10, length 28

In bond_mii_monitor(), I have tried to keep the lock logic readable.
The change is due to the fact we cannot rely on a notification to
lower the value of `bond->send_peer_notif' as `NETDEV_NOTIFY_PEERS' is
only triggered once every N times, while we need to decrement the
counter each time.

iproute2 also needs to be updated to be able to specify this new
attribute through `ip link'.

Signed-off-by: Vincent Bernat <vincent@bernat.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-04 12:30:48 -07:00
Florian Westphal
0d9cb300ac netfilter: nf_queue: remove unused hook entries pointer
Its not used anywhere, so remove this.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-07-04 02:29:49 +02:00
Paolo Abeni
e473093639 inet: factor out inet_send_prepare()
The same code is replicated verbatim in multiple places, and the next
patches will introduce an additional user for it. Factor out a
helper and use it where appropriate. No functional change intended.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 13:51:54 -07:00
David S. Miller
c3ead2df97 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-07-03

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix the interpreter to properly handle BPF_ALU32 | BPF_ARSH
   on BE architectures, from Jiong.

2) Fix several bugs in the x32 BPF JIT for handling shifts by 0,
   from Luke and Xi.

3) Fix NULL pointer deref in btf_type_is_resolve_source_only(),
   from Stanislav.

4) Properly handle the check that forwarding is enabled on the device
   in bpf_ipv6_fib_lookup() helper code, from Anton.

5) Fix UAPI bpf_prog_info fields alignment for archs that have 16 bit
   alignment such as m68k, from Baruch.

6) Fix kernel hanging in unregister_netdevice loop while unregistering
   device bound to XDP socket, from Ilya.

7) Properly terminate tail update in xskq_produce_flush_desc(), from Nathan.

8) Fix broken always_inline handling in test_lwt_seg6local, from Jiri.

9) Fix bpftool to use correct argument in cgroup errors, from Jakub.

10) Fix detaching dummy prog in XDP redirect sample code, from Prashant.

11) Add Jonathan to AF_XDP reviewers, from Björn.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-03 12:09:00 -07:00
Stanislav Fomichev
23729ff231 bpf: add BPF_CGROUP_SOCK_OPS callback that is executed on every RTT
Performance impact should be minimal because it's under a new
BPF_SOCK_OPS_RTT_CB_FLAG flag that has to be explicitly enabled.

Suggested-by: Eric Dumazet <edumazet@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Priyaranjan Jha <priyarjha@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03 16:52:01 +02:00
Ilya Maximets
455302d1c9 xdp: fix hang while unregistering device bound to xdp socket
Device that bound to XDP socket will not have zero refcount until the
userspace application will not close it. This leads to hang inside
'netdev_wait_allrefs()' if device unregistering requested:

  # ip link del p1
  < hang on recvmsg on netlink socket >

  # ps -x | grep ip
  5126  pts/0    D+   0:00 ip link del p1

  # journalctl -b

  Jun 05 07:19:16 kernel:
  unregister_netdevice: waiting for p1 to become free. Usage count = 1

  Jun 05 07:19:27 kernel:
  unregister_netdevice: waiting for p1 to become free. Usage count = 1
  ...

Fix that by implementing NETDEV_UNREGISTER event notification handler
to properly clean up all the resources and unref device.

This should also allow socket killing via ss(8) utility.

Fixes: 965a990984 ("xsk: add support for bind for Rx")
Signed-off-by: Ilya Maximets <i.maximets@samsung.com>
Acked-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-07-03 15:10:55 +02:00
Thomas Gleixner
3419240495 Merge branch 'timers/vdso' into timers/core
so the hyper-v clocksource update can be applied.
2019-07-03 10:50:21 +02:00
Ilias Apalodimas
bb005f2a70 net: page_pool: add helper function for retrieving dma direction
Since the dma direction is stored in page pool params, offer an API
helper for driver that choose not to keep track of it locally

Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-01 19:27:08 -07:00
Jakub Kicinski
acd3e96d53 net/tls: make sure offload also gets the keys wiped
Commit 86029d10af ("tls: zero the crypto information from tls_context
before freeing") added memzero_explicit() calls to clear the key material
before freeing struct tls_context, but it missed tls_device.c has its
own way of freeing this structure. Replace the missing free.

Fixes: 86029d10af ("tls: zero the crypto information from tls_context before freeing")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-01 19:22:36 -07:00
Vandana BN
88405680ec net:gue.h:Fix shifting signed 32-bit value by 31 bits problem
Fix GUE_PFLAG_REMCSUM to use "U" cast to avoid shifting signed
32-bit value by 31 bits problem.

Signed-off-by: Vandana BN <bnvandana@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-01 10:58:23 -07:00
Eric Dumazet
a346abe051 ipv6: icmp: allow flowlabel reflection in echo replies
Extend flowlabel_reflect bitmask to allow conditional
reflection of incoming flowlabels in echo replies.

Note this has precedence against auto flowlabels.

Add flowlabel_reflect enum to replace hard coded
values.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-01 10:54:51 -07:00
Vandana BN
40f6a2cb9c net: dst.h: Fix shifting signed 32-bit value by 31 bits problem
Fix DST_FEATURE_ECN_CA to use "U" cast to avoid shifting signed
32-bit value by 31 bits problem.

Signed-off-by: Vandana BN <bnvandana@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-07-01 10:48:34 -07:00
Florian Westphal
c7b37c769d xfrm: remove get_mtu indirection from xfrm_type
esp4_get_mtu and esp6_get_mtu are exactly the same, the only difference
is a single sizeof() (ipv4 vs. ipv6 header).

Merge both into xfrm_state_mtu() and remove the indirection.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-07-01 06:16:40 +02:00
David S. Miller
954a5a0294 mlx5e-updates-2019-06-28
This series adds some misc updates for mlx5e driver
 
 1) Allow adding the same mac more than once in MPFS table
 2) Move to HW checksumming advertising
 3) Report netdevice MPLS features
 4) Correct physical port name of the PF representor
 5) Reduce stack usage in mlx5_eswitch_termtbl_create
 6) Refresh TIR improvement for representors
 7) Expose same physical switch_id for all representors
 -----BEGIN PGP SIGNATURE-----
 
 iQEzBAABCAAdFiEEGhZs6bAKwk/OTgTpSD+KveBX+j4FAl0WnOAACgkQSD+KveBX
 +j6xQwgAuarc5NCi8jg9m+yXFXj/kr1cPwZ6Pxadi8hAd3NbEMGtutdFZIXsIXZZ
 y0uYxzkrCqXUUh2ZnpE09YlBISLyVSt3WGdqgJn1wel//O33gS6GpYXwVGOfgL7n
 +d9FCYvrun6v6aOHM5ZGxQ+qxHzupz5k4F0r7fz2Gsd+JEgL58nwc8ERSbOTbZMO
 TLO1pcxlXWwGSqd5uc4AHi8hZTvuzWl/Fm5hOTP9gx/Sl3UaYWa3WiTgj5uOD5Zt
 956Xqk0LLwSaiKVAsFjIa7HHWOaDLnVmbmTUzhv82hharqmvPW1CIWlx1gv01KoP
 wMWoyyoy7cTyyrWNozWEed/14LyEVA==
 =S4bJ
 -----END PGP SIGNATURE-----

Merge tag 'mlx5e-updates-2019-06-28' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5e-updates-2019-06-28

This series adds some misc updates for mlx5e driver

1) Allow adding the same mac more than once in MPFS table
2) Move to HW checksumming advertising
3) Report netdevice MPLS features
4) Correct physical port name of the PF representor
5) Reduce stack usage in mlx5_eswitch_termtbl_create
6) Refresh TIR improvement for representors
7) Expose same physical switch_id for all representors
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-30 18:41:13 -07:00
Florian Westphal
b60a77386b net: make skb_dst_force return true when dst is refcounted
netfilter did not expect that skb_dst_force() can cause skb to lose its
dst entry.

I got a bug report with a skb->dst NULL dereference in netfilter
output path.  The backtrace contains nf_reinject(), so the dst might have
been cleared when skb got queued to userspace.

Other users were fixed via
if (skb_dst(skb)) {
	skb_dst_force(skb);
	if (!skb_dst(skb))
		goto handle_err;
}

But I think its preferable to make the 'dst might be cleared' part
of the function explicit.

In netfilter case, skb with a null dst is expected when queueing in
prerouting hook, so drop skb for the other hooks.

v2:
 v1 of this patch returned true in case skb had no dst entry.
 Eric said:
   Say if we have two skb_dst_force() calls for some reason
   on the same skb, only the first one will return false.

 This now returns false even when skb had no dst, as per Erics
 suggestion, so callers might need to check skb_dst() first before
 skb_dst_force().

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-29 11:01:35 -07:00
Saeed Mahameed
4f5d1beadc Merge branch 'mlx5-next' of git://git.kernel.org/pub/scm/linux/kernel/git/mellanox/linux
Misc updates from mlx5-next branch:

1) E-Switch vport metadata support for source vport matching
2) Convert mkey_table to XArray
3) Shared IRQs and to use single IRQ for all async EQs

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-06-28 16:03:54 -07:00
John Hurley
720f22fed8 net: sched: refactor reinsert action
The TC_ACT_REINSERT return type was added as an in-kernel only option to
allow a packet ingress or egress redirect. This is used to avoid
unnecessary skb clones in situations where they are not required. If a TC
hook returns this code then the packet is 'reinserted' and no skb consume
is carried out as no clone took place.

This return type is only used in act_mirred. Rather than have the reinsert
called from the main datapath, call it directly in act_mirred. Instead of
returning TC_ACT_REINSERT, change the type to the new TC_ACT_CONSUMED
which tells the caller that the packet has been stolen by another process
and that no consume call is required.

Moving all redirect calls to the act_mirred code is in preparation for
tracking recursion created by act_mirred.

Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28 14:36:25 -07:00
David S. Miller
7c3d310d8f Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter fixes for net:

1) Fix memleak reported by syzkaller when registering IPVS hooks,
   patch from Julian Anastasov.

2) Fix memory leak in start_sync_thread, also from Julian.

3) Fix conntrack deletion via ctnetlink, from Felix Kaechele.

4) Fix reject for ICMP due to incorrect checksum handling, from
   He Zhe.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-28 13:36:43 -07:00
David S. Miller
d96ff269a0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The new route handling in ip_mc_finish_output() from 'net' overlapped
with the new support for returning congestion notifications from BPF
programs.

In order to handle this I had to take the dev_loopback_xmit() calls
out of the switch statement.

The aquantia driver conflicts were simple overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-27 21:06:39 -07:00
Maxim Mikityanskiy
4bce4e5cb6 xsk: Return the whole xdp_desc from xsk_umem_consume_tx
Some drivers want to access the data transmitted in order to implement
acceleration features of the NICs. It is also useful in AF_XDP TX flow.

Change the xsk_umem_consume_tx API to return the whole xdp_desc, that
contains the data pointer, length and DMA address, instead of only the
latter two. Adapt the implementation of i40e and ixgbe to this change.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Cc: Björn Töpel <bjorn.topel@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27 22:53:27 +02:00
Maxim Mikityanskiy
d57d76428a xsk: Add API to check for available entries in FQ
Add a function that checks whether the Fill Ring has the specified
amount of descriptors available. It will be useful for mlx5e that wants
to check in advance, whether it can allocate a bulk of RX descriptors,
to get the best performance.

Signed-off-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-27 22:53:26 +02:00
Nicolas Dichtel
9b1c1ef13b ipv6: constify rt6_nexthop()
There is no functional change in this patch, it only prepares the next one.

rt6_nexthop() will be used by ip6_dst_lookup_neigh(), which uses const
variables.

Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Reported-by: kbuild test robot <lkp@intel.com>
Acked-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-26 13:26:08 -07:00
David Howells
9b24261051 keys: Network namespace domain tag
Create key domain tags for network namespaces and make it possible to
automatically tag keys that are used by networked services (e.g. AF_RXRPC,
AFS, DNS) with the default network namespace if not set by the caller.

This allows keys with the same description but in different namespaces to
coexist within a keyring.

Signed-off-by: David Howells <dhowells@redhat.com>
cc: netdev@vger.kernel.org
cc: linux-nfs@vger.kernel.org
cc: linux-cifs@vger.kernel.org
cc: linux-afs@lists.infradead.org
2019-06-26 21:02:33 +01:00
Stephen Suryaputra
5b18f12898 ipv4: reset rt_iif for recirculated mcast/bcast out pkts
Multicast or broadcast egress packets have rt_iif set to the oif. These
packets might be recirculated back as input and lookup to the raw
sockets may fail because they are bound to the incoming interface
(skb_iif). If rt_iif is not zero, during the lookup, inet_iif() function
returns rt_iif instead of skb_iif. Hence, the lookup fails.

v2: Make it non vrf specific (David Ahern). Reword the changelog to
    reflect it.
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-26 12:40:10 -07:00
Julian Anastasov
5db7c8b9f9 ipvs: fix tinfo memory leak in start_sync_thread
syzkaller reports for memory leak in start_sync_thread [1]

As Eric points out, kthread may start and stop before the
threadfn function is called, so there is no chance the
data (tinfo in our case) to be released in thread.

Fix this by releasing tinfo in the controlling code instead.

[1]
BUG: memory leak
unreferenced object 0xffff8881206bf700 (size 32):
 comm "syz-executor761", pid 7268, jiffies 4294943441 (age 20.470s)
 hex dump (first 32 bytes):
   00 40 7c 09 81 88 ff ff 80 45 b8 21 81 88 ff ff  .@|......E.!....
   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
 backtrace:
   [<0000000057619e23>] kmemleak_alloc_recursive include/linux/kmemleak.h:55 [inline]
   [<0000000057619e23>] slab_post_alloc_hook mm/slab.h:439 [inline]
   [<0000000057619e23>] slab_alloc mm/slab.c:3326 [inline]
   [<0000000057619e23>] kmem_cache_alloc_trace+0x13d/0x280 mm/slab.c:3553
   [<0000000086ce5479>] kmalloc include/linux/slab.h:547 [inline]
   [<0000000086ce5479>] start_sync_thread+0x5d2/0xe10 net/netfilter/ipvs/ip_vs_sync.c:1862
   [<000000001a9229cc>] do_ip_vs_set_ctl+0x4c5/0x780 net/netfilter/ipvs/ip_vs_ctl.c:2402
   [<00000000ece457c8>] nf_sockopt net/netfilter/nf_sockopt.c:106 [inline]
   [<00000000ece457c8>] nf_setsockopt+0x4c/0x80 net/netfilter/nf_sockopt.c:115
   [<00000000942f62d4>] ip_setsockopt net/ipv4/ip_sockglue.c:1258 [inline]
   [<00000000942f62d4>] ip_setsockopt+0x9b/0xb0 net/ipv4/ip_sockglue.c:1238
   [<00000000a56a8ffd>] udp_setsockopt+0x4e/0x90 net/ipv4/udp.c:2616
   [<00000000fa895401>] sock_common_setsockopt+0x38/0x50 net/core/sock.c:3130
   [<0000000095eef4cf>] __sys_setsockopt+0x98/0x120 net/socket.c:2078
   [<000000009747cf88>] __do_sys_setsockopt net/socket.c:2089 [inline]
   [<000000009747cf88>] __se_sys_setsockopt net/socket.c:2086 [inline]
   [<000000009747cf88>] __x64_sys_setsockopt+0x26/0x30 net/socket.c:2086
   [<00000000ded8ba80>] do_syscall_64+0x76/0x1a0 arch/x86/entry/common.c:301
   [<00000000893b4ac8>] entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reported-by: syzbot+7e2e50c8adfccd2e5041@syzkaller.appspotmail.com
Suggested-by: Eric Biggers <ebiggers@kernel.org>
Fixes: 998e7a7680 ("ipvs: Use kthread_run() instead of doing a double-fork via kernel_thread()")
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-25 02:32:27 +02:00
Pablo Neira Ayuso
1c5ba67d22 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Resolve conflict between d2912cb15b ("treewide: Replace GPLv2
boilerplate/reference with SPDX - rule 500") removing the GPL disclaimer
and fe03d47456 ("Update my email address") which updates Jozsef
Kadlecsik's email.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-25 01:32:59 +02:00
Stefano Brivio
1e47b4837f ipv6: Dump route exceptions if requested
Since commit 2b760fcf5c ("ipv6: hook up exception table to store dst
cache"), route exceptions reside in a separate hash table, and won't be
found by walking the FIB, so they won't be dumped to userspace on a
RTM_GETROUTE message.

This causes 'ip -6 route list cache' and 'ip -6 route flush cache' to
have no function anymore:

 # ip -6 route get fc00:3::1
 fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 539sec mtu 1400 pref medium
 # ip -6 route get fc00:4::1
 fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 536sec mtu 1500 pref medium
 # ip -6 route list cache
 # ip -6 route flush cache
 # ip -6 route get fc00:3::1
 fc00:3::1 via fc00:1::2 dev veth_A-R1 src fc00:1::1 metric 1024 expires 520sec mtu 1400 pref medium
 # ip -6 route get fc00:4::1
 fc00:4::1 via fc00:2::2 dev veth_A-R2 src fc00:2::1 metric 1024 expires 519sec mtu 1500 pref medium

because iproute2 lists cached routes using RTM_GETROUTE, and flushes them
by listing all the routes, and deleting them with RTM_DELROUTE one by one.

If cached routes are requested using the RTM_F_CLONED flag together with
strict checking, or if no strict checking is requested (and hence we can't
consistently apply filters), look up exceptions in the hash table
associated with the current fib6_info in rt6_dump_route(), and, if present
and not expired, add them to the dump.

We might be unable to dump all the entries for a given node in a single
message, so keep track of how many entries were handled for the current
node in fib6_walker, and skip that amount in case we start from the same
partially dumped node.

When a partial dump restarts, as the starting node might change when
'sernum' changes, we have no guarantee that we need to skip the same
amount of in-node entries. Therefore, we need two counters, and we need to
zero the in-node counter if the node from which the dump is resumed
differs.

Note that, with the current version of iproute2, this only fixes the
'ip -6 route list cache': on a flush command, iproute2 doesn't pass
RTM_F_CLONED and, due to this inconsistency, 'ip -6 route flush cache' is
still unable to fetch the routes to be flushed. This will be addressed in
a patch for iproute2.

To flush cached routes, a procfs entry could be introduced instead: that's
how it works for IPv4. We already have a rt6_flush_exception() function
ready to be wired to it. However, this would not solve the issue for
listing.

Versions of iproute2 and kernel tested:

                    iproute2
kernel             4.14.0   4.15.0   4.19.0   5.0.0   5.1.0    5.1.0, patched
 3.18    list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.4     list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.9     list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.14    list        +        +        +        +       +            +
         flush       +        +        +        +       +            +
 4.15    list
         flush
 4.19    list
         flush
 5.0     list
         flush
 5.1     list
         flush
 with    list        +        +        +        +       +            +
 fix     flush       +        +        +                             +

v7:
  - Explain usage of "skip" counters in commit message (suggested by
    David Ahern)

v6:
  - Rebase onto net-next, use recently introduced nexthop walker
  - Make rt6_nh_dump_exceptions() a separate function (suggested by David
    Ahern)

v5:
  - Use dump_routes and dump_exceptions from filter, ignore NLM_F_MATCH,
    update test results (flushing works with iproute2 < 5.0.0 now)

v4:
  - Split NLM_F_MATCH and strict check handling in separate patches
  - Filter routes using RTM_F_CLONED: if it's not set, only return
    non-cached routes, and if it's set, only return cached routes:
    change requested by David Ahern and Martin Lau. This implies that
    iproute2 needs a separate patch to be able to flush IPv6 cached
    routes. This is not ideal because we can't fix the breakage caused
    by 2b760fcf5c entirely in kernel. However, two years have passed
    since then, and this makes it more tolerable

v3:
  - More descriptive comment about expired exceptions in rt6_dump_route()
  - Swap return values of rt6_dump_route() (suggested by Martin Lau)
  - Don't zero skip_in_node in case we don't dump anything in a given pass
    (also suggested by Martin Lau)
  - Remove check on RTM_F_CLONED altogether: in the current UAPI semantic,
    it's just a flag to indicate the route was cloned, not to filter on
    routes

v2: Add tracking of number of entries to be skipped in current node after
    a partial dump. As we restart from the same node, if not all the
    exceptions for a given node fit in a single message, the dump will
    not terminate, as suggested by Martin Lau. This is a concrete
    possibility, setting up a big number of exceptions for the same route
    actually causes the issue, suggested by David Ahern.

Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: 2b760fcf5c ("ipv6: hook up exception table to store dst cache")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24 10:18:49 -07:00
Stefano Brivio
ee28906fd7 ipv4: Dump route exceptions if requested
Since commit 4895c771c7 ("ipv4: Add FIB nexthop exceptions."), cached
exception routes are stored as a separate entity, so they are not dumped
on a FIB dump, even if the RTM_F_CLONED flag is passed.

This implies that the command 'ip route list cache' doesn't return any
result anymore.

If the RTM_F_CLONED is passed, and strict checking requested, retrieve
nexthop exception routes and dump them. If no strict checking is
requested, filtering can't be performed consistently: dump everything in
that case.

With this, we need to add an argument to the netlink callback in order to
track how many entries were already dumped for the last leaf included in
a partial netlink dump.

A single additional argument is sufficient, even if we traverse logically
nested structures (nexthop objects, hash table buckets, bucket chains): it
doesn't matter if we stop in the middle of any of those, because they are
always traversed the same way. As an example, s_i values in [], s_fa
values in ():

  node (fa) #1 [1]
    nexthop #1
    bucket #1 -> #0 in chain (1)
    bucket #2 -> #0 in chain (2) -> #1 in chain (3) -> #2 in chain (4)
    bucket #3 -> #0 in chain (5) -> #1 in chain (6)

    nexthop #2
    bucket #1 -> #0 in chain (7) -> #1 in chain (8)
    bucket #2 -> #0 in chain (9)
  --
  node (fa) #2 [2]
    nexthop #1
    bucket #1 -> #0 in chain (1) -> #1 in chain (2)
    bucket #2 -> #0 in chain (3)

it doesn't matter if we stop at (3), (4), (7) for "node #1", or at (2)
for "node #2": walking flattens all that.

It would even be possible to drop the distinction between the in-tree
(s_i) and in-node (s_fa) counter, but a further improvement might
advise against this. This is only as accurate as the existing tracking
mechanism for leaves: if a partial dump is restarted after exceptions
are removed or expired, we might skip some non-dumped entries.

To improve this, we could attach a 'sernum' attribute (similar to the
one used for IPv6) to nexthop entities, and bump this counter whenever
exceptions change: having a distinction between the two counters would
make this more convenient.

Listing of exception routes (modified routes pre-3.5) was tested against
these versions of kernel and iproute2:

                    iproute2
kernel         4.14.0   4.15.0   4.19.0   5.0.0   5.1.0
 3.5-rc4         +        +        +        +       +
 4.4
 4.9
 4.14
 4.15
 4.19
 5.0
 5.1
 fixed           +        +        +        +       +

v7:
   - Move loop over nexthop objects to route.c, and pass struct fib_info
     and table ID to it, not a struct fib_alias (suggested by David Ahern)
   - While at it, note that the NULL check on fa->fa_info is redundant,
     and the check on RTNH_F_DEAD is also not consistent with what's done
     with regular route listing: just keep it for nhc_flags
   - Rename entry point function for dumping exceptions to
     fib_dump_info_fnhe(), and rearrange arguments for consistency with
     fib_dump_info()
   - Rename fnhe_dump_buckets() to fnhe_dump_bucket() and make it handle
     one bucket at a time
   - Expand commit message to describe why we can have a single "skip"
     counter for all exceptions stored in bucket chains in nexthop objects
     (suggested by David Ahern)

v6:
   - Rebased onto net-next
   - Loop over nexthop paths too. Move loop over fnhe buckets to route.c,
     avoids need to export rt_fill_info() and to touch exceptions from
     fib_trie.c. Pass NULL as flow to rt_fill_info(), it now allows that
     (suggested by David Ahern)

Fixes: 4895c771c7 ("ipv4: Add FIB nexthop exceptions.")
Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24 10:18:48 -07:00
Stefano Brivio
564c91f7e5 fib_frontend, ip6_fib: Select routes or exceptions dump from RTM_F_CLONED
The following patches add back the ability to dump IPv4 and IPv6 exception
routes, and we need to allow selection of regular routes or exceptions.

Use RTM_F_CLONED as filter to decide whether to dump routes or exceptions:
iproute2 passes it in dump requests (except for IPv6 cache flush requests,
this will be fixed in iproute2) and this used to work as long as
exceptions were stored directly in the FIB, for both IPv4 and IPv6.

Caveat: if strict checking is not requested (that is, if the dump request
doesn't go through ip_valid_fib_dump_req()), we can't filter on protocol,
tables or route types.

In this case, filtering on RTM_F_CLONED would be inconsistent: we would
fix 'ip route list cache' by returning exception routes and at the same
time introduce another bug in case another selector is present, e.g. on
'ip route list cache table main' we would return all exception routes,
without filtering on tables.

Keep this consistent by applying no filters at all, and dumping both
routes and exceptions, if strict checking is not requested. iproute2
currently filters results anyway, and no unwanted results will be
presented to the user. The kernel will just dump more data than needed.

v7: No changes

v6: Rebase onto net-next, no changes

v5: New patch: add dump_routes and dump_exceptions flags in filter and
    simply clear the unwanted one if strict checking is enabled, don't
    ignore NLM_F_MATCH and don't set filter_set if NLM_F_MATCH is set.
    Skip filtering altogether if no strict checking is requested:
    selecting routes or exceptions only would be inconsistent with the
    fact we can't filter on tables.

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24 10:18:48 -07:00
Dirk van der Merwe
9354544cbc net/tls: fix page double free on TX cleanup
With commit 94850257cf ("tls: Fix tls_device handling of partial records")
a new path was introduced to cleanup partial records during sk_proto_close.
This path does not handle the SW KTLS tx_list cleanup.

This is unnecessary though since the free_resources calls for both
SW and offload paths will cleanup a partial record.

The visible effect is the following warning, but this bug also causes
a page double free.

    WARNING: CPU: 7 PID: 4000 at net/core/stream.c:206 sk_stream_kill_queues+0x103/0x110
    RIP: 0010:sk_stream_kill_queues+0x103/0x110
    RSP: 0018:ffffb6df87e07bd0 EFLAGS: 00010206
    RAX: 0000000000000000 RBX: ffff8c21db4971c0 RCX: 0000000000000007
    RDX: ffffffffffffffa0 RSI: 000000000000001d RDI: ffff8c21db497270
    RBP: ffff8c21db497270 R08: ffff8c29f4748600 R09: 000000010020001a
    R10: ffffb6df87e07aa0 R11: ffffffff9a445600 R12: 0000000000000007
    R13: 0000000000000000 R14: ffff8c21f03f2900 R15: ffff8c21f03b8df0
    Call Trace:
     inet_csk_destroy_sock+0x55/0x100
     tcp_close+0x25d/0x400
     ? tcp_check_oom+0x120/0x120
     tls_sk_proto_close+0x127/0x1c0
     inet_release+0x3c/0x60
     __sock_release+0x3d/0xb0
     sock_close+0x11/0x20
     __fput+0xd8/0x210
     task_work_run+0x84/0xa0
     do_exit+0x2dc/0xb90
     ? release_sock+0x43/0x90
     do_group_exit+0x3a/0xa0
     get_signal+0x295/0x720
     do_signal+0x36/0x610
     ? SYSC_recvfrom+0x11d/0x130
     exit_to_usermode_loop+0x69/0xb0
     do_syscall_64+0x173/0x180
     entry_SYSCALL_64_after_hwframe+0x3d/0xa2
    RIP: 0033:0x7fe9b9abc10d
    RSP: 002b:00007fe9b19a1d48 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
    RAX: fffffffffffffe00 RBX: 0000000000000006 RCX: 00007fe9b9abc10d
    RDX: 0000000000000002 RSI: 0000000000000080 RDI: 00007fe948003430
    RBP: 00007fe948003410 R08: 00007fe948003430 R09: 0000000000000000
    R10: 0000000000000000 R11: 0000000000000246 R12: 00005603739d9080
    R13: 00007fe9b9ab9f90 R14: 00007fe948003430 R15: 0000000000000000

Fixes: 94850257cf ("tls: Fix tls_device handling of partial records")
Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-24 07:20:45 -07:00
Wei Wang
7d9e5f4221 ipv6: convert major tx path to use RT6_LOOKUP_F_DST_NOREF
For tx path, in most cases, we still have to take refcnt on the dst
cause the caller is caching the dst somewhere. But it still is
beneficial to make use of RT6_LOOKUP_F_DST_NOREF flag while doing the
route lookup. It is cause this flag prevents manipulating refcnt on
net->ipv6.ip6_null_entry when doing fib6_rule_lookup() to traverse each
routing table. The null_entry is a shared object and constant updates on
it cause false sharing.

We converted the current major lookup function ip6_route_output_flags()
to make use of RT6_LOOKUP_F_DST_NOREF.

Together with the change in the rx path, we see noticable performance
boost:
I ran synflood tests between 2 hosts under the same switch. Both hosts
have 20G mlx NIC, and 8 tx/rx queues.
Sender sends pure SYN flood with random src IPs and ports using trafgen.
Receiver has a simple TCP listener on the target port.
Both hosts have multiple custom rules:
- For incoming packets, only local table is traversed.
- For outgoing packets, 3 tables are traversed to find the route.
The packet processing rate on the receiver is as follows:
- Before the fix: 3.78Mpps
- After the fix:  5.50Mpps

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-23 13:24:17 -07:00
Wei Wang
d64a1f574a ipv6: honor RT6_LOOKUP_F_DST_NOREF in rule lookup logic
This patch specifically converts the rule lookup logic to honor this
flag and not release refcnt when traversing each rule and calling
lookup() on each routing table.
Similar to previous patch, we also need some special handling of dst
entries in uncached list because there is always 1 refcnt taken for them
even if RT6_LOOKUP_F_DST_NOREF flag is set.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-23 13:24:17 -07:00
Wei Wang
0e09edcce7 ipv6: introduce RT6_LOOKUP_F_DST_NOREF flag in ip6_pol_route()
This new flag is to instruct the route lookup function to not take
refcnt on the dst entry. The user which does route lookup with this flag
must properly use rcu protection.
ip6_pol_route() is the major route lookup function for both tx and rx
path.
In this function:
Do not take refcnt on dst if RT6_LOOKUP_F_DST_NOREF flag is set, and
directly return the route entry. The caller should be holding rcu lock
when using this flag, and decide whether to take refcnt or not.

One note on the dst cache in the uncached_list:
As uncached_list does not consume refcnt, one refcnt is always returned
back to the caller even if RT6_LOOKUP_F_DST_NOREF flag is set.
Uncached dst is only possible in the output path. So in such call path,
caller MUST check if the dst is in the uncached_list before assuming
that there is no refcnt taken on the returned dst.

Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Acked-by: Mahesh Bandewar <maheshb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-23 13:24:17 -07:00
Qian Cai
08003d0b63 inet: fix compilation warnings in fqdir_pre_exit()
The linux-next commit "inet: fix various use-after-free in defrags
units" [1] introduced compilation warnings,

./include/net/inet_frag.h:117:1: warning: 'inline' is not at beginning
of declaration [-Wold-style-declaration]
 static void inline fqdir_pre_exit(struct fqdir *fqdir)
 ^~~~~~
In file included from ./include/net/netns/ipv4.h:10,
                 from ./include/net/net_namespace.h:20,
                 from ./include/linux/netdevice.h:38,
                 from ./include/linux/icmpv6.h:13,
                 from ./include/linux/ipv6.h:86,
                 from ./include/net/ipv6.h:12,
                 from ./include/rdma/ib_verbs.h:51,
                 from ./include/linux/mlx5/device.h:37,
                 from ./include/linux/mlx5/driver.h:51,
                 from
drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:37:

[1] https://lore.kernel.org/netdev/20190618180900.88939-3-edumazet@google.com/

Signed-off-by: Qian Cai <cai@lca.pw>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-23 11:12:12 -07:00
Ard Biesheuvel
438ac88009 net: fastopen: robustness and endianness fixes for SipHash
Some changes to the TCP fastopen code to make it more robust
against future changes in the choice of key/cookie size, etc.

- Instead of keeping the SipHash key in an untyped u8[] buffer
  and casting it to the right type upon use, use the correct
  type directly. This ensures that the key will appear at the
  correct alignment if we ever change the way these data
  structures are allocated. (Currently, they are only allocated
  via kmalloc so they always appear at the correct alignment)

- Use DIV_ROUND_UP when sizing the u64[] array to hold the
  cookie, so it is always of sufficient size, even if
  TCP_FASTOPEN_COOKIE_MAX is no longer a multiple of 8.

- Drop the 'len' parameter from the tcp_fastopen_reset_cipher()
  function, which is no longer used.

- Add endian swabbing when setting the keys and calculating the hash,
  to ensure that cookie values are the same for a given key and
  source/destination address pair regardless of the endianness of
  the server.

Note that none of these are functional changes wrt the current
state of the code, with the exception of the swabbing, which only
affects big endian systems.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-22 16:30:37 -07:00
David S. Miller
92ad6325cb Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor SPDX change conflict.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-22 08:59:24 -04:00
Jason A. Donenfeld
9285ec4c8b timekeeping: Use proper clock specifier names in functions
This makes boot uniformly boottime and tai uniformly clocktai, to
address the remaining oversights.

Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Arnd Bergmann <arnd@arndb.de>
Link: https://lkml.kernel.org/r/20190621203249.3909-2-Jason@zx2c4.com
2019-06-22 12:11:27 +02:00
Linus Torvalds
c356dc4b54 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix leak of unqueued fragments in ipv6 nf_defrag, from Guillaume
    Nault.

 2) Don't access the DDM interface unless the transceiver implements it
    in bnx2x, from Mauro S. M. Rodrigues.

 3) Don't double fetch 'len' from userspace in sock_getsockopt(), from
    JingYi Hou.

 4) Sign extension overflow in lio_core, from Colin Ian King.

 5) Various netem bug fixes wrt. corrupted packets from Jakub Kicinski.

 6) Fix epollout hang in hvsock, from Sunil Muthuswamy.

 7) Fix regression in default fib6_type, from David Ahern.

 8) Handle memory limits in tcp_fragment more appropriately, from Eric
    Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (24 commits)
  tcp: refine memory limit test in tcp_fragment()
  inet: clear num_timeout reqsk_alloc()
  net: mvpp2: debugfs: Add pmap to fs dump
  ipv6: Default fib6_type to RTN_UNICAST when not set
  net: hns3: Fix inconsistent indenting
  net/af_iucv: always register net_device notifier
  net/af_iucv: build proper skbs for HiperTransport
  net/af_iucv: remove GFP_DMA restriction for HiperTransport
  net: dsa: mv88e6xxx: fix shift of FID bits in mv88e6185_g1_vtu_loadpurge()
  hvsock: fix epollout hang from race condition
  net/udp_gso: Allow TX timestamp with UDP GSO
  net: netem: fix use after free and double free with packet corruption
  net: netem: fix backlog accounting for corrupted GSO frames
  net: lio_core: fix potential sign-extension overflow on large shift
  tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skb
  ip6_tunnel: allow not to count pkts on tstats by passing dev as NULL
  ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL
  tun: wake up waitqueues after IFF_UP is set
  net: remove duplicate fetch in sock_getsockopt
  tipc: fix issues with early FAILOVER_MSG from peer
  ...
2019-06-21 22:23:35 -07:00
Linus Torvalds
c884d8ac7f SPDX update for 5.2-rc6
Another round of SPDX updates for 5.2-rc6
 
 Here is what I am guessing is going to be the last "big" SPDX update for
 5.2.  It contains all of the remaining GPLv2 and GPLv2+ updates that
 were "easy" to determine by pattern matching.  The ones after this are
 going to be a bit more difficult and the people on the spdx list will be
 discussing them on a case-by-case basis now.
 
 Another 5000+ files are fixed up, so our overall totals are:
 	Files checked:            64545
 	Files with SPDX:          45529
 
 Compared to the 5.1 kernel which was:
 	Files checked:            63848
 	Files with SPDX:          22576
 This is a huge improvement.
 
 Also, we deleted another 20000 lines of boilerplate license crud, always
 nice to see in a diffstat.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXQyQYA8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ymnGQCghETUBotn1p3hTjY56VEs6dGzpHMAnRT0m+lv
 kbsjBGEJpLbMRB2krnaU
 =RMcT
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx

Pull still more SPDX updates from Greg KH:
 "Another round of SPDX updates for 5.2-rc6

  Here is what I am guessing is going to be the last "big" SPDX update
  for 5.2. It contains all of the remaining GPLv2 and GPLv2+ updates
  that were "easy" to determine by pattern matching. The ones after this
  are going to be a bit more difficult and the people on the spdx list
  will be discussing them on a case-by-case basis now.

  Another 5000+ files are fixed up, so our overall totals are:
	Files checked:            64545
	Files with SPDX:          45529

  Compared to the 5.1 kernel which was:
	Files checked:            63848
	Files with SPDX:          22576

  This is a huge improvement.

  Also, we deleted another 20000 lines of boilerplate license crud,
  always nice to see in a diffstat"

* tag 'spdx-5.2-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/spdx: (65 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 507
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 506
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 505
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 504
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 503
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 502
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 501
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 499
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 498
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 497
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 496
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 495
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 491
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 490
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 489
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 488
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 487
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 486
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 485
  ...
2019-06-21 09:58:42 -07:00
David S. Miller
dca73a65a6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-06-19

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) new SO_REUSEPORT_DETACH_BPF setsocktopt, from Martin.

2) BTF based map definition, from Andrii.

3) support bpf_map_lookup_elem for xskmap, from Jonathan.

4) bounded loops and scalar precision logic in the verifier, from Alexei.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-20 00:06:27 -04:00
Jesper Dangaard Brouer
497ad9f5b2 page_pool: fix compile warning when CONFIG_PAGE_POOL is disabled
Kbuild test robot reported compile warning:
 warning: no return statement in function returning non-void
in function page_pool_request_shutdown, when CONFIG_PAGE_POOL is disabled.

The fix makes the code a little more verbose, with a descriptive variable.

Fixes: 99c07c43c4 ("xdp: tracking page_pool resources and safe removal")
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 21:26:06 -04:00
Eric Dumazet
85f9aa7565 inet: clear num_timeout reqsk_alloc()
KMSAN caught uninit-value in tcp_create_openreq_child() [1]
This is caused by a recent change, combined by the fact
that TCP cleared num_timeout, num_retrans and sk fields only
when a request socket was about to be queued.

Under syncookie mode, a temporary request socket is used,
and req->num_timeout could contain garbage.

Lets clear these three fields sooner, there is really no
point trying to defer this and risk other bugs.

[1]

BUG: KMSAN: uninit-value in tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526
CPU: 1 PID: 13357 Comm: syz-executor591 Not tainted 5.2.0-rc4+ #3
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x191/0x1f0 lib/dump_stack.c:113
 kmsan_report+0x162/0x2d0 mm/kmsan/kmsan.c:611
 __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:304
 tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526
 tcp_v6_syn_recv_sock+0x761/0x2d80 net/ipv6/tcp_ipv6.c:1152
 tcp_get_cookie_sock+0x16e/0x6b0 net/ipv4/syncookies.c:209
 cookie_v6_check+0x27e0/0x29a0 net/ipv6/syncookies.c:252
 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline]
 tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344
 tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554
 ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397
 ip6_input_finish net/ipv6/ip6_input.c:438 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447
 dst_input include/net/dst.h:439 [inline]
 ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272
 __netif_receive_skb_one_core net/core/dev.c:4981 [inline]
 __netif_receive_skb net/core/dev.c:5095 [inline]
 process_backlog+0x721/0x1410 net/core/dev.c:5906
 napi_poll net/core/dev.c:6329 [inline]
 net_rx_action+0x738/0x1940 net/core/dev.c:6395
 __do_softirq+0x4ad/0x858 kernel/softirq.c:293
 do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052
 </IRQ>
 do_softirq kernel/softirq.c:338 [inline]
 __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190
 local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
 rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline]
 ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117
 ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167
 dst_output include/net/dst.h:433 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271
 inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135
 __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156
 tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline]
 tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397
 __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573
 tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118
 tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403
 inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427
 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470
 __sock_release net/socket.c:601 [inline]
 sock_close+0x156/0x490 net/socket.c:1273
 __fput+0x4c9/0xba0 fs/file_table.c:280
 ____fput+0x37/0x40 fs/file_table.c:313
 task_work_run+0x22e/0x2a0 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:185 [inline]
 exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
 prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199
 syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279
 do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305
 entry_SYSCALL_64_after_hwframe+0x63/0xe7
RIP: 0033:0x401d50
Code: 01 f0 ff ff 0f 83 40 0d 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d dd 8d 2d 00 00 75 14 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 14 0d 00 00 c3 48 83 ec 08 e8 7a 02 00 00
RSP: 002b:00007fff1cf58cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003
RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000401d50
RDX: 000000000000001c RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00000000004a9050 R08: 0000000020000040 R09: 000000000000001c
R10: 0000000020004004 R11: 0000000000000246 R12: 0000000000402ef0
R13: 0000000000402f80 R14: 0000000000000000 R15: 0000000000000000

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:201 [inline]
 kmsan_internal_poison_shadow+0x53/0xa0 mm/kmsan/kmsan.c:160
 kmsan_kmalloc+0xa4/0x130 mm/kmsan/kmsan_hooks.c:177
 kmem_cache_alloc+0x534/0xb00 mm/slub.c:2781
 reqsk_alloc include/net/request_sock.h:84 [inline]
 inet_reqsk_alloc+0xa8/0x600 net/ipv4/tcp_input.c:6384
 cookie_v6_check+0xadb/0x29a0 net/ipv6/syncookies.c:173
 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline]
 tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344
 tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554
 ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397
 ip6_input_finish net/ipv6/ip6_input.c:438 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447
 dst_input include/net/dst.h:439 [inline]
 ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272
 __netif_receive_skb_one_core net/core/dev.c:4981 [inline]
 __netif_receive_skb net/core/dev.c:5095 [inline]
 process_backlog+0x721/0x1410 net/core/dev.c:5906
 napi_poll net/core/dev.c:6329 [inline]
 net_rx_action+0x738/0x1940 net/core/dev.c:6395
 __do_softirq+0x4ad/0x858 kernel/softirq.c:293
 do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052
 do_softirq kernel/softirq.c:338 [inline]
 __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190
 local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32
 rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline]
 ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117
 ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167
 dst_output include/net/dst.h:433 [inline]
 NF_HOOK include/linux/netfilter.h:305 [inline]
 ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271
 inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135
 __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156
 tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline]
 tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397
 __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573
 tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118
 tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403
 inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427
 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470
 __sock_release net/socket.c:601 [inline]
 sock_close+0x156/0x490 net/socket.c:1273
 __fput+0x4c9/0xba0 fs/file_table.c:280
 ____fput+0x37/0x40 fs/file_table.c:313
 task_work_run+0x22e/0x2a0 kernel/task_work.c:113
 tracehook_notify_resume include/linux/tracehook.h:185 [inline]
 exit_to_usermode_loop arch/x86/entry/common.c:168 [inline]
 prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199
 syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279
 do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305
 entry_SYSCALL_64_after_hwframe+0x63/0xe7

Fixes: 336c39a031 ("tcp: undo init congestion window on false SYNACK timeout")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 17:46:57 -04:00
Kevin Darbyshire-Bryant
16e5a266f5 net: sched: act_ctinfo: tidy UAPI definition
Remove some enums from the UAPI definition that were only used
internally and are NOT part of the UAPI.

Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 17:11:01 -04:00
Laura Garcia Liebana
79ebb5bb4e netfilter: nf_tables: enable set expiration time for set elements
Currently, the expiration of every element in a set or map
is a read-only parameter generated at kernel side.

This change will permit to set a certain expiration date
per element that will be required, for example, during
stateful replication among several nodes.

This patch handles the NFTA_SET_ELEM_EXPIRATION in order
to configure the expiration parameter per element, or
will use the timeout in the case that the expiration
is not set.

Signed-off-by: Laura Garcia Liebana <nevola@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-19 17:48:36 +02:00
Eric Dumazet
d5dd88794a inet: fix various use-after-free in defrags units
syzbot reported another issue caused by my recent patches. [1]

The issue here is that fqdir_exit() is initiating a work queue
and immediately returns. A bit later cleanup_net() was able
to free the MIB (percpu data) and the whole struct net was freed,
but we had active frag timers that fired and triggered use-after-free.

We need to make sure that timers can catch fqdir->dead being set,
to bailout.

Since RCU is used for the reader side, this means
we want to respect an RCU grace period between these operations :

1) qfdir->dead = 1;

2) netns dismantle (freeing of various data structure)

This patch uses new new (struct pernet_operations)->pre_exit
infrastructure to ensures a full RCU grace period
happens between fqdir_pre_exit() and fqdir_exit()

This also means we can use a regular work queue, we no
longer need rcu_work.

Tested:

$ time for i in {1..1000}; do unshare -n /bin/false;done

real	0m2.585s
user	0m0.160s
sys	0m2.214s

[1]

BUG: KASAN: use-after-free in ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
Read of size 8 at addr ffff88808b9fe330 by task syz-executor.4/11860

CPU: 1 PID: 11860 Comm: syz-executor.4 Not tainted 5.2.0-rc2+ #22
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 kasan_report+0x12/0x20 mm/kasan/common.c:614
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
 ip_expire+0x73e/0x800 net/ipv4/ip_fragment.c:152
 call_timer_fn+0x193/0x720 kernel/time/timer.c:1322
 expire_timers kernel/time/timer.c:1366 [inline]
 __run_timers kernel/time/timer.c:1685 [inline]
 __run_timers kernel/time/timer.c:1653 [inline]
 run_timer_softirq+0x66f/0x1740 kernel/time/timer.c:1698
 __do_softirq+0x25c/0x94c kernel/softirq.c:293
 invoke_softirq kernel/softirq.c:374 [inline]
 irq_exit+0x180/0x1d0 kernel/softirq.c:414
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
 </IRQ>
RIP: 0010:tomoyo_domain_quota_is_ok+0x131/0x540 security/tomoyo/util.c:1035
Code: 24 4c 3b 65 d0 0f 84 9c 00 00 00 e8 19 1d 73 fe 49 8d 7c 24 18 48 ba 00 00 00 00 00 fc ff df 48 89 f8 48 c1 e8 03 0f b6 04 10 <48> 89 fa 83 e2 07 38 d0 7f 08 84 c0 0f 85 69 03 00 00 41 0f b6 5c
RSP: 0018:ffff88806ae079c0 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000000 RBX: 0000000000000010 RCX: ffffc9000e655000
RDX: dffffc0000000000 RSI: ffffffff82fd88a7 RDI: ffff888086202398
RBP: ffff88806ae07a00 R08: ffff88808b6c8700 R09: ffffed100d5c0f4d
R10: ffffed100d5c0f4c R11: 0000000000000000 R12: ffff888086202380
R13: 0000000000000030 R14: 00000000000000d3 R15: 0000000000000000
 tomoyo_supervisor+0x2e8/0xef0 security/tomoyo/common.c:2087
 tomoyo_audit_path_number_log security/tomoyo/file.c:235 [inline]
 tomoyo_path_number_perm+0x42f/0x520 security/tomoyo/file.c:734
 tomoyo_file_ioctl+0x23/0x30 security/tomoyo/tomoyo.c:335
 security_file_ioctl+0x77/0xc0 security/security.c:1370
 ksys_ioctl+0x57/0xd0 fs/ioctl.c:711
 __do_sys_ioctl fs/ioctl.c:720 [inline]
 __se_sys_ioctl fs/ioctl.c:718 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x4592c9
Code: fd b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 cb b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f8db5e44c78 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00000000004592c9
RDX: 0000000020000080 RSI: 00000000000089f1 RDI: 0000000000000006
RBP: 000000000075bf20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f8db5e456d4
R13: 00000000004cc770 R14: 00000000004d5cd8 R15: 00000000ffffffff

Allocated by task 9047:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_kmalloc mm/kasan/common.c:489 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
 slab_post_alloc_hook mm/slab.h:437 [inline]
 slab_alloc mm/slab.c:3326 [inline]
 kmem_cache_alloc+0x11a/0x6f0 mm/slab.c:3488
 kmem_cache_zalloc include/linux/slab.h:732 [inline]
 net_alloc net/core/net_namespace.c:386 [inline]
 copy_net_ns+0xed/0x340 net/core/net_namespace.c:426
 create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
 unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
 ksys_unshare+0x440/0x980 kernel/fork.c:2692
 __do_sys_unshare kernel/fork.c:2760 [inline]
 __se_sys_unshare kernel/fork.c:2758 [inline]
 __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 2541:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
 __cache_free mm/slab.c:3432 [inline]
 kmem_cache_free+0x86/0x260 mm/slab.c:3698
 net_free net/core/net_namespace.c:402 [inline]
 net_drop_ns.part.0+0x70/0x90 net/core/net_namespace.c:409
 net_drop_ns net/core/net_namespace.c:408 [inline]
 cleanup_net+0x538/0x960 net/core/net_namespace.c:571
 process_one_work+0x989/0x1790 kernel/workqueue.c:2269
 worker_thread+0x98/0xe40 kernel/workqueue.c:2415
 kthread+0x354/0x420 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

The buggy address belongs to the object at ffff88808b9fe100
 which belongs to the cache net_namespace of size 6784
The buggy address is located 560 bytes inside of
 6784-byte region [ffff88808b9fe100, ffff88808b9ffb80)
The buggy address belongs to the page:
page:ffffea00022e7f80 refcount:1 mapcount:0 mapping:ffff88821b6f60c0 index:0x0 compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea000256f288 ffffea0001bbef08 ffff88821b6f60c0
raw: 0000000000000000 ffff88808b9fe100 0000000100000001 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88808b9fe200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88808b9fe280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff88808b9fe300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                                     ^
 ffff88808b9fe380: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88808b9fe400: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 3c8fc87820 ("inet: frags: rework rhashtable dismantle")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:37:47 -04:00
Eric Dumazet
d7d99872c1 netns: add pre_exit method to struct pernet_operations
Current struct pernet_operations exit() handlers are highly
discouraged to call synchronize_rcu().

There are cases where we need them, and exit_batch() does
not help the common case where a single netns is dismantled.

This patch leverages the existing synchronize_rcu() call
in cleanup_net()

Calling optional ->pre_exit() method before ->exit() or
->exit_batch() allows to benefit from a single synchronize_rcu()
call.

Note that the synchronize_rcu() calls added in this patch
are only in error paths or slow paths.

Tested:

$ time for i in {1..1000}; do unshare -n /bin/false;done

real	0m2.612s
user	0m0.171s
sys	0m2.216s

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:37:47 -04:00
Jesper Dangaard Brouer
f033b688c1 xdp: add tracepoints for XDP mem
These tracepoints make it easier to troubleshoot XDP mem id disconnect.

The xdp:mem_disconnect tracepoint cannot be replaced via kprobe. It is
placed at the last stable place for the pointer to struct xdp_mem_allocator,
just before it's scheduled for RCU removal. It also extract info on
'safe_to_remove' and 'force'.

Detailed info about in-flight pages is not available at this layer. The next
patch will added tracepoints needed at the page_pool layer for this.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Jesper Dangaard Brouer
99c07c43c4 xdp: tracking page_pool resources and safe removal
This patch is needed before we can allow drivers to use page_pool for
DMA-mappings. Today with page_pool and XDP return API, it is possible to
remove the page_pool object (from rhashtable), while there are still
in-flight packet-pages. This is safely handled via RCU and failed lookups in
__xdp_return() fallback to call put_page(), when page_pool object is gone.
In-case page is still DMA mapped, this will result in page note getting
correctly DMA unmapped.

To solve this, the page_pool is extended with tracking in-flight pages. And
XDP disconnect system queries page_pool and waits, via workqueue, for all
in-flight pages to be returned.

To avoid killing performance when tracking in-flight pages, the implement
use two (unsigned) counters, that in placed on different cache-lines, and
can be used to deduct in-flight packets. This is done by mapping the
unsigned "sequence" counters onto signed Two's complement arithmetic
operations. This is e.g. used by kernel's time_after macros, described in
kernel commit 1ba3aab303 and 5a581b367b, and also explained in RFC1982.

The trick is these two incrementing counters only need to be read and
compared, when checking if it's safe to free the page_pool structure. Which
will only happen when driver have disconnected RX/alloc side. Thus, on a
non-fast-path.

It is chosen that page_pool tracking is also enabled for the non-DMA
use-case, as this can be used for statistics later.

After this patch, using page_pool requires more strict resource "release",
e.g. via page_pool_release_page() that was introduced in this patchset, and
previous patches implement/fix this more strict requirement.

Drivers no-longer call page_pool_destroy(). Drivers already call
xdp_rxq_info_unreg() which call xdp_rxq_info_unreg_mem_model(), which will
attempt to disconnect the mem id, and if attempt fails schedule the
disconnect for later via delayed workqueue.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Jesper Dangaard Brouer
e54cfd7e17 page_pool: introduce page_pool_free and use in mlx5
In case driver fails to register the page_pool with XDP return API (via
xdp_rxq_info_reg_mem_model()), then the driver can free the page_pool
resources more directly than calling page_pool_destroy(), which does a
unnecessarily RCU free procedure.

This patch is preparing for removing page_pool_destroy(), from driver
invocation.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Jesper Dangaard Brouer
6bf071bf09 xdp: page_pool related fix to cpumap
When converting an xdp_frame into an SKB, and sending this into the network
stack, then the underlying XDP memory model need to release associated
resources, because the network stack don't have callbacks for XDP memory
models.  The only memory model that needs this is page_pool, when a driver
use the DMA-mapping feature.

Introduce page_pool_release_page(), which basically does the same as
page_pool_unmap_page(). Add xdp_release_frame() as the XDP memory model
interface for calling it, if the memory model match MEM_TYPE_PAGE_POOL, to
save the function call overhead for others. Have cpumap call
xdp_release_frame() before xdp_scrub_frame().

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Ilias Apalodimas
a25d50bfe6 net: page_pool: add helper function to unmap dma addresses
On a previous patch dma addr was stored in 'struct page'.
Use that to unmap DMA addresses used by network drivers

Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Ilias Apalodimas
0afdeeed08 net: page_pool: add helper function to retrieve dma addresses
On a previous patch dma addr was stored in 'struct page'.
Use that to retrieve DMA addresses used by network drivers

Signed-off-by: Ilias Apalodimas <ilias.apalodimas@linaro.org>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 11:23:13 -04:00
Thomas Gleixner
d2912cb15b treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 500
Based on 2 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation #

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 4122 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Enrico Weigelt <info@metux.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:55 +02:00
Thomas Gleixner
caab277b1d treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 234
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not see http www gnu org
  licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 503 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Enrico Weigelt <info@metux.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190602204653.811534538@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-19 17:09:07 +02:00
Jiri Pirko
9558a83aee net: flow_offload: implement support for meta key
Implement support for previously added flow dissector meta key.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 10:09:22 -04:00
Jiri Pirko
82828b88f0 flow_dissector: add support for ingress ifindex dissection
Add new key meta that contains ingress ifindex value and add a function
to dissect this from skb. The key and function is prepared to cover
other potential skb metadata values dissection.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-19 10:09:21 -04:00
Xin Long
6f6a862205 ip6_tunnel: allow not to count pkts on tstats by passing dev as NULL
A similar fix to Patch "ip_tunnel: allow not to count pkts on tstats by
setting skb's dev to NULL" is also needed by ip6_tunnel.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 20:48:45 -04:00
Ido Schimmel
d5382fef70 ipv6: Stop sending in-kernel notifications for each nexthop
Both listeners - mlxsw and netdevsim - of IPv6 FIB notifications are now
ready to handle IPv6 multipath notifications.

Therefore, stop ignoring such notifications in both drivers and stop
sending notification for each added / deleted nexthop.

v2:
* Remove 'multipath_rt' from 'struct fib6_entry_notifier_info'

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 09:45:37 -07:00
Ido Schimmel
d4b96c7b51 ipv6: Extend notifier info for multipath routes
Extend the IPv6 FIB notifier info with number of sibling routes being
notified.

This will later allow listeners to process one notification for a
multipath routes instead of N, where N is the number of nexthops.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 09:45:36 -07:00
Ido Schimmel
c82481f7ea netlink: Add field to skip in-kernel notifications
The struct includes a 'skip_notify' flag that indicates if netlink
notifications to user space should be suppressed. As explained in commit
3b1137fe74 ("net: ipv6: Change notifications for multipath add to
RTA_MULTIPATH"), this is useful to suppress per-nexthop RTM_NEWROUTE
notifications when an IPv6 multipath route is added / deleted. Instead,
one notification is sent for the entire multipath route.

This concept is also useful for in-kernel notifications. Sending one
in-kernel notification for the addition / deletion of an IPv6 multipath
route - instead of one per-nexthop - provides a significant increase in
the insertion / deletion rate to underlying devices.

Add a 'skip_notify_kernel' flag to suppress in-kernel notifications.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 09:45:36 -07:00
Ido Schimmel
3de205cde4 netlink: Document all fields of 'struct nl_info'
Some fields were not documented. Add documentation.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-18 09:45:36 -07:00
David S. Miller
13091aa305 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Honestly all the conflicts were simple overlapping changes,
nothing really interesting to report.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 20:20:36 -07:00
Linus Torvalds
da0f382029 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "Lots of bug fixes here:

   1) Out of bounds access in __bpf_skc_lookup, from Lorenz Bauer.

   2) Fix rate reporting in cfg80211_calculate_bitrate_he(), from John
      Crispin.

   3) Use after free in psock backlog workqueue, from John Fastabend.

   4) Fix source port matching in fdb peer flow rule of mlx5, from Raed
      Salem.

   5) Use atomic_inc_not_zero() in fl6_sock_lookup(), from Eric Dumazet.

   6) Network header needs to be set for packet redirect in nfp, from
      John Hurley.

   7) Fix udp zerocopy refcnt, from Willem de Bruijn.

   8) Don't assume linear buffers in vxlan and geneve error handlers,
      from Stefano Brivio.

   9) Fix TOS matching in mlxsw, from Jiri Pirko.

  10) More SCTP cookie memory leak fixes, from Neil Horman.

  11) Fix VLAN filtering in rtl8366, from Linus Walluij.

  12) Various TCP SACK payload size and fragmentation memory limit fixes
      from Eric Dumazet.

  13) Use after free in pneigh_get_next(), also from Eric Dumazet.

  14) LAPB control block leak fix from Jeremy Sowden"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (145 commits)
  lapb: fixed leak of control-blocks.
  tipc: purge deferredq list for each grp member in tipc_group_delete
  ax25: fix inconsistent lock state in ax25_destroy_timer
  neigh: fix use-after-free read in pneigh_get_next
  tcp: fix compile error if !CONFIG_SYSCTL
  hv_sock: Suppress bogus "may be used uninitialized" warnings
  be2net: Fix number of Rx queues used for flow hashing
  net: handle 802.1P vlan 0 packets properly
  tcp: enforce tcp_min_snd_mss in tcp_mtu_probing()
  tcp: add tcp_min_snd_mss sysctl
  tcp: tcp_fragment() should apply sane memory limits
  tcp: limit payload size of sacked skbs
  Revert "net: phylink: set the autoneg state in phylink_phy_change"
  bpf: fix nested bpf tracepoints with per-cpu data
  bpf: Fix out of bounds memory access in bpf_sk_storage
  vsock/virtio: set SOCK_DONE on peer shutdown
  net: dsa: rtl8366: Fix up VLAN filtering
  net: phylink: set the autoneg state in phylink_phy_change
  net: add high_order_alloc_disable sysctl/static key
  tcp: add tcp_tx_skb_cache sysctl
  ...
2019-06-17 15:55:34 -07:00
Ard Biesheuvel
c681edae33 net: ipv4: move tcp_fastopen server side code to SipHash library
Using a bare block cipher in non-crypto code is almost always a bad idea,
not only for security reasons (and we've seen some examples of this in
the kernel in the past), but also for performance reasons.

In the TCP fastopen case, we call into the bare AES block cipher one or
two times (depending on whether the connection is IPv4 or IPv6). On most
systems, this results in a call chain such as

  crypto_cipher_encrypt_one(ctx, dst, src)
    crypto_cipher_crt(tfm)->cit_encrypt_one(crypto_cipher_tfm(tfm), ...);
      aesni_encrypt
        kernel_fpu_begin();
        aesni_enc(ctx, dst, src); // asm routine
        kernel_fpu_end();

It is highly unlikely that the use of special AES instructions has a
benefit in this case, especially since we are doing the above twice
for IPv6 connections, instead of using a transform which can process
the entire input in one go.

We could switch to the cbcmac(aes) shash, which would at least get
rid of the duplicated overhead in *some* cases (i.e., today, only
arm64 has an accelerated implementation of cbcmac(aes), while x86 will
end up using the generic cbcmac template wrapping the AES-NI cipher,
which basically ends up doing exactly the above). However, in the given
context, it makes more sense to use a light-weight MAC algorithm that
is more suitable for the purpose at hand, such as SipHash.

Since the output size of SipHash already matches our chosen value for
TCP_FASTOPEN_COOKIE_SIZE, and given that it accepts arbitrary input
sizes, this greatly simplifies the code as well.

NOTE: Server farms backing a single server IP for load balancing purposes
      and sharing a single fastopen key will be adversely affected by
      this change unless all systems in the pool receive their kernel
      upgrades at the same time.

Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-17 13:56:26 -07:00
Fernando Fernandez Mancera
d7f9b2f18e netfilter: synproxy: extract SYNPROXY infrastructure from {ipt, ip6t}_SYNPROXY
Add common functions into nf_synproxy_core.c to prepare for nftables support.
The prototypes of the functions used by {ipt, ip6t}_SYNPROXY are in the new
file nf_synproxy.h

Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17 17:12:55 +02:00
Christian Brauner
ff6d090d0d netfilter: bridge: port sysctls to use brnf_net
This ports the sysctls to use struct brnf_net.

With this patch we make it possible to namespace the br_netfilter module in
the following patch.

Signed-off-by: Christian Brauner <christian.brauner@ubuntu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17 16:36:30 +02:00
Florian Westphal
87e389b4c2 netfilter: conntrack: small conntrack lookup optimization
____nf_conntrack_find() performs checks on the conntrack objects in
this order:

1. if (nf_ct_is_expired(ct))

This fetches ct->timeout, in third cache line.

The hnnode that is used to store the list pointers resides in the first
(origin) or second (reply tuple) cache lines.

This test rarely passes, but its necessary to reap obsolete entries.

2. if (nf_ct_is_dying(ct))

This fetches ct->status, also in third cache line.

The test is useless, and can be removed:
  Consider:
     cpu0                                           cpu1
    ct = ____nf_conntrack_find()
    atomic_inc_not_zero(ct) -> ok
    nf_ct_key_equal -> ok
    is_dying -> DYING bit not set, ok
                                                    set_bit(ct, DYING);
						    ... unhash ... etc.
    return ct
    -> returning a ct with dying bit set, despite
    having a test for it.

This (unlikely) case is fine - refcount prevents ct from getting free'd.

3. if (nf_ct_key_equal(h, tuple, zone, net))

nf_ct_key_equal checks in following order:

1. Tuple equal (first or second cacheline)
2. Zone equal (third cacheline)
3. confirmed bit set (->status, third cacheline)
4. net namespace match (third cacheline).

Swapping "timeout" and "cpu" places timeout in the first cacheline.
This has two advantages:

1. For a conntrack that won't even match the original tuple,
   we will now only fetch the first and maybe the second cacheline
   instead of always accessing the 3rd one as well.

2.  in case of TCP ct->timeout changes frequently because we
    reduce/increase it when there are packets outstanding in the network.

The first cacheline contains both the reference count and the ct spinlock,
i.e. moving timeout there avoids writes to 3rd cacheline.

The restart sequence in __nf_conntrack_find() is removed, if we found a
candidate, but then fail to increment the refcount or discover the tuple
has changed (object recycling), just pretend we did not find an entry.

A second lookup won't find anything until another CPU adds a new conntrack
with identical tuple into the hash table, which is very unlikely.

We have the confirmation-time checks (when we hold hash lock) that deal
with identical entries and even perform clash resolution in some cases.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-06-17 16:35:30 +02:00
Leon Romanovsky
98fdbea550 net/mlx5: Declare more strictly devlink encap mode
Devlink has UAPI declaration for encap mode, so there is no
need to be loose on the data get/set by drivers.

Update call sites to use enum devlink_eswitch_encap_mode
instead of plain u8.

Suggested-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Leon Romanovsky <leonro@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Parav Pandit <parav@mellanox.com>
Reviewed-by: Petr Vorel <pvorel@suse.cz>
2019-06-16 15:40:03 +03:00
Eric Dumazet
5f3e2bf008 tcp: add tcp_min_snd_mss sysctl
Some TCP peers announce a very small MSS option in their SYN and/or
SYN/ACK messages.

This forces the stack to send packets with a very high network/cpu
overhead.

Linux has enforced a minimal value of 48. Since this value includes
the size of TCP options, and that the options can consume up to 40
bytes, this means that each segment can include only 8 bytes of payload.

In some cases, it can be useful to increase the minimal value
to a saner value.

We still let the default to 48 (TCP_MIN_SND_MSS), for compatibility
reasons.

Note that TCP_MAXSEG socket option enforces a minimal value
of (TCP_MIN_MSS). David Miller increased this minimal value
in commit c39508d6f1 ("tcp: Make TCP_MAXSEG minimum more correct.")
from 64 to 88.

We might in the future merge TCP_MIN_SND_MSS and TCP_MIN_MSS.

CVE-2019-11479 -- tcp mss hardcoded to 48

Signed-off-by: Eric Dumazet <edumazet@google.com>
Suggested-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Tyler Hicks <tyhicks@canonical.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 18:47:31 -07:00
Eric Dumazet
3b4929f65b tcp: limit payload size of sacked skbs
Jonathan Looney reported that TCP can trigger the following crash
in tcp_shifted_skb() :

	BUG_ON(tcp_skb_pcount(skb) < pcount);

This can happen if the remote peer has advertized the smallest
MSS that linux TCP accepts : 48

An skb can hold 17 fragments, and each fragment can hold 32KB
on x86, or 64KB on PowerPC.

This means that the 16bit witdh of TCP_SKB_CB(skb)->tcp_gso_segs
can overflow.

Note that tcp_sendmsg() builds skbs with less than 64KB
of payload, so this problem needs SACK to be enabled.
SACK blocks allow TCP to coalesce multiple skbs in the retransmit
queue, thus filling the 17 fragments to maximal capacity.

CVE-2019-11477 -- u16 overflow of TCP_SKB_CB(skb)->tcp_gso_segs

Fixes: 832d11c5cd ("tcp: Try to restore large SKBs while SACK processing")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Jonathan Looney <jtl@netflix.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Tyler Hicks <tyhicks@canonical.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Bruce Curtis <brucec@netflix.com>
Cc: Jonathan Lemon <jonathan.lemon@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 18:47:31 -07:00
Jiri Pirko
a51486266c net: sched: remove NET_CLS_IND config option
This config option makes only couple of lines optional.
Two small helpers and an int in couple of cls structs.

Remove the config option and always compile this in.
This saves the user from unexpected surprises when he adds
a filter with ingress device match which is silently ignored
in case the config option is not set.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-15 14:06:13 -07:00
Vivien Didelot
68b2d4a844 net: dsa: make cpu_dp non const
A port may trigger operations on its dedicated CPU port, so using
cpu_dp as const will raise warnings. Make cpu_dp non const.

Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:20:07 -07:00
Eric Dumazet
ce27ec6064 net: add high_order_alloc_disable sysctl/static key
>From linux-3.7, (commit 5640f76858 "net: use a per task frag
allocator") TCP sendmsg() has preferred using order-3 allocations.

While it gives good results for most cases, we had reports
that heavy uses of TCP over loopback were hitting a spinlock
contention in page allocations/freeing.

This commits adds a sysctl so that admins can opt-in
for order-0 allocations. Hopefully mm layer might optimize
order-3 allocations in the future since it could give us
a nice boost  (see 8 lines of following benchmark)

The following benchmark shows a win when more than 8 TCP_STREAM
threads are running (56 x86 cores server in my tests)

for thr in {1..30}
do
 sysctl -wq net.core.high_order_alloc_disable=0
 T0=`./super_netperf $thr -H 127.0.0.1 -l 15`
 sysctl -wq net.core.high_order_alloc_disable=1
 T1=`./super_netperf $thr -H 127.0.0.1 -l 15`
 echo $thr:$T0:$T1
done

1: 49979: 37267
2: 98745: 76286
3: 141088: 110051
4: 177414: 144772
5: 197587: 173563
6: 215377: 208448
7: 241061: 234087
8: 267155: 263373
9: 295069: 297402
10: 312393: 335213
11: 340462: 368778
12: 371366: 403954
13: 412344: 443713
14: 426617: 473580
15: 474418: 507861
16: 503261: 538539
17: 522331: 563096
18: 532409: 567084
19: 550824: 605240
20: 525493: 641988
21: 564574: 665843
22: 567349: 690868
23: 583846: 710917
24: 588715: 736306
25: 603212: 763494
26: 604083: 792654
27: 602241: 796450
28: 604291: 797993
29: 611610: 833249
30: 577356: 841062

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:18:28 -07:00
Eric Dumazet
0b7d7f6b22 tcp: add tcp_tx_skb_cache sysctl
Feng Tang reported a performance regression after introduction
of per TCP socket tx/rx caches, for TCP over loopback (netperf)

There is high chance the regression is caused by a change on
how well the 32 KB per-thread page (current->task_frag) can
be recycled, and lack of pcp caches for order-3 pages.

I could not reproduce the regression myself, cpus all being
spinning on the mm spinlocks for page allocs/freeing, regardless
of enabling or disabling the per tcp socket caches.

It seems best to disable the feature by default, and let
admins enabling it.

MM layer either needs to provide scalable order-3 pages
allocations, or could attempt a trylock on zone->lock if
the caller only attempts to get a high-order page and is
able to fallback to order-0 ones in case of pressure.

Tests run on a 56 cores host (112 hyper threads)

-	35.49%	netperf 		 [kernel.vmlinux]	  [k] queued_spin_lock_slowpath
   - 35.49% queued_spin_lock_slowpath
	  - 18.18% get_page_from_freelist
		 - __alloc_pages_nodemask
			- 18.18% alloc_pages_current
				 skb_page_frag_refill
				 sk_page_frag_refill
				 tcp_sendmsg_locked
				 tcp_sendmsg
				 inet_sendmsg
				 sock_sendmsg
				 __sys_sendto
				 __x64_sys_sendto
				 do_syscall_64
				 entry_SYSCALL_64_after_hwframe
				 __libc_send
	  + 17.31% __free_pages_ok
+	31.43%	swapper 		 [kernel.vmlinux]	  [k] intel_idle
+	 9.12%	netperf 		 [kernel.vmlinux]	  [k] copy_user_enhanced_fast_string
+	 6.53%	netserver		 [kernel.vmlinux]	  [k] copy_user_enhanced_fast_string
+	 0.69%	netserver		 [kernel.vmlinux]	  [k] queued_spin_lock_slowpath
+	 0.68%	netperf 		 [kernel.vmlinux]	  [k] skb_release_data
+	 0.52%	netperf 		 [kernel.vmlinux]	  [k] tcp_sendmsg_locked
	 0.46%	netperf 		 [kernel.vmlinux]	  [k] _raw_spin_lock_irqsave

Fixes: 472c2e07ee ("tcp: add one skb cache for tx")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Feng Tang <feng.tang@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:18:28 -07:00
Eric Dumazet
ede61ca474 tcp: add tcp_rx_skb_cache sysctl
Instead of relying on rps_needed, it is safer to use a separate
static key, since we do not want to enable TCP rx_skb_cache
by default. This feature can cause huge increase of memory
usage on hosts with millions of sockets.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 20:18:28 -07:00
Eric Dumazet
d6fb396cfa ipv4: tcp: fix ACK/RST sent with a transmit delay
If we want to set a EDT time for the skb we want to send
via ip_send_unicast_reply(), we have to pass a new parameter
and initialize ipc.sockc.transmit_time with it.

This fixes the EDT time for ACK/RST packets sent on behalf of
a TIME_WAIT socket.

Fixes: a842fe1425 ("tcp: add optional per socket transmit delay")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 19:51:12 -07:00
Martin KaFai Lau
99f3a064bc bpf: net: Add SO_DETACH_REUSEPORT_BPF
There is SO_ATTACH_REUSEPORT_[CE]BPF but there is no DETACH.
This patch adds SO_DETACH_REUSEPORT_BPF sockopt.  The same
sockopt can be used to undo both SO_ATTACH_REUSEPORT_[CE]BPF.

reseport_detach_prog() is added and it is mostly a mirror
of the existing reuseport_attach_prog().  The differences are,
it does not call reuseport_alloc() and returns -ENOENT when
there is no old prog.

Cc: Craig Gallek <kraig@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-06-15 01:21:19 +02:00
David S. Miller
d96ec97511 Many changes all over:
* HE (802.11ax) work continues
  * WPA3 offloads
  * work on extended key ID handling continues
  * fixes to honour AP supported rates with auth/assoc frames
  * nl80211 netlink policy improvements to fix some issues
    with strict validation on new commands with old attrs
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl0Dq/sACgkQB8qZga/f
 l8RqFg/+MBcuqvW2xTy5o5Lbw7Drx5ROgFT2ZRAO6PTeboQ43NOBiXt2dEhDbp+w
 mHChImF85px3SFMBSvuf97zlScNV6+VJraDDjoZFixt/gIZ/XsdURo5i4IGmUbfj
 +LY1oPm7suC5Cold+yPicHTukFpeU7cSwceslFsecqiN5unlzIxf6gY9H7OL7WGT
 s0Wis0x3y2m9mMi4cvQfHkFzplcTc5SBgPLyLQtHUNx1eySEZ+AymlNVmbGrRWr9
 vaCU5W9+Wz0N6lEB/UI5y6fZzj5mhkcimGck1Os7dFeC7KWjntjT9iKIkFHWehxi
 QfLcK6pGjLpPpMTQtOEfl34ZGnOyO8N9GmOLaaUaBeaZItabYJwfgbdr7NxiJvta
 1cyqXek+D2G7WOa0aIrWhmwswKGBa3nIBqS/ZP/SEWLEzU1Cn0NiAD5Ba016TC4C
 D+1BBXIdpQDoZCgfd6KkGs2Ynf/8N3OwHW+EwjpAu3IARTQzb6tMWSvkAuAgJt1F
 dBD7NqdFhWXFfxqf9NpB8bkmpyNKM4Km6eO2HKpCg/5suKqYJ1Xj9EeQin1B+QsE
 Jntj69hQ6Kj2gKBPy+RnCBFbxMNuFhpc1kmUOGj9U9aAcOntV0woVOyFGsbRmFo3
 MI8aVU/gjQDCcHHD5xtJGHa11uIefXq1r2H7Um3sxKYeBsqFjP4=
 =j+Um
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2019-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Many changes all over:
 * HE (802.11ax) work continues
 * WPA3 offloads
 * work on extended key ID handling continues
 * fixes to honour AP supported rates with auth/assoc frames
 * nl80211 netlink policy improvements to fix some issues
   with strict validation on new commands with old attrs
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 11:27:26 -07:00
David S. Miller
2a2af5e6e6 Various fixes, all over:
* a few memory leaks
  * fixes for management frame protection security
    and A2/A3 confusion (affecting TDLS as well)
  * build fix for certificates
  * etc.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAl0DpdQACgkQB8qZga/f
 l8RvRhAAnJGBRb73LMCUGQdgv8IXXVX8jwRwIwLT9FeJc9Pg9I7o/d4UXvH0ss2D
 qIPWC7CYuI8LUuyu8RXiO3iFKKbWaWDI9cQj9jKXTRWSTUGgSs1zgS3yEcJPJY/V
 q74g3MjK9yYE7UUbhI/ud5yrKEc6XXAWgaGKZzuNYS/SR6vpmy/v+jH8SLKjIS48
 iXQUAQJn/TgIynjfm/d8GNLr5TN5i4uqRD6trdSeWaKIVK/3Q8GO4C6DvqLJuClJ
 n7XTUG0Xbzs4U+k5abtTsRIz6Mh5nHiqCPS/ueeQuLASJzVeXg2mfNGzsbJdLh0c
 J65kbvBqeG0/AD5uybl8VmUgcW/mSDevM6g1pOVbHDrPcg1dyzQBAihKRaoAkM0f
 9YpzWxkQSt9loE1Md9Fn0knhesttt/2wc72Rs/jEeDftj1NP7nt3fnHF2xufHHdb
 JYjsgcLX3rmIrRSvn4yup8kPWmaCI0dvPDbfSQEH9PrthhQVCtHuiFwAmu9LO0o4
 CQ0RuiYFKYOuigabVn32w3S57jKBo/ie09Nnw/sJIsXDiaPLGyxp9L+5wbf8Dhnd
 OeqlhYZD26oTx/gz0lxXjX19ZfWZEN2rpXTCPq7FVMgWMjGJCgQok4WfTfB39hc+
 lPJOa4YM6G1rRppZQfUUmayPXYLw1VJTioNSf8TMLW7opf2aT1o=
 =AHi5
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2019-06-14' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Various fixes, all over:
 * a few memory leaks
 * fixes for management frame protection security
   and A2/A3 confusion (affecting TDLS as well)
 * build fix for certificates
 * etc.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-14 09:36:51 -07:00
James Prestwood
1c38c7f220 nl80211: send event when CMD_FRAME duration expires
cfg80211_remain_on_channel_expired is used to notify userspace when
the remain on channel duration expired by sending an event. There is
no such equivalent to CMD_FRAME, where if offchannel and a duration
is provided, the card will go offchannel for that duration. Currently
there is no way for userspace to tell when that duration expired
apart from setting an independent timeout. This timeout is quite
erroneous as the kernel may not immediately send out the frame
because of scheduling or work queue delays. In testing, it was found
this timeout had to be quite large to accomidate any potential delays.

A better solution is to have the kernel send an event when this
duration has expired. There is already NL80211_CMD_FRAME_WAIT_CANCEL
which can be used to cancel a NL80211_CMD_FRAME offchannel. Using this
command matches perfectly to how NL80211_CMD_CANCEL_REMAIN_ON_CHANNEL
works, where its both used to cancel and notify if the duration has
expired.

Signed-off-by: James Prestwood <james.prestwood@linux.intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 16:08:28 +02:00
Johannes Berg
1e87fec9fa mac80211: call rate_control_send_low() internally
There's no rate control algorithm that *doesn't* want to call
it internally, and calling it internally will let us modify
its behaviour in the future.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:17:37 +02:00
Ilan Peer
4770c8f902 cfg80211: Add a function to iterate all BSS entries
Add a function that iterates over the BSS entries associated with a
given wiphy and calls a callback for each iterated BSS. This can be
used by drivers in various ways, e.g., to evaluate some property for
all the BSSs in the medium.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:16:02 +02:00
John Crispin
a0de1ca383 mac80211: allow turning TWT responder support on and off via netlink
Allow the userland daemon to en/disable TWT support for an AP.

Signed-off-by: Shashidhar Lakkavalli <slakkavalli@datto.com>
Signed-off-by: John Crispin <john@phrozen.org>
[simplify parsing code]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:14:10 +02:00
John Crispin
c9d3245e03 mac80211: dynamically enable the TWT requester support on STA interfaces
Turn TWT for STA interfaces when they associate and/or receive a
beacon where the twt_responder bit has changed.

Signed-off-by: Shashidhar Lakkavalli <slakkavalli@datto.com>
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:12:13 +02:00
Johannes Berg
901bb98918 nl80211: require and validate vendor command policy
Require that each vendor command give a policy of its sub-attributes
in NL80211_ATTR_VENDOR_DATA, and then (stricly) check the contents,
including the NLA_F_NESTED flag that we couldn't check on the outer
layer because there we don't know yet.

It is possible to use VENDOR_CMD_RAW_DATA for raw data, but then no
nested data can be given (NLA_F_NESTED flag must be clear) and the
data is just passed as is to the command.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:12:01 +02:00
John Crispin
d7edf40c15 mac80211: add ieee80211_get_he_iftype_cap() helper
This function is similar to ieee80211_get_he_sta_cap() but allows passing
the iftype. Also make ieee80211_get_he_sta_cap() use the new helper
rather than duplicating the code.

Signed-off-by: Shashidhar Lakkavalli <slakkavalli@datto.com>
Signed-off-by: John Crispin <john@phrozen.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:10:47 +02:00
Chung-Hsien Hsu
26f7044e95 nl80211: add support for SAE authentication offload
Let drivers advertise support for station-mode SAE authentication
offload with a new NL80211_EXT_FEATURE_SAE_OFFLOAD flag.

Signed-off-by: Chung-Hsien Hsu <stanley.hsu@cypress.com>
Signed-off-by: Chi-Hsien Lin <chi-hsien.lin@cypress.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:07:49 +02:00
Alexander Wetzel
90cc4bd611 mac80211: AMPDU handling for Extended Key ID
IEEE 802.11 - 2016 forbids mixing MPDUs with different keyIDs in one
A-MPDU. Drivers supporting A-MPDUs and Extended Key ID must actively
enforce that requirement due to the available two unicast keyIDs.

Allow driver to signal mac80211 that they will not check the keyID in
MPDUs when aggregating them and that they expect mac80211 to stop Tx
aggregation when rekeying a connection using Extended Key ID.

Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-06-14 14:05:35 +02:00
Eric Dumazet
a842fe1425 tcp: add optional per socket transmit delay
Adding delays to TCP flows is crucial for studying behavior
of TCP stacks, including congestion control modules.

Linux offers netem module, but it has unpractical constraints :
- Need root access to change qdisc
- Hard to setup on egress if combined with non trivial qdisc like FQ
- Single delay for all flows.

EDT (Earliest Departure Time) adoption in TCP stack allows us
to enable a per socket delay at a very small cost.

Networking tools can now establish thousands of flows, each of them
with a different delay, simulating real world conditions.

This requires FQ packet scheduler or a EDT-enabled NIC.

This patchs adds TCP_TX_DELAY socket option, to set a delay in
usec units.

  unsigned int tx_delay = 10000; /* 10 msec */

  setsockopt(fd, SOL_TCP, TCP_TX_DELAY, &tx_delay, sizeof(tx_delay));

Note that FQ packet scheduler limits might need some tweaking :

man tc-fq

PARAMETERS
   limit
       Hard  limit  on  the  real  queue  size. When this limit is
       reached, new packets are dropped. If the value is  lowered,
       packets  are  dropped so that the new limit is met. Default
       is 10000 packets.

   flow_limit
       Hard limit on the maximum  number  of  packets  queued  per
       flow.  Default value is 100.

Use of TCP_TX_DELAY option will increase number of skbs in FQ qdisc,
so packets would be dropped if any of the previous limit is hit.

Use of a jump label makes this support runtime-free, for hosts
never using the option.

Also note that TSQ (TCP Small Queues) limits are slightly changed
with this patch : we need to account that skbs artificially delayed
wont stop us providind more skbs to feed the pipe (netem uses
skb_orphan_partial() for this purpose, but FQ can not use this trick)

Because of that, using big delays might very well trigger
old bugs in TSO auto defer logic and/or sndbuf limited detection.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12 13:05:43 -07:00
Stephen Suryaputra
e1ae5c2ea4 vrf: Increment Icmp6InMsgs on the original netdev
Get the ingress interface and increment ICMP counters based on that
instead of skb->dev when the the dev is a VRF device.

This is a follow up on the following message:
https://www.spinics.net/lists/netdev/msg560268.html

v2: Avoid changing skb->dev since it has unintended effect for local
    delivery (David Ahern).
Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12 11:00:11 -07:00
Maxime Chevallier
f0d2ca1531 net: ethtool: Allow matching on vlan DEI bit
Using ethtool, users can specify a classification action matching on the
full vlan tag, which includes the DEI bit (also previously called CFI).

However, when converting the ethool_flow_spec to a flow_rule, we use
dissector keys to represent the matching patterns.

Since the vlan dissector key doesn't include the DEI bit, this
information was silently discarded when translating the ethtool
flow spec in to a flow_rule.

This commit adds the DEI bit into the vlan dissector key, and allows
propagating the information to the driver when parsing the ethtool flow
spec.

Fixes: eca4205f9e ("ethtool: add ethtool_rx_flow_spec to flow_rule structure translator")
Reported-by: Michał Mirosław <mirq-linux@rere.qmqm.pl>
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-12 10:09:56 -07:00
Jakub Kicinski
5018007409 net/tls: add kernel-driven resync mechanism for TX
TLS offload drivers keep track of TCP seq numbers to make sure
the packets are fed into the HW in order.

When packets get dropped on the way through the stack, the driver
will get out of sync and have to use fallback encryption, but unless
TCP seq number is resynced it will never match the packets correctly
(or even worse - use incorrect record sequence number after TCP seq
wraps).

Existing drivers (mlx5) feed the entire record on every out-of-order
event, allowing FW/HW to always be in sync.

This patch adds an alternative, more akin to the RX resync.  When
driver sees a frame which is past its expected sequence number the
stream must have gotten out of order (if the sequence number is
smaller than expected its likely a retransmission which doesn't
require resync).  Driver will ask the stack to perform TX sync
before it submits the next full record, and fall back to software
crypto until stack has performed the sync.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:27 -07:00
Jakub Kicinski
eeb2efaf36 net/tls: generalize the resync callback
Currently only RX direction is ever resynced, however, TX may
also get out of sequence if packets get dropped on the way to
the driver.  Rename the resync callback and add a direction
parameter.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:27 -07:00
Jakub Kicinski
f953d33ba1 net/tls: add kernel-driven TLS RX resync
TLS offload device may lose sync with the TCP stream if packets
arrive out of order.  Drivers can currently request a resync at
a specific TCP sequence number.  When a record is found starting
at that sequence number kernel will inform the device of the
corresponding record number.

This requires the device to constantly scan the stream for a
known pattern (constant bytes of the header) after sync is lost.

This patch adds an alternative approach which is entirely under
the control of the kernel.  Kernel tracks records it had to fully
decrypt, even though TLS socket is in TLS_HW mode.  If multiple
records did not have any decrypted parts - it's a pretty strong
indication that the device is out of sync.

We choose the min number of fully encrypted records to be 2,
which should hopefully be more than will get retransmitted at
a time.

After kernel decides the device is out of sync it schedules a
resync request.  If the TCP socket is empty the resync gets
performed immediately.  If socket is not empty we leave the
record parser to resync when next record comes.

Before resync in message parser we peek at the TCP socket and
don't attempt the sync if the socket already has some of the
next record queued.

On resync failure (encrypted data continues to flow in) we
retry with exponential backoff, up to once every 128 records
(with a 16k record thats at most once every 2M of data).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:26 -07:00
Jakub Kicinski
fe58a5a02c net/tls: rename handle_device_resync()
handle_device_resync() doesn't describe the function very well.
The function checks if resync should be issued upon parsing of
a new record.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:26 -07:00
Jakub Kicinski
89fec474fa net/tls: pass record number as a byte array
TLS offload code casts record number to a u64.  The buffer
should be aligned to 8 bytes, but its actually a __be64, and
the rest of the TLS code treats it as big int.  Make the
offload callbacks take a byte array, drivers can make the
choice to do the ugly cast if they want to.

Prepare for copying the record number onto the stack by
defining a constant for max size of the byte array.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-11 12:22:26 -07:00
Jonathan Lemon
fada7fdc83 bpf: Allow bpf_map_lookup_elem() on an xskmap
Currently, the AF_XDP code uses a separate map in order to
determine if an xsk is bound to a queue.  Instead of doing this,
have bpf_map_lookup_elem() return a xdp_sock.

Rearrange some xdp_sock members to eliminate structure holes.

Remove selftest - will be added back in later patch.

Signed-off-by: Jonathan Lemon <jonathan.lemon@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-06-10 23:31:26 -07:00
David Ahern
5b98324ebe ipv6: Allow routes to use nexthop objects
Add support for RTA_NH_ID attribute to allow a user to specify a
nexthop id to use with a route. fc_nh_id is added to fib6_config to
hold the value passed in the RTA_NH_ID attribute. If a nexthop id
is given, the gateway, device, encap and multipath attributes can
not be set.

Update ip6_route_del to check metric and protocol before nexthop
specs. If fc_nh_id is set, then it must match the id in the route
entry. Since IPv6 allows delete of a cached entry (an exception),
add ip6_del_cached_rt_nh to cycle through all of the fib6_nh in
a fib entry if it is using a nexthop.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:57 -07:00
David Ahern
493ced1ac4 ipv4: Allow routes to use nexthop objects
Add support for RTA_NH_ID attribute to allow a user to specify a
nexthop id to use with a route. fc_nh_id is added to fib_config to
hold the value passed in the RTA_NH_ID attribute. If a nexthop id
is given, the gateway, device, encap and multipath attributes can
not be set.

Update fib_nh_match to check ids on a route delete.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
David Ahern
f88c9aa12f nexthops: Add ipv6 helper to walk all fib6_nh in a nexthop struct
IPv6 has traditionally had a single fib6_nh per fib6_info. With
nexthops we can have multiple fib6_nh associated with a fib6_info.
Add a nexthop helper to invoke a callback for each fib6_nh in a
'struct nexthop'. If the callback returns non-0, the loop is
stopped and the return value passed to the caller.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-10 10:44:56 -07:00
Eric Dumazet
c67b85558f ipv6: tcp: send consistent autoflowlabel in TIME_WAIT state
In case autoflowlabel is in action, skb_get_hash_flowi6()
derives a non zero skb->hash to the flowlabel.

If skb->hash is zero, a flow dissection is performed.

Since all TCP skbs sent from ESTABLISH state inherit their
skb->hash from sk->sk_txhash, we better keep a copy
of sk->sk_txhash into the TIME_WAIT socket.

After this patch, ACK or RST packets sent on behalf of
a TIME_WAIT socket have the flowlabel that was previously
used by the flow.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 20:10:19 -07:00
Sebastian Andrzej Siewior
6dcdd884e2 net: hwbm: Make the hwbm_pool lock a mutex
Based on review, `lock' is only acquired in hwbm_pool_add() which is
invoked via ->probe(), ->resume() and ->ndo_change_mtu(). Based on this
the lock can become a mutex and there is no need to disable interrupts
during the procedure.
Now that the lock is a mutex, hwbm_pool_add() no longer invokes
hwbm_pool_refill() in an atomic context so we can pass GFP_KERNEL to
hwbm_pool_refill() and remove the `gfp' argument from hwbm_pool_add().

Cc: Thomas Petazzoni <thomas.petazzoni@bootlin.com>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 19:40:10 -07:00
Dan Carpenter
5270041d34 nexthop: off by one in nexthop_mpath_select()
The nhg->nh_entries[] array is allocated in nexthop_grp_alloc() and it
has nhg->num_nh elements so this check should be >= instead of >.

Fixes: 430a049190 ("nexthop: Add support for nexthop groups")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 13:37:25 -07:00
Jarod Wilson
5237ff79b2 bonding: add slave_foo printk macros
Where possible, we generally want both the bond master and the relevant slave
information in message output. Standardize the format using new slave_*
printk macros.

Suggested-by: Joe Perches <joe@perches.com>
CC: Jay Vosburgh <j.vosburgh@gmail.com>
CC: Veaceslav Falico <vfalico@gmail.com>
CC: Andy Gospodarek <andy@greyhouse.net>
CC: netdev@vger.kernel.org
Signed-off-by: Jarod Wilson <jarod@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-09 13:36:01 -07:00
Vladimir Oltean
5e3f847a02 net: dsa: Add teardown callback for drivers
This is helpful for e.g. draining per-driver (not per-port) tagger
queues.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-08 15:20:39 -07:00
Linus Torvalds
9331b6740f SPDX update for 5.2-rc4
Another round of SPDX header file fixes for 5.2-rc4
 
 These are all more "GPL-2.0-or-later" or "GPL-2.0-only" tags being
 added, based on the text in the files.  We are slowly chipping away at
 the 700+ different ways people tried to write the license text.  All of
 these were reviewed on the spdx mailing list by a number of different
 people.
 
 We now have over 60% of the kernel files covered with SPDX tags:
 	$ ./scripts/spdxcheck.py -v 2>&1 | grep Files
 	Files checked:            64533
 	Files with SPDX:          40392
 	Files with errors:            0
 
 I think the majority of the "easy" fixups are now done, it's now the
 start of the longer-tail of crazy variants to wade through.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXPuGTg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykBvQCg2SG+HmDH+tlwKLT/q7jZcLMPQigAoMpt9Uuy
 sxVEiFZo8ZU9v1IoRb1I
 =qU++
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull yet more SPDX updates from Greg KH:
 "Another round of SPDX header file fixes for 5.2-rc4

  These are all more "GPL-2.0-or-later" or "GPL-2.0-only" tags being
  added, based on the text in the files. We are slowly chipping away at
  the 700+ different ways people tried to write the license text. All of
  these were reviewed on the spdx mailing list by a number of different
  people.

  We now have over 60% of the kernel files covered with SPDX tags:
	$ ./scripts/spdxcheck.py -v 2>&1 | grep Files
	Files checked:            64533
	Files with SPDX:          40392
	Files with errors:            0

  I think the majority of the "easy" fixups are now done, it's now the
  start of the longer-tail of crazy variants to wade through"

* tag 'spdx-5.2-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (159 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 450
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 449
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 448
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 446
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 445
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 444
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 443
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 442
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 441
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 440
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 438
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 437
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 436
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 435
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 434
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 433
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 432
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 431
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 430
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 429
  ...
2019-06-08 12:52:42 -07:00
David S. Miller
a6cdeeb16b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Some ISDN files that got removed in net-next had some changes
done in mainline, take the removals.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-07 11:00:14 -07:00
Linus Torvalds
1e1d926369 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Free AF_PACKET po->rollover properly, from Willem de Bruijn.

 2) Read SFP eeprom in max 16 byte increments to avoid problems with
    some SFP modules, from Russell King.

 3) Fix UDP socket lookup wrt. VRF, from Tim Beale.

 4) Handle route invalidation properly in s390 qeth driver, from Julian
    Wiedmann.

 5) Memory leak on unload in RDS, from Zhu Yanjun.

 6) sctp_process_init leak, from Neil HOrman.

 7) Fix fib_rules rule insertion semantic change that broke Android,
    from Hangbin Liu.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (33 commits)
  pktgen: do not sleep with the thread lock held.
  net: mvpp2: Use strscpy to handle stat strings
  net: rds: fix memory leak in rds_ib_flush_mr_pool
  ipv6: fix EFAULT on sendto with icmpv6 and hdrincl
  ipv6: use READ_ONCE() for inet->hdrincl as in ipv4
  Revert "fib_rules: return 0 directly if an exactly same rule exists when NLM_F_EXCL not supplied"
  net: aquantia: fix wol configuration not applied sometimes
  ethtool: fix potential userspace buffer overflow
  Fix memory leak in sctp_process_init
  net: rds: fix memory leak when unload rds_rdma
  ipv6: fix the check before getting the cookie in rt6_get_cookie
  ipv4: not do cache for local delivery if bc_forwarding is enabled
  s390/qeth: handle error when updating TX queue count
  s390/qeth: fix VLAN attribute in bridge_hostnotify udev event
  s390/qeth: check dst entry before use
  s390/qeth: handle limited IPv4 broadcast in L3 TX path
  net: fix indirect calls helpers for ptype list hooks.
  net: ipvlan: Fix ipvlan device tso disabled while NETIF_F_IP_CSUM is set
  udp: only choose unbound UDP socket for multicast when not in a VRF
  net/tls: replace the sleeping lock around RX resync with a bit lock
  ...
2019-06-07 09:29:14 -07:00
Dirk van der Merwe
b9727d7f95 net/tls: export TLS per skb encryption
While offloading TLS connections, drivers need to handle the case where
out of order packets need to be transmitted.

Other drivers obtain the entire TLS record for the specific skb to
provide as context to hardware for encryption. However, other designs
may also want to keep the hardware state intact and perform the
out of order encryption entirely on the host.

To achieve this, export the already existing software encryption
fallback path so drivers could access this.

Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06 14:13:40 -07:00
Jakub Kicinski
2e361176ea net/tls: simplify driver context retrieval
Currently drivers have to ensure the alignment of their tls state
structure, which leads to unnecessary layers of getters and
encapsulated structures in each driver.

Simplify all this by marking the driver state as aligned (driver_state
members are currently aligned, so no hole is added, besides ALIGN in
TLS_OFFLOAD_CONTEXT_SIZE_RX/TX would reserve this extra space, anyway.)
With that we can add a common accessor to the core.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06 14:13:40 -07:00
Jakub Kicinski
2d6b51c692 net/tls: split the TLS_DRIVER_STATE_SIZE and bump TX to 16 bytes
8 bytes of driver state has been enough so far, but for drivers
which have to store 8 byte handle it's no longer practical to
store the state directly in the context.

Drivers generally don't need much extra state on RX side, while
TX side has to be tracking TCP sequence numbers.  Split the
lengths of max driver state size on RX and TX.

The struct tls_offload_context_tx currently stands at 616 bytes and
struct tls_offload_context_rx stands at 368 bytes.  Upcoming work
will consume extra 8 bytes in both for kernel-driven resync.
This means that we can bump TX side to 16 bytes and still fit
into the same number of cache lines but on RX side we would be 8
bytes over.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06 14:13:40 -07:00
Litao jiao
fe1e0713bb vxlan: Use FDB_HASH_SIZE hash_locks to reduce contention
The monolithic hash_lock could cause huge contention when
inserting/deletiing vxlan_fdbs into the fdb_head.

Use FDB_HASH_SIZE hash_locks to protect insertions/deletions
of vxlan_fdbs into the fdb_head hash table.

Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Litao jiao <jiaolitao@raisecom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-06 11:08:55 -07:00
Florian Westphal
4f518e802c xfrm: remove type and offload_type map from xfrm_state_afinfo
Only a handful of xfrm_types exist, no need to have 512 pointers for them.

Reduces size of afinfo struct from 4k to 120 bytes on 64bit platforms.

Also, the unregister function doesn't need to return an error, no single
caller does anything useful with it.

Just place a WARN_ON() where needed instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-06 08:34:50 +02:00
Florian Westphal
4c203b0454 xfrm: remove eth_proto value from xfrm_state_afinfo
xfrm_prepare_input needs to lookup the state afinfo backend again to fetch
the address family ethernet protocol value.

There are only two address families, so a switch statement is simpler.
While at it, use u8 for family and proto and remove the owner member --
its not used anywhere.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-06 08:34:50 +02:00
Florian Westphal
3aaf3915a3 xfrm: remove state and template sort indirections from xfrm_state_afinfo
No module dependency, placing this in xfrm_state.c avoids need for
an indirection.

This also removes the state spinlock -- I don't see why we would need
to hold it during sorting.

This in turn allows to remove the 'net' argument passed to
xfrm_tmpl_sort.  Last, remove the EXPORT_SYMBOL, there are no modular
callers.

For the CONFIG_IPV6=m case, vmlinux size increase is about 300 byte.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-06 08:34:50 +02:00
Xin Long
b7999b0772 ipv6: fix the check before getting the cookie in rt6_get_cookie
In Jianlin's testing, netperf was broken with 'Connection reset by peer',
as the cookie check failed in rt6_check() and ip6_dst_check() always
returned NULL.

It's caused by Commit 93531c6743 ("net/ipv6: separate handling of FIB
entries from dst based routes"), where the cookie can be got only when
'c1'(see below) for setting dst_cookie whereas rt6_check() is called
when !'c1' for checking dst_cookie, as we can see in ip6_dst_check().

Since in ip6_dst_check() both rt6_dst_from_check() (c1) and rt6_check()
(!c1) will check the 'from' cookie, this patch is to remove the c1 check
in rt6_get_cookie(), so that the dst_cookie can always be set properly.

c1:
  (rt->rt6i_flags & RTF_PCPU || unlikely(!list_empty(&rt->rt6i_uncached)))

Fixes: 93531c6743 ("net/ipv6: separate handling of FIB entries from dst based routes")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-05 17:00:29 -07:00
Thomas Gleixner
46fe777164 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 399
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms and conditions of the gnu general public license
  version 2 as published by the free software foundation this program
  is distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not see http www gnu org
  licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 33 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531081038.745679586@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:12 +02:00
Thomas Gleixner
ddc64d0ac9 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 363
Based on 1 normalized pattern(s):

  released under terms in gpl version 2 see copying

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 5 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Armijn Hemel <armijn@tjaldur.nl>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190531081035.689962394@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:09 +02:00
Thomas Gleixner
2b27bdcc20 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 336
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not write to the free
  software foundation inc 51 franklin st fifth floor boston ma 02110
  1301 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 246 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000436.674189849@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:07 +02:00
Thomas Gleixner
a61127c213 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 335
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms and conditions of the gnu general public license
  version 2 as published by the free software foundation this program
  is distributed in the hope it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not write to the free
  software foundation inc 51 franklin st fifth floor boston ma 02110
  1301 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 111 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000436.567572064@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:06 +02:00
Thomas Gleixner
685a6bf848 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 321
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation version 2 and no later version this
  program is distributed in the hope that it will be useful but
  without any warranty without even the implied warranty of
  merchantability or fitness for a particular purpose see the gnu
  general public license for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 33 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190530000435.345978407@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:37:05 +02:00
Thomas Gleixner
5b497af42f treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 295
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of version 2 of the gnu general public license as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 64 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141901.894819585@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:36:38 +02:00
Thomas Gleixner
2025cf9e19 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 288
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms and conditions of the gnu general public license
  version 2 as published by the free software foundation this program
  is distributed in the hope it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 263 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141901.208660670@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:36:37 +02:00
Thomas Gleixner
04672fe6d6 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 268
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not write to the free
  software foundation inc 51 franklin street fifth floor boston ma
  02110 1301 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 46 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141334.135501091@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:30:29 +02:00
Thomas Gleixner
fb9e53cce7 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 257
Based on 1 normalized pattern(s):

  gpl v2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 19 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190529141333.108140152@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-06-05 17:30:27 +02:00
Florian Westphal
e46817472a xfrm: remove init_flags indirection from xfrm_state_afinfo
There is only one implementation of this function; just call it directly.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-05 13:16:30 +02:00
Florian Westphal
5c1b9ab3ec xfrm: remove init_temprop indirection from xfrm_state_afinfo
same as previous patch: just place this in the caller, no need to
have an indirection for a structure initialization.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-05 13:16:30 +02:00
Florian Westphal
bac9593515 xfrm: remove init_tempsel indirection from xfrm_state_afinfo
Simple initialization, handle it in the caller.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-06-05 13:16:30 +02:00
David Ahern
f88d8ea67f ipv6: Plumb support for nexthop object in a fib6_info
Add struct nexthop and nh_list list_head to fib6_info. nh_list is the
fib6_info side of the nexthop <-> fib_info relationship. Since a fib6_info
referencing a nexthop object can not have 'sibling' entries (the old way
of doing multipath routes), the nh_list is a union with fib6_siblings.

Add f6i_list list_head to 'struct nexthop' to track fib6_info entries
using a nexthop instance. Update __remove_nexthop_fib to walk f6_list
and delete fib entries using the nexthop.

Add a few nexthop helpers for use when a nexthop is added to fib6_info:
- nexthop_fib6_nh - return first fib6_nh in a nexthop object
- fib6_info_nh_dev moved to nexthop.h and updated to use nexthop_fib6_nh
  if the fib6_info references a nexthop object
- nexthop_path_fib6_result - similar to ipv4, select a path within a
  multipath nexthop object. If the nexthop is a blackhole, set
  fib6_result type to RTN_BLACKHOLE, and set the REJECT flag

Update the fib6_info references to check for nh and take a different path
as needed:
- rt6_qualify_for_ecmp - if a fib entry uses a nexthop object it can NOT
  be coalesced with other fib entries into a multipath route
- rt6_duplicate_nexthop - use nexthop_cmp if either fib6_info references
  a nexthop
- addrconf (host routes), RA's and info entries (anything configured via
  ndisc) does not use nexthop objects
- fib6_info_destroy_rcu - put reference to nexthop object
- fib6_purge_rt - drop fib6_info from f6i_list
- fib6_select_path - update to use the new nexthop_path_fib6_result when
  fib entry uses a nexthop object
- rt6_device_match - update to catch use of nexthop object as a blackhole
  and set fib6_type and flags.
- ip6_route_info_create - don't add space for fib6_nh if fib entry is
  going to reference a nexthop object, take a reference to nexthop object,
  disallow use of source routing
- rt6_nlmsg_size - add space for RTA_NH_ID
- add rt6_fill_node_nexthop to add nexthop data on a dump

As with ipv4, most of the changes push existing code into the else branch
of whether the fib entry uses a nexthop object.

Update the nexthop code to walk f6i_list on a nexthop deleted to remove
fib entries referencing it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 19:26:50 -07:00
David Ahern
4c7e8084fd ipv4: Plumb support for nexthop object in a fib_info
Add 'struct nexthop' and nh_list list_head to fib_info. nh_list is the
fib_info side of the nexthop <-> fib_info relationship.

Add fi_list list_head to 'struct nexthop' to track fib_info entries
using a nexthop instance. Add __remove_nexthop_fib and add it to
__remove_nexthop to walk the new list_head and mark those fib entries
as dead when the nexthop is deleted.

Add a few nexthop helpers for use when a nexthop is added to fib_info:
- nexthop_cmp to determine if 2 nexthops are the same
- nexthop_path_fib_result to select a path for a multipath
  'struct nexthop'
- nexthop_fib_nhc to select a specific fib_nh_common within a
  multipath 'struct nexthop'

Update existing fib_info_nhc to use nexthop_fib_nhc if a fib_info uses
a 'struct nexthop', and mark fib_info_nh as only used for the non-nexthop
case.

Update the fib_info functions to check for fi->nh and take a different
path as needed:
- free_fib_info_rcu - put the nexthop object reference
- fib_release_info - remove the fib_info from the nexthop's fi_list
- nh_comp - use nexthop_cmp when either fib_info references a nexthop
  object
- fib_info_hashfn - use the nexthop id for the hashing vs the oif of
  each fib_nh in a fib_info
- fib_nlmsg_size - add space for the RTA_NH_ID attribute
- fib_create_info - verify nexthop reference can be taken, verify
  nexthop spec is valid for fib entry, and add fib_info to fi_list for
  a nexthop
- fib_select_multipath - use the new nexthop_path_fib_result to select a
  path when nexthop objects are used
- fib_table_lookup - if the 'struct nexthop' is a blackhole nexthop, treat
  it the same as a fib entry using 'blackhole'

The bulk of the changes are in fib_semantics.c and most of that is
moving the existing change_nexthops into an else branch.

Update the nexthop code to walk fi_list on a nexthop deleted to remove
fib entries referencing it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 19:26:49 -07:00
David Ahern
dcb1ecb50e ipv4: Prepare for fib6_nh from a nexthop object
Convert more IPv4 code to use fib_nh_common over fib_nh to enable routes
to use a fib6_nh based nexthop. In the end, only code not using a
nexthop object in a fib_info should directly access fib_nh in a fib_info
without checking the famiy and going through fib_nh_common. Those
functions will be marked when it is not directly evident.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 19:26:49 -07:00
David Ahern
5481d73f81 ipv4: Use accessors for fib_info nexthop data
Use helpers to access fib_nh and fib_nhs fields of a fib_info. Drop the
fib_dev macro which is an alias for the first nexthop. Replacements:

  fi->fib_dev    --> fib_info_nh(fi, 0)->fib_nh_dev
  fi->fib_nh     --> fib_info_nh(fi, 0)
  fi->fib_nh[i]  --> fib_info_nh(fi, i)
  fi->fib_nhs    --> fib_info_num_path(fi)

where fib_info_nh(fi, i) returns fi->fib_nh[nhsel] and fib_info_num_path
returns fi->fib_nhs.

Move the existing fib_info_nhc to nexthop.h and define the new ones
there. A later patch adds a check if a fib_info uses a nexthop object,
and defining the helpers in nexthop.h avoid circular header
dependencies.

After this all remaining open coded references to fi->fib_nhs and
fi->fib_nh are in:
- fib_create_info and helpers used to lookup an existing fib_info
  entry, and
- the netdev event functions fib_sync_down_dev and fib_sync_up.

The latter two will not be reused for nexthops, and the fib_create_info
will be updated to handle a nexthop in a fib_info.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 19:26:49 -07:00
Jakub Kicinski
fb0f886fa2 net/tls: don't pass version to tls_advance_record_sn()
All callers pass prot->version as the last parameter
of tls_advance_record_sn(), yet tls_advance_record_sn()
itself needs a pointer to prot.  Pass prot from callers.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:33:50 -07:00
Jakub Kicinski
f0aaa2c975 net/tls: reorganize struct tls_context
struct tls_context is slightly badly laid out.  If we reorder things
right we can save 16 bytes (320 -> 304) but also make all fast path
data fit into two cache lines (one read only and one read/write,
down from four cache lines).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:33:50 -07:00
Jiri Pirko
191ed2024d devlink: allow driver to update progress of flash update
Introduce a function to be called from drivers during flash. It sends
notification to userspace about flash update progress.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 14:21:40 -07:00
Jakub Kicinski
e52972c11d net/tls: replace the sleeping lock around RX resync with a bit lock
Commit 38030d7cb7 ("net/tls: avoid NULL-deref on resync during device removal")
tried to fix a potential NULL-dereference by taking the
context rwsem.  Unfortunately the RX resync may get called
from soft IRQ, so we can't use the rwsem to protect from
the device disappearing.  Because we are guaranteed there
can be only one resync at a time (it's called from strparser)
use a bit to indicate resync is busy and make device
removal wait for the bit to get cleared.

Note that there is a leftover "flags" field in struct
tls_context already.

Fixes: 4799ac81e5 ("tls: Add rx inline crypto offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-04 13:34:37 -07:00
Eric Dumazet
b703414675 net: fix use-after-free in kfree_skb_list
syzbot reported nasty use-after-free [1]

Lets remove frag_list field from structs ip_fraglist_iter
and ip6_fraglist_iter. This seens not needed anyway.

[1] :
BUG: KASAN: use-after-free in kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
Read of size 8 at addr ffff888085a3cbc0 by task syz-executor303/8947

CPU: 0 PID: 8947 Comm: syz-executor303 Not tainted 5.2.0-rc2+ #12
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 kasan_report+0x12/0x20 mm/kasan/common.c:614
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
 kfree_skb_list+0x5d/0x60 net/core/skbuff.c:706
 ip6_fragment+0x1ef4/0x2680 net/ipv6/ip6_output.c:882
 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
 dst_output include/net/dst.h:433 [inline]
 ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
 rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:671
 ___sys_sendmsg+0x803/0x920 net/socket.c:2292
 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
 __do_sys_sendmsg net/socket.c:2339 [inline]
 __se_sys_sendmsg net/socket.c:2337 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x44add9
Code: e8 7c e6 ff ff 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 1b 05 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f826f33bce8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
RAX: ffffffffffffffda RBX: 00000000006e7a18 RCX: 000000000044add9
RDX: 0000000000000000 RSI: 0000000020000240 RDI: 0000000000000005
RBP: 00000000006e7a10 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006e7a1c
R13: 00007ffcec4f7ebf R14: 00007f826f33c9c0 R15: 20c49ba5e353f7cf

Allocated by task 8947:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_kmalloc mm/kasan/common.c:489 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
 kasan_slab_alloc+0xf/0x20 mm/kasan/common.c:497
 slab_post_alloc_hook mm/slab.h:437 [inline]
 slab_alloc_node mm/slab.c:3269 [inline]
 kmem_cache_alloc_node+0x131/0x710 mm/slab.c:3579
 __alloc_skb+0xd5/0x5e0 net/core/skbuff.c:199
 alloc_skb include/linux/skbuff.h:1058 [inline]
 __ip6_append_data.isra.0+0x2a24/0x3640 net/ipv6/ip6_output.c:1519
 ip6_append_data+0x1e5/0x320 net/ipv6/ip6_output.c:1688
 rawv6_sendmsg+0x1467/0x35e0 net/ipv6/raw.c:940
 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:671
 ___sys_sendmsg+0x803/0x920 net/socket.c:2292
 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
 __do_sys_sendmsg net/socket.c:2339 [inline]
 __se_sys_sendmsg net/socket.c:2337 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 8947:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
 __cache_free mm/slab.c:3432 [inline]
 kmem_cache_free+0x86/0x260 mm/slab.c:3698
 kfree_skbmem net/core/skbuff.c:625 [inline]
 kfree_skbmem+0xc5/0x150 net/core/skbuff.c:619
 __kfree_skb net/core/skbuff.c:682 [inline]
 kfree_skb net/core/skbuff.c:699 [inline]
 kfree_skb+0xf0/0x390 net/core/skbuff.c:693
 kfree_skb_list+0x44/0x60 net/core/skbuff.c:708
 __dev_xmit_skb net/core/dev.c:3551 [inline]
 __dev_queue_xmit+0x3034/0x36b0 net/core/dev.c:3850
 dev_queue_xmit+0x18/0x20 net/core/dev.c:3914
 neigh_direct_output+0x16/0x20 net/core/neighbour.c:1532
 neigh_output include/net/neighbour.h:511 [inline]
 ip6_finish_output2+0x1034/0x2550 net/ipv6/ip6_output.c:120
 ip6_fragment+0x1ebb/0x2680 net/ipv6/ip6_output.c:863
 __ip6_finish_output+0x577/0xaa0 net/ipv6/ip6_output.c:144
 ip6_finish_output+0x38/0x1f0 net/ipv6/ip6_output.c:156
 NF_HOOK_COND include/linux/netfilter.h:294 [inline]
 ip6_output+0x235/0x7f0 net/ipv6/ip6_output.c:179
 dst_output include/net/dst.h:433 [inline]
 ip6_local_out+0xbb/0x1b0 net/ipv6/output_core.c:179
 ip6_send_skb+0xbb/0x350 net/ipv6/ip6_output.c:1796
 ip6_push_pending_frames+0xc8/0xf0 net/ipv6/ip6_output.c:1816
 rawv6_push_pending_frames net/ipv6/raw.c:617 [inline]
 rawv6_sendmsg+0x2993/0x35e0 net/ipv6/raw.c:947
 inet_sendmsg+0x141/0x5d0 net/ipv4/af_inet.c:802
 sock_sendmsg_nosec net/socket.c:652 [inline]
 sock_sendmsg+0xd7/0x130 net/socket.c:671
 ___sys_sendmsg+0x803/0x920 net/socket.c:2292
 __sys_sendmsg+0x105/0x1d0 net/socket.c:2330
 __do_sys_sendmsg net/socket.c:2339 [inline]
 __se_sys_sendmsg net/socket.c:2337 [inline]
 __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2337
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff888085a3cbc0
 which belongs to the cache skbuff_head_cache of size 224
The buggy address is located 0 bytes inside of
 224-byte region [ffff888085a3cbc0, ffff888085a3cca0)
The buggy address belongs to the page:
page:ffffea0002168f00 refcount:1 mapcount:0 mapping:ffff88821b6f63c0 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea00027bbf88 ffffea0002105b88 ffff88821b6f63c0
raw: 0000000000000000 ffff888085a3c080 000000010000000c 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888085a3ca80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
 ffff888085a3cb00: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc
>ffff888085a3cb80: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                           ^
 ffff888085a3cc00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff888085a3cc80: fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc fc

Fixes: 0feca6190f ("net: ipv6: add skbuff fraglist splitter")
Fixes: c8b17be0b7 ("net: ipv4: add skbuff fraglist splitter")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03 15:18:00 -07:00
Edward Cree
fa85999f49 flow_offload: include linux/kernel.h from flow_offload.h
flow_stats_update() uses max_t, so ensure we have that defined.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03 14:57:04 -07:00
Stanislav Fomichev
1cc26450a8 flow_dissector: remove unused FLOW_DISSECTOR_F_STOP_AT_L3 flag
This flag is not used by any caller, remove it.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-03 14:56:35 -07:00
David S. Miller
c1e9e01d42 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset container Netfilter/IPVS update for net-next:

1) Add UDP tunnel support for ICMP errors in IPVS.

Julian Anastasov says:

This patchset is a followup to the commit that adds UDP/GUE tunnel:
"ipvs: allow tunneling with gue encapsulation".

What we do is to put tunnel real servers in hash table (patch 1),
add function to lookup tunnels (patch 2) and use it to strip the
embedded tunnel headers from ICMP errors (patch 3).

2) Extend xt_owner to match for supplementary groups, from
   Lukasz Pawelczyk.

3) Remove unused oif field in flow_offload_tuple object, from
   Taehee Yoo.

4) Release basechain counters from workqueue to skip synchronize_rcu()
   call. From Florian Westphal.

5) Replace skb_make_writable() by skb_ensure_writable(). Patchset
   from Florian Westphal.

6) Checksum support for gue encapsulation in IPVS, from Jacky Hu.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-06-01 16:21:19 -07:00
David S. Miller
b4b12b0d2f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The phylink conflict was between a bug fix by Russell King
to make sure we have a consistent PHY interface mode, and
a change in net-next to pull some code in phylink_resolve()
into the helper functions phylink_mac_link_{up,down}()

On the dp83867 side it's mostly overlapping changes, with
the 'net' side removing a condition that was supposed to
trigger for RGMII but because of how it was coded never
actually could trigger.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-31 10:49:43 -07:00
Jacky Hu
29930e314d ipvs: add checksum support for gue encapsulation
Add checksum support for gue encapsulation with the tun_flags parameter,
which could be one of the values below:
IP_VS_TUNNEL_ENCAP_FLAG_NOCSUM
IP_VS_TUNNEL_ENCAP_FLAG_CSUM
IP_VS_TUNNEL_ENCAP_FLAG_REMCSUM

Signed-off-by: Jacky Hu <hengqing.hu@gmail.com>
Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 18:23:52 +02:00
Taehee Yoo
5e2ad02e90 netfilter: nf_flow_table: remove unnecessary variable in flow_offload_tuple
The oifidx in the struct flow_offload_tuple is not used anymore.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 18:02:42 +02:00
Julian Anastasov
2aa3c9f48b ipvs: add function to find tunnels
Add ip_vs_find_tunnel() to match tunnel headers
by family, address and optional port. Use it to
properly find the tunnel real server used in
received ICMP errors.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 17:48:09 +02:00
Julian Anastasov
1da40ab6ca ipvs: allow rs_table to contain different real server types
Before now rs_table was used only for NAT real servers.
Change it to allow TUN real severs from different types,
possibly hashed with different port key.

Signed-off-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-31 17:47:57 +02:00
Linus Torvalds
2f4c533499 SPDX update for 5.2-rc3, round 1
Here is another set of reviewed patches that adds SPDX tags to different
 kernel files, based on a set of rules that are being used to parse the
 comments to try to determine that the license of the file is
 "GPL-2.0-or-later" or "GPL-2.0-only".  Only the "obvious" versions of
 these matches are included here, a number of "non-obvious" variants of
 text have been found but those have been postponed for later review and
 analysis.
 
 There is also a patch in here to add the proper SPDX header to a bunch
 of Kbuild files that we have missed in the past due to new files being
 added and forgetting that Kbuild uses two different file names for
 Makefiles.  This issue was reported by the Kbuild maintainer.
 
 These patches have been out for review on the linux-spdx@vger mailing
 list, and while they were created by automatic tools, they were
 hand-verified by a bunch of different people, all whom names are on the
 patches are reviewers.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXPCHLg8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ykxyACgql6ktH+Tv8Ho1747kKPiFca1Jq0AoK5HORXI
 yB0DSTXYNjMtH41ypnsZ
 =x2f8
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull yet more SPDX updates from Greg KH:
 "Here is another set of reviewed patches that adds SPDX tags to
  different kernel files, based on a set of rules that are being used to
  parse the comments to try to determine that the license of the file is
  "GPL-2.0-or-later" or "GPL-2.0-only". Only the "obvious" versions of
  these matches are included here, a number of "non-obvious" variants of
  text have been found but those have been postponed for later review
  and analysis.

  There is also a patch in here to add the proper SPDX header to a bunch
  of Kbuild files that we have missed in the past due to new files being
  added and forgetting that Kbuild uses two different file names for
  Makefiles. This issue was reported by the Kbuild maintainer.

  These patches have been out for review on the linux-spdx@vger mailing
  list, and while they were created by automatic tools, they were
  hand-verified by a bunch of different people, all whom names are on
  the patches are reviewers"

* tag 'spdx-5.2-rc3-1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (82 commits)
  treewide: Add SPDX license identifier - Kbuild
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 225
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 224
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 223
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 222
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 221
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 220
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 218
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 217
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 216
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 215
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 214
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 213
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 211
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 210
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 209
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 207
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 206
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 203
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 201
  ...
2019-05-31 08:34:32 -07:00
Linus Torvalds
036e343109 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix OOPS during nf_tables rule dump, from Florian Westphal.

 2) Use after free in ip_vs_in, from Yue Haibing.

 3) Fix various kTLS bugs (NULL deref during device removal resync,
    netdev notification ignoring, etc.) From Jakub Kicinski.

 4) Fix ipv6 redirects with VRF, from David Ahern.

 5) Memory leak fix in igmpv3_del_delrec(), from Eric Dumazet.

 6) Missing memory allocation failure check in ip6_ra_control(), from
    Gen Zhang. And likewise fix ip_ra_control().

 7) TX clean budget logic error in aquantia, from Igor Russkikh.

 8) SKB leak in llc_build_and_send_ui_pkt(), from Eric Dumazet.

 9) Double frees in mlx5, from Parav Pandit.

10) Fix lost MAC address in r8169 during PCI D3, from Heiner Kallweit.

11) Fix botched register access in mvpp2, from Antoine Tenart.

12) Use after free in napi_gro_frags(), from Eric Dumazet.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (89 commits)
  net: correct zerocopy refcnt with udp MSG_MORE
  ethtool: Check for vlan etype or vlan tci when parsing flow_rule
  net: don't clear sock->sk early to avoid trouble in strparser
  net-gro: fix use-after-free read in napi_gro_frags()
  net: dsa: tag_8021q: Create a stable binary format
  net: dsa: tag_8021q: Change order of rx_vid setup
  net: mvpp2: fix bad MVPP2_TXQ_SCHED_TOKEN_CNTR_REG queue value
  ipv4: tcp_input: fix stack out of bounds when parsing TCP options.
  mlxsw: spectrum: Prevent force of 56G
  mlxsw: spectrum_acl: Avoid warning after identical rules insertion
  net: dsa: mv88e6xxx: fix handling of upper half of STATS_TYPE_PORT
  r8169: fix MAC address being lost in PCI D3
  net: core: support XDP generic on stacked devices.
  netvsc: unshare skb in VF rx handler
  udp: Avoid post-GRO UDP checksum recalculation
  net: phy: dp83867: Set up RGMII TX delay
  net: phy: dp83867: do not call config_init twice
  net: phy: dp83867: increase SGMII autoneg timer duration
  net: phy: dp83867: fix speed 10 in sgmii mode
  net: phy: marvell10g: report if the PHY fails to boot firmware
  ...
2019-05-30 21:11:22 -07:00
Matteo Croce
c3e933a5b8 sctp: deduplicate identical skb_checksum_ops
The same skb_checksum_ops struct is defined twice in two different places,
leading to code duplication. Declare it as a global variable into a common
header instead of allocating it on the stack on each function call.
bloat-o-meter reports a slight code shrink.

add/remove: 1/1 grow/shrink: 0/10 up/down: 128/-1282 (-1154)
Function                                     old     new   delta
sctp_csum_ops                                  -     128    +128
crc32c_csum_ops                               16       -     -16
sctp_rcv                                    6616    6583     -33
sctp_packet_pack                            4542    4504     -38
nf_conntrack_sctp_packet                    4980    4926     -54
execute_masked_set_action                   6453    6389     -64
tcf_csum_sctp                                575     428    -147
sctp_gso_segment                            1292    1126    -166
sctp_csum_check                              579     412    -167
sctp_snat_handler                            957     772    -185
sctp_dnat_handler                           1321    1132    -189
l4proto_manip_pkt                           2536    2313    -223
Total: Before=359297613, After=359296459, chg -0.00%

Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:35:44 -07:00
Pablo Neira Ayuso
3c171f496e netfilter: bridge: add connection tracking system
This patch adds basic connection tracking support for the bridge,
including initial IPv4 support.

This patch register two hooks to deal with the bridge forwarding path,
one from the bridge prerouting hook to call nf_conntrack_in(); and
another from the bridge postrouting hook to confirm the entry.

The conntrack bridge prerouting hook defragments packets before passing
them to nf_conntrack_in() to look up for an existing entry, otherwise a
new entry is allocated and it is attached to the skbuff. The conntrack
bridge postrouting hook confirms new conntrack entries, ie. if this is
the first packet seen, then it adds the entry to the hashtable and (if
needed) it refragments the skbuff into the original fragments, leaving
the geometry as is if possible. Exceptions are linearized skbuffs, eg.
skbuffs that are passed up to nfqueue and conntrack helpers, as well as
cloned skbuff for the local delivery (eg. tcpdump), also in case of
bridge port flooding (cloned skbuff too).

The packet defragmentation is done through the ip_defrag() call.  This
forces us to save the bridge control buffer, reset the IP control buffer
area and then restore it after call. This function also bumps the IP
fragmentation statistics, it would be probably desiderable to have
independent statistics for the bridge defragmentation/refragmentation.
The maximum fragment length is stored in the control buffer and it is
used to refragment the skbuff from the postrouting path.

The new fraglist splitter and fragment transformer APIs are used to
implement the bridge refragmentation code. The br_ip_fragment() function
drops the packet in case the maximum fragment size seen is larger than
the output port MTU.

This patchset follows the principle that conntrack should not drop
packets, so users can do it through policy via invalid state matching.

Like br_netfilter, there is no refragmentation for packets that are
passed up for local delivery, ie. prerouting -> input path. There are
calls to nf_reset() already in several spots in the stack since time ago
already, eg. af_packet, that show that skbuff fraglist handling from the
netif_rx path is supported already.

The helpers are called from the postrouting hook, before confirmation,
from there we may see packet floods to bridge ports. Then, although
unlikely, this may result in exercising the helpers many times for each
clone. It would be good to explore how to pass all the packets in a list
to the conntrack hook to do this handle only once for this case.

Thanks to Florian Westphal for handing me over an initial patchset
version to add support for conntrack bridge.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:18:18 -07:00
Pablo Neira Ayuso
d035f19f59 netfilter: nf_conntrack: allow to register bridge support
This patch adds infrastructure to register and to unregister bridge
support for the conntrack module via nf_ct_bridge_register() and
nf_ct_bridge_unregister().

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:18:18 -07:00
Pablo Neira Ayuso
8a6a1f1764 net: ipv6: split skbuff into fragments transformer
This patch exposes a new API to refragment a skbuff. This allows you to
split either a linear skbuff or to force the refragmentation of an
existing fraglist using a different mtu. The API consists of:

* ip6_frag_init(), that initializes the internal state of the transformer.
* ip6_frag_next(), that allows you to fetch the next fragment. This function
  internally allocates the skbuff that represents the fragment, it pushes
  the IPv6 header, and it also copies the payload for each fragment.

The ip6_frag_state object stores the internal state of the splitter.

This code has been extracted from ip6_fragment(). Symbols are also
exported to allow to reuse this iterator from the bridge codepath to
build its own refragmentation routine by reusing the existing codebase.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:18:17 -07:00
Pablo Neira Ayuso
065ff79f88 net: ipv4: split skbuff into fragments transformer
This patch exposes a new API to refragment a skbuff. This allows you to
split either a linear skbuff or to force the refragmentation of an
existing fraglist using a different mtu. The API consists of:

* ip_frag_init(), that initializes the internal state of the transformer.
* ip_frag_next(), that allows you to fetch the next fragment. This function
  internally allocates the skbuff that represents the fragment, it pushes
  the IPv4 header, and it also copies the payload for each fragment.

The ip_frag_state object stores the internal state of the splitter.

This code has been extracted from ip_do_fragment(). Symbols are also
exported to allow to reuse this iterator from the bridge codepath to
build its own refragmentation routine by reusing the existing codebase.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:18:17 -07:00
Pablo Neira Ayuso
0feca6190f net: ipv6: add skbuff fraglist splitter
This patch adds the skbuff fraglist split iterator. This API provides an
iterator to transform the fraglist into single skbuff objects, it
consists of:

* ip6_fraglist_init(), that initializes the internal state of the
  fraglist iterator.
* ip6_fraglist_prepare(), that restores the IPv6 header on the fragment.
* ip6_fraglist_next(), that retrieves the fragment from the fraglist and
  updates the internal state of the iterator to point to the next
  fragment in the fraglist.

The ip6_fraglist_iter object stores the internal state of the iterator.

This code has been extracted from ip6_fragment(). Symbols are also
exported to allow to reuse this iterator from the bridge codepath to
build its own refragmentation routine by reusing the existing codebase.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:18:17 -07:00
Pablo Neira Ayuso
c8b17be0b7 net: ipv4: add skbuff fraglist splitter
This patch adds the skbuff fraglist splitter. This API provides an
iterator to transform the fraglist into single skbuff objects, it
consists of:

* ip_fraglist_init(), that initializes the internal state of the
  fraglist splitter.
* ip_fraglist_prepare(), that restores the IPv4 header on the
  fragments.
* ip_fraglist_next(), that retrieves the fragment from the fraglist and
  it updates the internal state of the splitter to point to the next
  fragment skbuff in the fraglist.

The ip_fraglist_iter object stores the internal state of the iterator.

This code has been extracted from ip_do_fragment(). Symbols are also
exported to allow to reuse this iterator from the bridge codepath to
build its own refragmentation routine by reusing the existing codebase.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 14:18:17 -07:00
Jason Baron
9092a76d3c tcp: add backup TFO key infrastructure
We would like to be able to rotate TFO keys while minimizing the number of
client cookies that are rejected. Currently, we have only one key which can
be used to generate and validate cookies, thus if we simply replace this
key clients can easily have cookies rejected upon rotation.

We propose having the ability to have both a primary key and a backup key.
The primary key is used to generate as well as to validate cookies.
The backup is only used to validate cookies. Thus, keys can be rotated as:

1) generate new key
2) add new key as the backup key
3) swap the primary and backup key, thus setting the new key as the primary

We don't simply set the new key as the primary key and move the old key to
the backup slot because the ip may be behind a load balancer and we further
allow for the fact that all machines behind the load balancer will not be
updated simultaneously.

We make use of this infrastructure in subsequent patches.

Suggested-by: Igor Lubashev <ilubashe@akamai.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Christoph Paasch <cpaasch@apple.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 13:41:26 -07:00
Thomas Gleixner
9952f6918d treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 201
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms and conditions of the gnu general public license
  version 2 as published by the free software foundation this program
  is distributed in the hope it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not see http www gnu org
  licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 228 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190528171438.107155473@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:29:52 -07:00
Thomas Gleixner
af873fcece treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 194
Based on 1 normalized pattern(s):

  license terms gnu general public license gpl version 2

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 161 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190528170027.447718015@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:29:22 -07:00
Thomas Gleixner
1f32761322 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 188
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with this program if not write to free software
  foundation 51 franklin street fifth floor boston ma 02111 1301 usa

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 27 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Alexios Zavras <alexios.zavras@intel.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190528170026.981318839@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:29:21 -07:00
Thomas Gleixner
1802d0beec treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 174
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license version 2 as
  published by the free software foundation this program is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-only

has been chosen to replace the boilerplate/reference in 655 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070034.575739538@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:41 -07:00
Thomas Gleixner
2874c5fd28 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3029 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-30 11:26:32 -07:00
Sean Tranchetti
f2696099c6 udp: Avoid post-GRO UDP checksum recalculation
Currently, when resegmenting an unexpected UDP GRO packet, the full UDP
checksum will be calculated for every new SKB created by skb_segment()
because the netdev features passed in by udp_rcv_segment() lack any
information about checksum offload capabilities.

Usually, we have no need to perform this calculation again, as
  1) The GRO implementation guarantees that any packets making it to the
     udp_rcv_segment() function had correct checksums, and, more
     importantly,
  2) Upon the successful return of udp_rcv_segment(), we immediately pull
     the UDP header off and either queue the segment to the socket or
     hand it off to a new protocol handler.

Unless userspace has set the IP_CHECKSUM sockopt to indicate that they
want the final checksum values, we can pass the needed netdev feature
flags to __skb_gso_segment() to avoid checksumming each segment in
skb_segment().

Fixes: cf329aa42b ("udp: cope with UDP GRO packet misdirection")
Cc: Paolo Abeni <pabeni@redhat.com>
Cc: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org>
Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-30 11:10:39 -07:00
Ioana Ciornei
44cc27e43f net: phylink: Add struct phylink_config to PHYLINK API
The phylink_config structure will encapsulate a pointer to a struct
device and the operation type requested for this instance of PHYLINK.
This patch does not make any functional changes, it just transitions the
PHYLINK internals and all its users to the new API.

A pointer to a phylink_config structure will be passed to
phylink_create() instead of the net_device directly. Also, the same
phylink_config pointer will be passed back to all phylink_mac_ops
callbacks instead of the net_device. Using this mechanism, a PHYLINK
user can get the original net_device using a structure such as
'to_net_dev(config->dev)' or directly the structure containing the
phylink_config using a container_of call.

At the moment, only the PHYLINK_NETDEV is defined as a valid operation
type for PHYLINK. In this mode, a valid reference to a struct device
linked to the original net_device should be passed to PHYLINK through
the phylink_config structure.

This API changes is mainly driven by the necessity of adding a new
operation type in PHYLINK that disconnects the phy_device from the
net_device and also works when the net_device is lacking.

Signed-off-by: Ioana Ciornei <ioana.ciornei@nxp.com>
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Tested-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-29 21:48:53 -07:00
Kevin 'ldir' Darbyshire-Bryant
24ec483cec net: sched: Introduce act_ctinfo action
ctinfo is a new tc filter action module.  It is designed to restore
information contained in firewall conntrack marks to other packet fields
and is typically used on packet ingress paths.  At present it has two
independent sub-functions or operating modes, DSCP restoration mode &
skb mark restoration mode.

The DSCP restore mode:

This mode copies DSCP values that have been placed in the firewall
conntrack mark back into the IPv4/v6 diffserv fields of relevant
packets.

The DSCP restoration is intended for use and has been found useful for
restoring ingress classifications based on egress classifications across
links that bleach or otherwise change DSCP, typically home ISP Internet
links.  Restoring DSCP on ingress on the WAN link allows qdiscs such as
but by no means limited to CAKE to shape inbound packets according to
policies that are easier to set & mark on egress.

Ingress classification is traditionally a challenging task since
iptables rules haven't yet run and tc filter/eBPF programs are pre-NAT
lookups, hence are unable to see internal IPv4 addresses as used on the
typical home masquerading gateway.  Thus marking the connection in some
manner on egress for later restoration of classification on ingress is
easier to implement.

Parameters related to DSCP restore mode:

dscpmask - a 32 bit mask of 6 contiguous bits and indicate bits of the
conntrack mark field contain the DSCP value to be restored.

statemask - a 32 bit mask of (usually) 1 bit length, outside the area
specified by dscpmask.  This represents a conditional operation flag
whereby the DSCP is only restored if the flag is set.  This is useful to
implement a 'one shot' iptables based classification where the
'complicated' iptables rules are only run once to classify the
connection on initial (egress) packet and subsequent packets are all
marked/restored with the same DSCP.  A mask of zero disables the
conditional behaviour ie. the conntrack mark DSCP bits are always
restored to the ip diffserv field (assuming the conntrack entry is found
& the skb is an ipv4/ipv6 type)

e.g. dscpmask 0xfc000000 statemask 0x01000000

|----0xFC----conntrack mark----000000---|
| Bits 31-26 | bit 25 | bit24 |~~~ Bit 0|
| DSCP       | unused | flag  |unused   |
|-----------------------0x01---000000---|
      |                   |
      |                   |
      ---|             Conditional flag
         v             only restore if set
|-ip diffserv-|
| 6 bits      |
|-------------|

The skb mark restore mode (cpmark):

This mode copies the firewall conntrack mark to the skb's mark field.
It is completely the functional equivalent of the existing act_connmark
action with the additional feature of being able to apply a mask to the
restored value.

Parameters related to skb mark restore mode:

mask - a 32 bit mask applied to the firewall conntrack mark to mask out
bits unwanted for restoration.  This can be useful where the conntrack
mark is being used for different purposes by different applications.  If
not specified and by default the whole mark field is copied (i.e.
default mask of 0xffffffff)

e.g. mask 0x00ffffff to mask out the top 8 bits being used by the
aforementioned DSCP restore mode.

|----0x00----conntrack mark----ffffff---|
| Bits 31-24 |                          |
| DSCP & flag|      some value here     |
|---------------------------------------|
			|
			|
			v
|------------skb mark-------------------|
|            |                          |
|  zeroed    |                          |
|---------------------------------------|

Overall parameters:

zone - conntrack zone

control - action related control (reclassify | pipe | drop | continue |
ok | goto chain <CHAIN_INDEX>)

Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-29 21:43:54 -07:00
David Ahern
430a049190 nexthop: Add support for nexthop groups
Allow the creation of nexthop groups which reference other nexthop
objects to create multipath routes:

                      +--------------+
   +------------+   +--------------+ |
   | nh  nh_grp --->| nh_grp_entry |-+
   +------------+   +---------|----+
     ^                |       |    +------------+
     +----------------+       +--->| nh, weight |
        nh_parent                  +------------+

A group entry points to a nexthop with a weight for that hop within the
group. The nexthop has a list_head, grp_list, for tracking which groups
it is a member of and the group entry has a reference back to the parent.
The grp_list is used when a nexthop is deleted - to efficiently remove
it from groups using it.

If a nexthop group spec is given, no other attributes can be set. Each
nexthop id in a group spec must already exist.

Similar to single nexthops, the specification of a nexthop group can be
updated so that data is managed with rcu locking.

Add path selection function to account for multiple paths and add
ipv{4,6}_good_nh helpers to know that if a neighbor entry exists it is
in a good state.

Update NETDEV event handling to rebalance multipath nexthop groups if
a nexthop is deleted due to a link event (down or unregister).

When a nexthop is removed any groups using it are updated. Groups using a
nexthop a tracked via a grp_list.

Nexthop dumps can be limited to groups only by adding NHA_GROUPS to the
request.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 21:37:30 -07:00
David Ahern
b513bd035f nexthop: Add support for lwt encaps
Add support for NHA_ENCAP and NHA_ENCAP_TYPE. Leverages the existing code
for lwtunnel within fib_nh_common, so the only change needed is handling
the attributes in the nexthop code.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 21:37:30 -07:00
David Ahern
53010f991a nexthop: Add support for IPv6 gateways
Handle IPv6 gateway in a nexthop spec. If nh_family is set to AF_INET6,
NHA_GATEWAY is expected to be an IPv6 address. Add ipv6 option to gw in
nh_config to hold the address, add fib6_nh to nh_info to leverage the
ipv6 initialization and cleanup code. Update nh_fill_node to dump the v6
address.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 21:37:30 -07:00
David Ahern
597cfe4fc3 nexthop: Add support for IPv4 nexthops
Add support for IPv4 nexthops. If nh_family is set to AF_INET, then
NHA_GATEWAY is expected to be an IPv4 address.

Register for netdev events to be notified of admin up/down changes as
well as deletes. A hash table is used to track nexthop per devices to
quickly convert device events to the affected nexthops.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 21:37:30 -07:00
David Ahern
ab84be7e54 net: Initial nexthop code
Barebones start point for nexthops. Implementation for RTM commands,
notifications, management of rbtree for holding nexthops by id, and
kernel side data structures for nexthops and nexthop config.

Nexthops are maintained in an rbtree sorted by id. Similar to routes,
nexthops are configured per namespace using netns_nexthop struct added
to struct net.

Nexthop notifications are sent when a nexthop is added or deleted,
but NOT if the delete is due to a device event or network namespace
teardown (which also involves device events). Applications are
expected to use the device down event to flush nexthops and any
routes used by the nexthops.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 21:37:30 -07:00
Eric Dumazet
dc93f46bc4 inet: frags: fix use-after-free read in inet_frag_destroy_rcu
As caught by syzbot [1], the rcu grace period that is respected
before fqdir_rwork_fn() proceeds and frees fqdir is not enough
to prevent inet_frag_destroy_rcu() being run after the freeing.

We need a proper rcu_barrier() synchronization to replace
the one we had in inet_frags_fini()

We also have to fix a potential problem at module removal :
inet_frags_fini() needs to make sure that all queued work queues
(fqdir_rwork_fn) have completed, otherwise we might
call kmem_cache_destroy() too soon and get another use-after-free.

[1]
BUG: KASAN: use-after-free in inet_frag_destroy_rcu+0xd9/0xe0 net/ipv4/inet_fragment.c:201
Read of size 8 at addr ffff88806ed47a18 by task swapper/1/0

CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.2.0-rc1+ #2
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 kasan_report+0x12/0x20 mm/kasan/common.c:614
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
 inet_frag_destroy_rcu+0xd9/0xe0 net/ipv4/inet_fragment.c:201
 __rcu_reclaim kernel/rcu/rcu.h:222 [inline]
 rcu_do_batch kernel/rcu/tree.c:2092 [inline]
 invoke_rcu_callbacks kernel/rcu/tree.c:2310 [inline]
 rcu_core+0xba5/0x1500 kernel/rcu/tree.c:2291
 __do_softirq+0x25c/0x94c kernel/softirq.c:293
 invoke_softirq kernel/softirq.c:374 [inline]
 irq_exit+0x180/0x1d0 kernel/softirq.c:414
 exiting_irq arch/x86/include/asm/apic.h:536 [inline]
 smp_apic_timer_interrupt+0x13b/0x550 arch/x86/kernel/apic/apic.c:1068
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:806
 </IRQ>
RIP: 0010:native_safe_halt+0xe/0x10 arch/x86/include/asm/irqflags.h:61
Code: ff ff 48 89 df e8 f2 95 8c fa eb 82 e9 07 00 00 00 0f 00 2d e4 45 4b 00 f4 c3 66 90 e9 07 00 00 00 0f 00 2d d4 45 4b 00 fb f4 <c3> 90 55 48 89 e5 41 57 41 56 41 55 41 54 53 e8 8e 18 42 fa e8 99
RSP: 0018:ffff8880a98e7d78 EFLAGS: 00000282 ORIG_RAX: ffffffffffffff13
RAX: 1ffffffff1164e11 RBX: ffff8880a98d4340 RCX: 0000000000000000
RDX: dffffc0000000000 RSI: 0000000000000006 RDI: ffff8880a98d4bbc
RBP: ffff8880a98e7da8 R08: ffff8880a98d4340 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
R13: ffffffff88b27078 R14: 0000000000000001 R15: 0000000000000000
 arch_cpu_idle+0xa/0x10 arch/x86/kernel/process.c:571
 default_idle_call+0x36/0x90 kernel/sched/idle.c:94
 cpuidle_idle_call kernel/sched/idle.c:154 [inline]
 do_idle+0x377/0x560 kernel/sched/idle.c:263
 cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:354
 start_secondary+0x34e/0x4c0 arch/x86/kernel/smpboot.c:267
 secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:243

Allocated by task 8877:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_kmalloc mm/kasan/common.c:489 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503
 kmem_cache_alloc_trace+0x151/0x750 mm/slab.c:3555
 kmalloc include/linux/slab.h:547 [inline]
 kzalloc include/linux/slab.h:742 [inline]
 fqdir_init include/net/inet_frag.h:115 [inline]
 ipv6_frags_init_net+0x48/0x460 net/ipv6/reassembly.c:513
 ops_init+0xb3/0x410 net/core/net_namespace.c:130
 setup_net+0x2d3/0x740 net/core/net_namespace.c:316
 copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439
 create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
 unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
 ksys_unshare+0x440/0x980 kernel/fork.c:2692
 __do_sys_unshare kernel/fork.c:2760 [inline]
 __se_sys_unshare kernel/fork.c:2758 [inline]
 __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 17:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
 __cache_free mm/slab.c:3432 [inline]
 kfree+0xcf/0x220 mm/slab.c:3755
 fqdir_rwork_fn+0x33/0x40 net/ipv4/inet_fragment.c:154
 process_one_work+0x989/0x1790 kernel/workqueue.c:2269
 worker_thread+0x98/0xe40 kernel/workqueue.c:2415
 kthread+0x354/0x420 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

The buggy address belongs to the object at ffff88806ed47a00
 which belongs to the cache kmalloc-512 of size 512
The buggy address is located 24 bytes inside of
 512-byte region [ffff88806ed47a00, ffff88806ed47c00)
The buggy address belongs to the page:
page:ffffea0001bb51c0 refcount:1 mapcount:0 mapping:ffff8880aa400940 index:0x0
flags: 0x1fffc0000000200(slab)
raw: 01fffc0000000200 ffffea000282a788 ffffea0001bb53c8 ffff8880aa400940
raw: 0000000000000000 ffff88806ed47000 0000000100000006 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff88806ed47900: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88806ed47980: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
>ffff88806ed47a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                            ^
 ffff88806ed47a80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff88806ed47b00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 3c8fc87820 ("inet: frags: rework rhashtable dismantle")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 17:22:15 -07:00
Eric Dumazet
6b73d19711 inet: frags: uninline fqdir_init()
fqdir_init() is not fast path and is getting bigger.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-28 17:22:15 -07:00
Chaitanya Tata
a3ce17d149 cfg80211: Handle bss expiry during connection
If the BSS is expired during connection, the connect result will
trigger a kernel warning. Ideally cfg80211 should hold the BSS
before the connection is attempted, but as the BSSID is not known
in case of auth/assoc MLME offload (connect op) it doesn't.

For those drivers without the connect op cfg80211 holds down the
reference so it wil not be removed from list.

Fix this by removing the warning and silently adding the BSS back to
the bss list which is return by the driver (with proper BSSID set) or
in case the BSS is already added use that.

The requirements for drivers are documented in the API's.

Signed-off-by: Chaitanya Tata <chaitanya.tata@bluwireless.co.uk>
[formatting fixes, keep old timestamp]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-05-28 09:35:39 +02:00
Eric Dumazet
3c8fc87820 inet: frags: rework rhashtable dismantle
syszbot found an interesting use-after-free [1] happening
while IPv4 fragment rhashtable was destroyed at netns dismantle.

While no insertions can possibly happen at the time a dismantling
netns is destroying this rhashtable, timers can still fire and
attempt to remove elements from this rhashtable.

This is forbidden, since rhashtable_free_and_destroy() has
no synchronization against concurrent inserts and deletes.

Add a new fqdir->dead flag so that timers do not attempt
a rhashtable_remove_fast() operation.

We also have to respect an RCU grace period before starting
the rhashtable_free_and_destroy() from process context,
thus we use rcu_work infrastructure.

This is a refinement of a prior rough attempt to fix this bug :
https://marc.info/?l=linux-netdev&m=153845936820900&w=2

Since the rhashtable cleanup is now deferred to a work queue,
netns dismantles should be slightly faster.

[1]
BUG: KASAN: use-after-free in __read_once_size include/linux/compiler.h:194 [inline]
BUG: KASAN: use-after-free in rhashtable_last_table+0x162/0x180 lib/rhashtable.c:212
Read of size 8 at addr ffff8880a6497b70 by task kworker/0:0/5

CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.2.0-rc1+ #2
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Workqueue: events rht_deferred_worker
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:188
 __kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 kasan_report+0x12/0x20 mm/kasan/common.c:614
 __asan_report_load8_noabort+0x14/0x20 mm/kasan/generic_report.c:132
 __read_once_size include/linux/compiler.h:194 [inline]
 rhashtable_last_table+0x162/0x180 lib/rhashtable.c:212
 rht_deferred_worker+0x111/0x2030 lib/rhashtable.c:411
 process_one_work+0x989/0x1790 kernel/workqueue.c:2269
 worker_thread+0x98/0xe40 kernel/workqueue.c:2415
 kthread+0x354/0x420 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

Allocated by task 32687:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_kmalloc mm/kasan/common.c:489 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:462
 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:503
 __do_kmalloc_node mm/slab.c:3620 [inline]
 __kmalloc_node+0x4e/0x70 mm/slab.c:3627
 kmalloc_node include/linux/slab.h:590 [inline]
 kvmalloc_node+0x68/0x100 mm/util.c:431
 kvmalloc include/linux/mm.h:637 [inline]
 kvzalloc include/linux/mm.h:645 [inline]
 bucket_table_alloc+0x90/0x480 lib/rhashtable.c:178
 rhashtable_init+0x3f4/0x7b0 lib/rhashtable.c:1057
 inet_frags_init_net include/net/inet_frag.h:109 [inline]
 ipv4_frags_init_net+0x182/0x410 net/ipv4/ip_fragment.c:683
 ops_init+0xb3/0x410 net/core/net_namespace.c:130
 setup_net+0x2d3/0x740 net/core/net_namespace.c:316
 copy_net_ns+0x1df/0x340 net/core/net_namespace.c:439
 create_new_namespaces+0x400/0x7b0 kernel/nsproxy.c:107
 unshare_nsproxy_namespaces+0xc2/0x200 kernel/nsproxy.c:206
 ksys_unshare+0x440/0x980 kernel/fork.c:2692
 __do_sys_unshare kernel/fork.c:2760 [inline]
 __se_sys_unshare kernel/fork.c:2758 [inline]
 __x64_sys_unshare+0x31/0x40 kernel/fork.c:2758
 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

Freed by task 7:
 save_stack+0x23/0x90 mm/kasan/common.c:71
 set_track mm/kasan/common.c:79 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:451
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:459
 __cache_free mm/slab.c:3432 [inline]
 kfree+0xcf/0x220 mm/slab.c:3755
 kvfree+0x61/0x70 mm/util.c:460
 bucket_table_free+0x69/0x150 lib/rhashtable.c:108
 rhashtable_free_and_destroy+0x165/0x8b0 lib/rhashtable.c:1155
 inet_frags_exit_net+0x3d/0x50 net/ipv4/inet_fragment.c:152
 ipv4_frags_exit_net+0x73/0x90 net/ipv4/ip_fragment.c:695
 ops_exit_list.isra.0+0xaa/0x150 net/core/net_namespace.c:154
 cleanup_net+0x3fb/0x960 net/core/net_namespace.c:553
 process_one_work+0x989/0x1790 kernel/workqueue.c:2269
 worker_thread+0x98/0xe40 kernel/workqueue.c:2415
 kthread+0x354/0x420 kernel/kthread.c:255
 ret_from_fork+0x24/0x30 arch/x86/entry/entry_64.S:352

The buggy address belongs to the object at ffff8880a6497b40
 which belongs to the cache kmalloc-1k of size 1024
The buggy address is located 48 bytes inside of
 1024-byte region [ffff8880a6497b40, ffff8880a6497f40)
The buggy address belongs to the page:
page:ffffea0002992580 refcount:1 mapcount:0 mapping:ffff8880aa400ac0 index:0xffff8880a64964c0 compound_mapcount: 0
flags: 0x1fffc0000010200(slab|head)
raw: 01fffc0000010200 ffffea0002916e88 ffffea000218fe08 ffff8880aa400ac0
raw: ffff8880a64964c0 ffff8880a6496040 0000000100000005 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff8880a6497a00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8880a6497a80: fb fb fb fb fb fb fb fb fc fc fc fc fc fc fc fc
>ffff8880a6497b00: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
                                                             ^
 ffff8880a6497b80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
 ffff8880a6497c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

Fixes: 648700f76b ("inet: frags: use rhashtables for reassembly units")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:05 -07:00
Eric Dumazet
4907abc605 net: dynamically allocate fqdir structures
Following patch will add rcu grace period before fqdir
rhashtable destruction, so we need to dynamically allocate
fqdir structures to not force expensive synchronize_rcu() calls
in netns dismantle path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:05 -07:00
Eric Dumazet
a39aca678a net: add a net pointer to struct fqdir
fqdir will soon be dynamically allocated.

We need to reach the struct net pointer from fqdir,
so add it, and replace the various container_of() constructs
by direct access to the new field.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:05 -07:00
Eric Dumazet
9cce45f22c net: rename inet_frags_init_net() to fdir_init()
And pass an extra parameter, since we will soon
dynamically allocate fqdir structures.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:05 -07:00
Eric Dumazet
803fdd9968 net: rename struct fqdir fields
Rename the @frags fields from structs netns_ipv4, netns_ipv6,
netns_nf_frag and netns_ieee802154_lowpan to @fqdir

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:05 -07:00
Eric Dumazet
89fb900514 net: rename inet_frags_exit_net() to fqdir_exit()
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:04 -07:00
Eric Dumazet
6ce3b4dcee inet: rename netns_frags to fqdir
1) struct netns_frags is renamed to struct fqdir
  This structure is really holding many frag queues in a hash table.

2) (struct inet_frag_queue)->net field is renamed to fqdir
  since net is generally associated to a 'struct net' pointer
  in networking stack.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-26 14:08:04 -07:00
David Ahern
1cf844c747 ipv6: Make fib6_nh optional at the end of fib6_info
Move fib6_nh to the end of fib6_info and make it an array of
size 0. Pass a flag to fib6_info_alloc indicating if the
allocation needs to add space for a fib6_nh.

The current code path always has a fib6_nh allocated with a
fib6_info; with nexthop objects they will be separate.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-24 13:26:44 -07:00
David Ahern
cc5c073a69 ipv6: Move exception bucket to fib6_nh
Similar to the pcpu routes exceptions are really per nexthop, so move
rt6i_exception_bucket from fib6_info to fib6_nh.

To avoid additional increases to the size of fib6_nh for a 1-bit flag,
use the lowest bit in the allocated memory pointer for the flushed flag.
Add helpers for retrieving the bucket pointer to mask off the flag.

The cleanup of the exception bucket is moved to fib6_nh_release.

fib6_nh_flush_exceptions can now be called from 2 contexts:
1. deleting a fib entry
2. deleting a fib6_nh

For 1., fib6_nh_flush_exceptions is called for a specific fib6_info that
is getting deleted. All exceptions in the cache using the entry are
deleted. For 2, the fib6_nh itself is getting destroyed so
fib6_nh_flush_exceptions is called for a NULL fib6_info which means
flush all entries.

The pmtu.sh selftest exercises the affected code paths - from creating
exceptions to cleaning them up on device delete. All tests pass without
any rcu locking or memleak warnings.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-24 13:26:44 -07:00
David Ahern
f40b6ae2b6 ipv6: Move pcpu cached routes to fib6_nh
rt6_info are specific instances of a fib entry and are tied to a
device and gateway - ie., a nexthop. Before nexthop objects, IPv6 fib
entries have separate fib6_info for each nexthop in a multipath route,
so the location of the pcpu cache in the fib6_info struct worked.
However, with nexthop objects a fib6_info can point to a set of nexthops
(yet another alignment of ipv6 with ipv4). Accordingly, the pcpu
cache needs to be moved to the fib6_nh struct so the cached entries
are local to the nexthop specification used to create the rt6_info.

Initialization and free of the pcpu entries moved to fib6_nh_init and
fib6_nh_release.

Change in location only, from fib6_info down to fib6_nh; no other
functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-24 13:26:44 -07:00
Thomas Gleixner
47505b8bcf treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 104
Based on 1 normalized pattern(s):

  this sctp implementation is free software you can redistribute it
  and or modify it under the terms of the gnu general public license
  as published by the free software foundation either version 2 or at
  your option any later version this sctp implementation is
  distributed in the hope that it will be useful but without any
  warranty without even the implied warranty of merchantability or
  fitness for a particular purpose see the gnu general public license
  for more details you should have received a copy of the gnu general
  public license along with gnu cc see the file copying if not see
  http www gnu org licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 42 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190523091649.683323110@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:39:00 +02:00
Thomas Gleixner
1924af0445 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 101
Based on 1 normalized pattern(s):

  the sctp reference implementation is free software you can
  redistribute it and or modify it under the terms of the gnu general
  public license as published by the free software foundation either
  version 2 or at your option any later version the sctp reference
  implementation is distributed in the hope that it will be useful but
  without any warranty without even the implied warranty of
  merchantability or fitness for a particular purpose see the gnu
  general public license for more details you should have received a
  copy of the gnu general public license along with gnu cc see the
  file copying if not see http www gnu org licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 1 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190523091649.408473526@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:38:59 +02:00
Thomas Gleixner
9e567af4f0 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 90
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details you
  should have received a copy of the gnu general public license along
  with this program if not write to the free software foundation inc
  59 temple place suite 330 boston ma 02111 1307 usa the full gnu
  general public license is included in this distribution in the file
  called license

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 4 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Richard Fontana <rfontana@redhat.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190520075211.959886972@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-24 17:37:53 +02:00
Manikanta Pubbisetty
33d915d9e8 {nl,mac}80211: allow 4addr AP operation on crypto controlled devices
As per the current design, in the case of sw crypto controlled devices,
it is the device which advertises the support for AP/VLAN iftype based
on it's ability to tranmsit packets encrypted in software
(In VLAN functionality, group traffic generated for a specific
VLAN group is always encrypted in software). Commit db3bdcb9c3
("mac80211: allow AP_VLAN operation on crypto controlled devices")
has introduced this change.

Since 4addr AP operation also uses AP/VLAN iftype, this conditional
way of advertising AP/VLAN support has broken 4addr AP mode operation on
crypto controlled devices which do not support VLAN functionality.

In the case of ath10k driver, not all firmwares have support for VLAN
functionality but all can support 4addr AP operation. Because AP/VLAN
support is not advertised for these devices, 4addr AP operations are
also blocked.

Fix this by allowing 4addr operation on devices which do not support
AP/VLAN iftype but can support 4addr AP operation (decision is based on
the wiphy flag WIPHY_FLAG_4ADDR_AP).

Cc: stable@vger.kernel.org
Fixes: db3bdcb9c3 ("mac80211: allow AP_VLAN operation on crypto controlled devices")
Signed-off-by: Manikanta Pubbisetty <mpubbise@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-05-24 11:26:54 +02:00
David S. Miller
71e15f76f8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for your net tree:

1) Fix crash when dumping rules after conversion to RCU,
   from Florian Westphal.

2) Fix incorrect hook reinjection from nf_queue in case NF_REPEAT,
   from Jagdish Motwani.

3) Fix check for route existence in fib extension, from Phil Sutter.

4) Fix use after free in ip_vs_in() hook, from YueHaibing.

5) Check for veth existence from netfilter selftests,
   from Jeffrin Jose T.

6) Checksum corruption in UDP NAT helpers due to typo,
   from Florian Westphal.

7) Pass up packets to classic forwarding path regardless of
   IPv4 DF bit, patch for the flowtable infrastructure from Florian.

8) Set liberal TCP tracking for flows that are placed in the
   flowtable, in case they need to go back to classic forwarding path,
   also from Florian.

9) Don't add flow with sequence adjustment to flowtable, from Florian.

10) Skip IPv4 options from IPv6 datapath in flowtable, from Florian.

11) Add selftest for the flowtable infrastructure, from Florian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-23 14:45:36 -07:00
Jiri Pirko
136bf27fc0 devlink: add warning in case driver does not set port type
Prevent misbehavior of drivers who would not set port type for longer
period of time. Drivers should always set port type. Do WARN if that
happens.

Note that it is perfectly fine to temporarily not have the type set,
during initialization and port type change.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-23 09:18:43 -07:00
David Ahern
75425657fe net: Set strict_start_type for routes and rules
New userspace on an older kernel can send unknown and unsupported
attributes resulting in an incompelete config which is almost
always wrong for routing (few exceptions are passthrough settings
like the protocol that installed the route).

Set strict_start_type in the policies for IPv4 and IPv6 routes and
rules to detect new, unsupported attributes and fail the route add.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:50:24 -07:00
David Ahern
06c77c3e67 ipv4: Rename and export nh_update_mtu
Rename nh_update_mtu to fib_nhc_update_mtu and export for use by the
nexthop code.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:48:44 -07:00
David Ahern
c3669486b5 ipv4: export fib_info_update_nh_saddr
Add scope as input argument versus relying on fib_info reference in
fib_nh, and export fib_info_update_nh_saddr.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:48:44 -07:00
David Ahern
9bd8366792 ipv4: export fib_flush
As nexthops are deleted, fib entries referencing it are marked dead.
Export fib_flush so those entries can be removed in a timely manner.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:48:44 -07:00
David Ahern
ac1fab2d13 ipv4: export fib_check_nh
Change fib_check_nh to take net, table and scope as input arguments
over struct fib_config and export for use by nexthop code.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:48:44 -07:00
David Ahern
1bff1a0c9b ipv4: Add function to send route updates
Add fib_info_notify_update to walk the fib and send RTM_NEWROUTE
notifications with NLM_F_REPLACE set for entries linked to a fib_info
that have nh_updated flag set. This helper will be used by the nexthop
code to notify userspace of routes that are impacted when a nexthop
config is updated via replace. The new function and its helper are
similar to how fib_flush and fib_table_flush work for address delete
and link down events.

This notification is needed for legacy apps that do not understand
the new nexthop object. Apps that are nexthop aware can use the
RTA_NH_ID attribute in the route notification to just ignore it.

In the future this should be wrapped in a sysctl to allow OS'es that
are fully updated to avoid the notificaton storm.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:48:44 -07:00
David Ahern
19a3b7eea4 ipv6: export function to send route updates
Add fib6_rt_update to send RTM_NEWROUTE with NLM_F_REPLACE set. This
helper will be used by the nexthop code to notify userspace of routes
that are impacted when a nexthop config is updated via replace.

This notification is needed for legacy apps that do not understand
the new nexthop object. Apps that are nexthop aware can use the
RTA_NH_ID attribute in the route notification to just ignore it.

In the future this should be wrapped in a sysctl to allow OS'es that
are fully updated to avoid the notificaton storm.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:48:43 -07:00
David Ahern
cdaa16a4f7 ipv6: Add hook to bump sernum for a route to stubs
Add hook to ipv6 stub to bump the sernum up to the root node for a
route. This is needed by the nexthop code when a nexthop config changes.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:48:43 -07:00
David Ahern
68a9b13d92 ipv6: Add delete route hook to stubs
Add ip6_del_rt to the IPv6 stub. The hook is needed by the nexthop
code to remove entries linked to a nexthop that is getting deleted.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-22 17:48:43 -07:00
Linus Torvalds
2c1212de6f SPDX update for 5.2-rc2, round 1
Here are series of patches that add SPDX tags to different kernel files,
 based on two different things:
   - SPDX entries are added to a bunch of files that we missed a year ago
     that do not have any license information at all.
 
     These were either missed because the tool saw the MODULE_LICENSE()
     tag, or some EXPORT_SYMBOL tags, and got confused and thought the
     file had a real license, or the files have been added since the last
     big sweep, or they were Makefile/Kconfig files, which we didn't
     touch last time.
 
   - Add GPL-2.0-only or GPL-2.0-or-later tags to files where our scan
     tools can determine the license text in the file itself.  Where this
     happens, the license text is removed, in order to cut down on the
     700+ different ways we have in the kernel today, in a quest to get
     rid of all of these.
 
 These patches have been out for review on the linux-spdx@vger mailing
 list, and while they were created by automatic tools, they were
 hand-verified by a bunch of different people, all whom names are on the
 patches are reviewers.
 
 The reason for these "large" patches is if we were to continue to
 progress at the current rate of change in the kernel, adding license
 tags to individual files in different subsystems, we would be finished
 in about 10 years at the earliest.
 
 There will be more series of these types of patches coming over the next
 few weeks as the tools and reviewers crunch through the more "odd"
 variants of how to say "GPLv2" that developers have come up with over
 the years, combined with other fun oddities (GPL + a BSD disclaimer?)
 that are being unearthed, with the goal for the whole kernel to be
 cleaned up.
 
 These diffstats are not small, 3840 files are touched, over 10k lines
 removed in just 24 patches.
 
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 -----BEGIN PGP SIGNATURE-----
 
 iG0EABECAC0WIQT0tgzFv3jCIUoxPcsxR9QN2y37KQUCXOP8uw8cZ3JlZ0Brcm9h
 aC5jb20ACgkQMUfUDdst+ynmGQCgy3evqzleuOITDpuWaxewFdHqiJYAnA7KRw4H
 1KwtfRnMtG6dk/XaS7H7
 =O9lH
 -----END PGP SIGNATURE-----

Merge tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core

Pull SPDX update from Greg KH:
 "Here is a series of patches that add SPDX tags to different kernel
  files, based on two different things:

   - SPDX entries are added to a bunch of files that we missed a year
     ago that do not have any license information at all.

     These were either missed because the tool saw the MODULE_LICENSE()
     tag, or some EXPORT_SYMBOL tags, and got confused and thought the
     file had a real license, or the files have been added since the
     last big sweep, or they were Makefile/Kconfig files, which we
     didn't touch last time.

   - Add GPL-2.0-only or GPL-2.0-or-later tags to files where our scan
     tools can determine the license text in the file itself. Where this
     happens, the license text is removed, in order to cut down on the
     700+ different ways we have in the kernel today, in a quest to get
     rid of all of these.

  These patches have been out for review on the linux-spdx@vger mailing
  list, and while they were created by automatic tools, they were
  hand-verified by a bunch of different people, all whom names are on
  the patches are reviewers.

  The reason for these "large" patches is if we were to continue to
  progress at the current rate of change in the kernel, adding license
  tags to individual files in different subsystems, we would be finished
  in about 10 years at the earliest.

  There will be more series of these types of patches coming over the
  next few weeks as the tools and reviewers crunch through the more
  "odd" variants of how to say "GPLv2" that developers have come up with
  over the years, combined with other fun oddities (GPL + a BSD
  disclaimer?) that are being unearthed, with the goal for the whole
  kernel to be cleaned up.

  These diffstats are not small, 3840 files are touched, over 10k lines
  removed in just 24 patches"

* tag 'spdx-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (24 commits)
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 25
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 24
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 23
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 22
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 21
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 20
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 19
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 18
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 17
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 15
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 14
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 12
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 11
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 10
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 9
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 7
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 5
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 4
  treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 3
  ...
2019-05-21 12:33:38 -07:00
Phil Sutter
e633508a95 netfilter: nft_fib: Fix existence check support
NFTA_FIB_F_PRESENT flag was not always honored since eval functions did
not call nft_fib_store_result in all cases.

Given that in all callsites there is a struct net_device pointer
available which holds the interface data to be stored in destination
register, simplify nft_fib_store_result() to just accept that pointer
instead of the nft_pktinfo pointer and interface index. This also
allows to drop the index to interface lookup previously needed to get
the name associated with given index.

Fixes: 055c4b34b9 ("netfilter: nft_fib: Support existence check")
Signed-off-by: Phil Sutter <phil@nwl.cc>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-05-21 16:10:38 +02:00
Thomas Gleixner
fd9871f70c treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 24
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or any
  later version this program is distributed in the hope that it will
  be useful but without any warranty without even the implied warranty
  of merchantability or fitness for a particular purpose see the gnu
  general public license for more details

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 50 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154042.917228456@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 11:52:39 +02:00
Thomas Gleixner
1ccea77e2a treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 13
Based on 2 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details you
  should have received a copy of the gnu general public license along
  with this program if not see http www gnu org licenses

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details [based]
  [from] [clk] [highbank] [c] you should have received a copy of the
  gnu general public license along with this program if not see http
  www gnu org licenses

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 355 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154041.837383322@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 11:28:45 +02:00
Thomas Gleixner
0a65089ec5 treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 5
Based on 1 normalized pattern(s):

  this program is free software you can redistribute it and or modify
  it under the terms of the gnu general public license as published by
  the free software foundation either version 2 of the license or at
  your option any later version this program is distributed in the
  hope that it will be useful but without any warranty without even
  the implied warranty of merchantability or fitness for a particular
  purpose see the gnu general public license for more details you
  should have received a copy of the gnu general public license along
  with this program if not see http www gnu org licenses the full gnu
  general public license is included in this distribution in the file
  called license

extracted by the scancode license scanner the SPDX license identifier

  GPL-2.0-or-later

has been chosen to replace the boilerplate/reference in 3 file(s).

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Steve Winslow <swinslow@gmail.com>
Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org>
Reviewed-by: Jilayne Lovejoy <opensource@jilayne.com>
Reviewed-by: Allison Randal <allison@lohutok.net>
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190519154041.052102771@linutronix.de
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
2019-05-21 11:28:40 +02:00
Linus Torvalds
78e0365184 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:1) Use after free in __dev_map_entry_free(), from Eric Dumazet.

 1) Use after free in __dev_map_entry_free(), from Eric Dumazet.

 2) Fix TCP retransmission timestamps on passive Fast Open, from Yuchung
    Cheng.

 3) Orphan NFC, we'll take the patches directly into my tree. From
    Johannes Berg.

 4) We can't recycle cloned TCP skbs, from Eric Dumazet.

 5) Some flow dissector bpf test fixes, from Stanislav Fomichev.

 6) Fix RCU marking and warnings in rhashtable, from Herbert Xu.

 7) Fix some potential fib6 leaks, from Eric Dumazet.

 8) Fix a _decode_session4 uninitialized memory read bug fix that got
    lost in a merge. From Florian Westphal.

 9) Fix ipv6 source address routing wrt. exception route entries, from
    Wei Wang.

10) The netdev_xmit_more() conversion was not done %100 properly in mlx5
    driver, fix from Tariq Toukan.

11) Clean up botched merge on netfilter kselftest, from Florian
    Westphal.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (74 commits)
  of_net: fix of_get_mac_address retval if compiled without CONFIG_OF
  net: fix kernel-doc warnings for socket.c
  net: Treat sock->sk_drops as an unsigned int when printing
  kselftests: netfilter: fix leftover net/net-next merge conflict
  mlxsw: core: Prevent reading unsupported slave address from SFP EEPROM
  mlxsw: core: Prevent QSFP module initialization for old hardware
  vsock/virtio: Initialize core virtio vsock before registering the driver
  net/mlx5e: Fix possible modify header actions memory leak
  net/mlx5e: Fix no rewrite fields with the same match
  net/mlx5e: Additional check for flow destination comparison
  net/mlx5e: Add missing ethtool driver info for representors
  net/mlx5e: Fix number of vports for ingress ACL configuration
  net/mlx5e: Fix ethtool rxfh commands when CONFIG_MLX5_EN_RXNFC is disabled
  net/mlx5e: Fix wrong xmit_more application
  net/mlx5: Fix peer pf disable hca command
  net/mlx5: E-Switch, Correct type to u16 for vport_num and int for vport_index
  net/mlx5: Add meaningful return codes to status_to_err function
  net/mlx5: Imply MLXFW in mlx5_core
  Revert "tipc: fix modprobe tipc failed after switch order of device registration"
  vsock/virtio: free packets during the socket release
  ...
2019-05-20 08:21:07 -07:00
Linus Torvalds
227747fb9e AFS fixes
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAXN2BC/u3V2unywtrAQJm9w/+L7ufbRkj6XGVongmhf4n+auBQXMJ4jec
 zN6bjWrp/SN9kJfOqOKA+sk9s3cCOCV8SF/2eM5P8DJNtrB6aXlg590u1wSkOp99
 FdSM8Fy7v4bTwW9hCBhvcFpC+layVUEv/WAsCCIZi94W+H43XFY4QM79cqoqIx8r
 nTLu9EcjWFpUoBIAYEU0x/h4IA5Cyl6CUw3YZhZYaGoLLfi9EZkgBLlUU+6OXpDO
 Uepzn1gnpXMCNsiBE/Hr9LR0pfOTtzdJuNADrppRnbPfky8RsPE8tuk6kT6301U1
 IxG66SafYsvbQGzyIdfTydl022DFj5LOtCPFtfALviJqdBOGE/zPPnrBPinHg4oJ
 40P2tIJ/+Ksz5cPzmkA1KanSXaQ2v0sLBVdQJ7yt5EFuAMzj/roWpiPmEmQd6KqB
 ixZdZLehKFPaAB5cR41fHV1jB30HN7oakwqCoYmXd1Chu3AlB15yV9WZMSqjPS8P
 pkNC/X5mU5hDnZUx9e3Fbu8LqoGOjnGvDn5jOxihdKfaGu3A4OlbSerIUbRHvnT8
 u8XDPoq4j61f04MiI9z/bPDFTRYyycIQPcHYQpi4MJt9lSkkydP217P60BJsUv2n
 NIPYwgI7VIse0Gdo8shIg+RnSnJaKHT9Sf86h8pyDFO6wZp/GVVqPSdjjU+Lv5fv
 CZGJ7PCYcfs=
 =2q2Y
 -----END PGP SIGNATURE-----

Merge tag 'afs-fixes-20190516' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

Pull misc AFS fixes from David Howells:
 "This fixes a set of miscellaneous issues in the afs filesystem,
  including:

   - leak of keys on file close.

   - broken error handling in xattr functions.

   - missing locking when updating VL server list.

   - volume location server DNS lookup whereby preloaded cells may not
     ever get a lookup and regular DNS lookups to maintain server lists
     consume power unnecessarily.

   - incorrect error propagation and handling in the fileserver
     iteration code causes operations to sometimes apparently succeed.

   - interruption of server record check/update side op during
     fileserver iteration causes uninterruptible main operations to fail
     unexpectedly.

   - callback promise expiry time miscalculation.

   - over invalidation of the callback promise on directories.

   - double locking on callback break waking up file locking waiters.

   - double increment of the vnode callback break counter.

  Note that it makes some changes outside of the afs code, including:

   - an extra parameter to dns_query() to allow the dns_resolver key
     just accessed to be immediately invalidated. AFS is caching the
     results itself, so the key can be discarded.

   - an interruptible version of wait_var_event().

   - an rxrpc function to allow the maximum lifespan to be set on a
     call.

   - a way for an rxrpc call to be marked as non-interruptible"

* tag 'afs-fixes-20190516' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
  afs: Fix double inc of vnode->cb_break
  afs: Fix lock-wait/callback-break double locking
  afs: Don't invalidate callback if AFS_VNODE_DIR_VALID not set
  afs: Fix calculation of callback expiry time
  afs: Make dynamic root population wait uninterruptibly for proc_cells_lock
  afs: Make some RPC operations non-interruptible
  rxrpc: Allow the kernel to mark a call as being non-interruptible
  afs: Fix error propagation from server record check/update
  afs: Fix the maximum lifespan of VL and probe calls
  rxrpc: Provide kernel interface to set max lifespan on a call
  afs: Fix "kAFS: AFS vnode with undefined type 0"
  afs: Fix cell DNS lookup
  Add wait_var_event_interruptible()
  dns_resolver: Allow used keys to be invalidated
  afs: Fix afs_cell records to always have a VL server list record
  afs: Fix missing lock when replacing VL server list
  afs: Fix afs_xattr_get_yfs() to not try freeing an error value
  afs: Fix incorrect error handling in afs_xattr_get_acl()
  afs: Fix key leak in afs_release() and afs_evict_inode()
2019-05-16 17:00:13 -07:00
Eric Dumazet
61fb0d0168 ipv6: prevent possible fib6 leaks
At ipv6 route dismantle, fib6_drop_pcpu_from() is responsible
for finding all percpu routes and set their ->from pointer
to NULL, so that fib6_ref can reach its expected value (1).

The problem right now is that other cpus can still catch the
route being deleted, since there is no rcu grace period
between the route deletion and call to fib6_drop_pcpu_from()

This can leak the fib6 and associated resources, since no
notifier will take care of removing the last reference(s).

I decided to add another boolean (fib6_destroying) instead
of reusing/renaming exception_bucket_flushed to ease stable backports,
and properly document the memory barriers used to implement this fix.

This patch has been co-developped with Wei Wang.

Fixes: 93531c6743 ("net/ipv6: separate handling of FIB entries from dst based routes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Wei Wang <weiwan@google.com>
Cc: David Ahern <dsahern@gmail.com>
Cc: Martin Lau <kafai@fb.com>
Acked-by: Wei Wang <weiwan@google.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-16 12:21:00 -07:00
Edward Cree
bae9ed6902 flow_offload: support CVLAN match
Plumb it through from the flow_dissector.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-16 12:02:42 -07:00
David Howells
b960a34b73 rxrpc: Allow the kernel to mark a call as being non-interruptible
Allow kernel services using AF_RXRPC to indicate that a call should be
non-interruptible.  This allows kafs to make things like lock-extension and
writeback data storage calls non-interruptible.

If this is set, signals will be ignored for operations on that call where
possible - such as waiting to get a call channel on an rxrpc connection.

It doesn't prevent UDP sendmsg from being interrupted, but that will be
handled by packet retransmission.

rxrpc_kernel_recv_data() isn't affected by this since that never waits,
preferring instead to return -EAGAIN and leave the waiting to the caller.

Userspace initiated calls can't be set to be uninterruptible at this time.

Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16 16:25:20 +01:00
David Howells
bbd172e316 rxrpc: Provide kernel interface to set max lifespan on a call
Provide an interface to set max lifespan on a call from inside of the
kernel without having to call kernel_sendmsg().

Signed-off-by: David Howells <dhowells@redhat.com>
2019-05-16 15:48:30 +01:00
Eric Dumazet
858f501744 tcp: do not recycle cloned skbs
It is illegal to change arbitrary fields in skb_shared_info if the
skb is cloned.

Before calling skb_zcopy_clear() we need to ensure this rule,
therefore we need to move the test from sk_stream_alloc_skb()
to sk_wmem_free_skb()

Fixes: 4f661542a4 ("tcp: fix zerocopy and notsent_lowat issues")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Diagnosed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-15 09:22:41 -07:00
Vladimir Oltean
1c9b1420ac net: dsa: Remove the now unused DSA_SKB_CB_COPY() macro
It's best to not expose this, due to the performance hit it may cause
when calling it.

Fixes: b68b0dd0fb ("net: dsa: Keep private info in the skb->cb")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-12 13:19:46 -07:00
Vladimir Oltean
506f0e09ce net: dsa: Remove dangerous DSA_SKB_CLONE() macro
This does not cause any bug now because it has no users, but its body
contains two pointer definitions within a code block:

		struct sk_buff *clone = _clone;	\
		struct sk_buff *skb = _skb;	\

When calling the macro as DSA_SKB_CLONE(clone, skb), these variables
would obscure the arguments that the macro was called with, and the
initializers would be a no-op instead of doing their job (undefined
behavior, by the way, but GCC nicely puts NULL pointers instead).

So simply remove this broken macro and leave users to simply call
"DSA_SKB_CB(skb)->clone = clone" by hand when needed.

There is one functional difference when doing what I just suggested
above: the control block won't be transferred from the original skb into
the clone. Since there's no foreseen need for the control block in the
clone ATM, this is ok.

Fixes: b68b0dd0fb ("net: dsa: Keep private info in the skb->cb")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-12 13:19:46 -07:00
Vladimir Oltean
8767137510 net: dsa: Initialize DSA_SKB_CB(skb)->deferred_xmit variable
The sk_buff control block can have any contents on xmit put there by the
stack, so initialization is mandatory, since we are checking its value
after the actual DSA xmit (the tagger may have changed it).

The DSA_SKB_ZERO() macro could have been used for this purpose, but:
- Zeroizing a 48-byte memory region in the hotpath is best avoided.
- It would have triggered a warning with newer compilers since
  __dsa_skb_cb contains a structure within a structure, and the {0}
  initializer was incorrect for that purpose.

So simply remove the DSA_SKB_ZERO() macro and initialize the
deferred_xmit variable by hand (which should be done for all further
dsa_skb_cb variables which need initialization - currently none - to
avoid the performance penalty).

Fixes: 97a69a0dea ("net: dsa: Add support for deferred xmit")
Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-12 13:19:46 -07:00
Jakub Kicinski
494bc1d281 net/tcp: use deferred jump label for TCP acked data hook
User space can flip the clean_acked_data_enabled static branch
on and off with TLS offload when CONFIG_TLS_DEVICE is enabled.
jump_label.h suggests we use the delayed version in this case.

Deferred branches now also don't take the branch mutex on
decrement, so we avoid potential locking issues.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-09 11:13:57 -07:00
Linus Torvalds
80f232121b Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   1) Support AES128-CCM ciphers in kTLS, from Vakul Garg.

   2) Add fib_sync_mem to control the amount of dirty memory we allow to
      queue up between synchronize RCU calls, from David Ahern.

   3) Make flow classifier more lockless, from Vlad Buslov.

   4) Add PHY downshift support to aquantia driver, from Heiner
      Kallweit.

   5) Add SKB cache for TCP rx and tx, from Eric Dumazet. This reduces
      contention on SLAB spinlocks in heavy RPC workloads.

   6) Partial GSO offload support in XFRM, from Boris Pismenny.

   7) Add fast link down support to ethtool, from Heiner Kallweit.

   8) Use siphash for IP ID generator, from Eric Dumazet.

   9) Pull nexthops even further out from ipv4/ipv6 routes and FIB
      entries, from David Ahern.

  10) Move skb->xmit_more into a per-cpu variable, from Florian
      Westphal.

  11) Improve eBPF verifier speed and increase maximum program size,
      from Alexei Starovoitov.

  12) Eliminate per-bucket spinlocks in rhashtable, and instead use bit
      spinlocks. From Neil Brown.

  13) Allow tunneling with GUE encap in ipvs, from Jacky Hu.

  14) Improve link partner cap detection in generic PHY code, from
      Heiner Kallweit.

  15) Add layer 2 encap support to bpf_skb_adjust_room(), from Alan
      Maguire.

  16) Remove SKB list implementation assumptions in SCTP, your's truly.

  17) Various cleanups, optimizations, and simplifications in r8169
      driver. From Heiner Kallweit.

  18) Add memory accounting on TX and RX path of SCTP, from Xin Long.

  19) Switch PHY drivers over to use dynamic featue detection, from
      Heiner Kallweit.

  20) Support flow steering without masking in dpaa2-eth, from Ioana
      Ciocoi.

  21) Implement ndo_get_devlink_port in netdevsim driver, from Jiri
      Pirko.

  22) Increase the strict parsing of current and future netlink
      attributes, also export such policies to userspace. From Johannes
      Berg.

  23) Allow DSA tag drivers to be modular, from Andrew Lunn.

  24) Remove legacy DSA probing support, also from Andrew Lunn.

  25) Allow ll_temac driver to be used on non-x86 platforms, from Esben
      Haabendal.

  26) Add a generic tracepoint for TX queue timeouts to ease debugging,
      from Cong Wang.

  27) More indirect call optimizations, from Paolo Abeni"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1763 commits)
  cxgb4: Fix error path in cxgb4_init_module
  net: phy: improve pause mode reporting in phy_print_status
  dt-bindings: net: Fix a typo in the phy-mode list for ethernet bindings
  net: macb: Change interrupt and napi enable order in open
  net: ll_temac: Improve error message on error IRQ
  net/sched: remove block pointer from common offload structure
  net: ethernet: support of_get_mac_address new ERR_PTR error
  net: usb: smsc: fix warning reported by kbuild test robot
  staging: octeon-ethernet: Fix of_get_mac_address ERR_PTR check
  net: dsa: support of_get_mac_address new ERR_PTR error
  net: dsa: sja1105: Fix status initialization in sja1105_get_ethtool_stats
  vrf: sit mtu should not be updated when vrf netdev is the link
  net: dsa: Fix error cleanup path in dsa_init_module
  l2tp: Fix possible NULL pointer dereference
  taprio: add null check on sched_nest to avoid potential null pointer dereference
  net: mvpp2: cls: fix less than zero check on a u32 variable
  net_sched: sch_fq: handle non connected flows
  net_sched: sch_fq: do not assume EDT packets are ordered
  net: hns3: use devm_kcalloc when allocating desc_cb
  net: hns3: some cleanup for struct hns3_enet_ring
  ...
2019-05-07 22:03:58 -07:00
David S. Miller
a9e41a5296 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor conflict with the DSA legacy code removal.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07 17:22:09 -07:00
Pieter Jansen van Vuuren
d6787147e1 net/sched: remove block pointer from common offload structure
Based on feedback from Jiri avoid carrying a pointer to the tcf_block
structure in the tc_cls_common_offload structure. Instead store
a flag in driver private data which indicates if offloads apply
to a shared block at block binding time.

Suggested-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-07 12:23:40 -07:00
Linus Torvalds
a0e928ed7c Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer updates from Ingo Molnar:
 "This cycle had the following changes:

   - Timer tracing improvements (Anna-Maria Gleixner)

   - Continued tasklet reduction work: remove the hrtimer_tasklet
     (Thomas Gleixner)

   - Fix CPU hotplug remove race in the tick-broadcast mask handling
     code (Thomas Gleixner)

   - Force upper bound for setting CLOCK_REALTIME, to fix ABI
     inconsistencies with handling values that are close to the maximum
     supported and the vagueness of when uptime related wraparound might
     occur. Make the consistent maximum the year 2232 across all
     relevant ABIs and APIs. (Thomas Gleixner)

   - various cleanups and smaller fixes"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  tick: Fix typos in comments
  tick/broadcast: Fix warning about undefined tick_broadcast_oneshot_offline()
  timekeeping: Force upper bound for setting CLOCK_REALTIME
  timer/trace: Improve timer tracing
  timer/trace: Replace deprecated vsprintf pointer extension %pf by %ps
  timer: Move trace point to get proper index
  tick/sched: Update tick_sched struct documentation
  tick: Remove outgoing CPU from broadcast masks
  timekeeping: Consistently use unsigned int for seqcount snapshot
  softirq: Remove tasklet_hrtimer
  xfrm: Replace hrtimer tasklet with softirq hrtimer
  mac80211_hwsim: Replace hrtimer tasklet with softirq hrtimer
2019-05-06 14:50:46 -07:00
Vladimir Oltean
227d07a07e net: dsa: sja1105: Add support for traffic through standalone ports
In order to support this, we are creating a make-shift switch tag out of
a VLAN trunk configured on the CPU port. Termination of normal traffic
on switch ports only works when not under a vlan_filtering bridge.
Termination of management (PTP, BPDU) traffic works under all
circumstances because it uses a different tagging mechanism
(incl_srcpt). We are making use of the generic CONFIG_NET_DSA_TAG_8021Q
code and leveraging it from our own CONFIG_NET_DSA_TAG_SJA1105.

There are two types of traffic: regular and link-local.

The link-local traffic received on the CPU port is trapped from the
switch's regular forwarding decisions because it matched one of the two
DMAC filters for management traffic.

On transmission, the switch requires special massaging for these
link-local frames. Due to a weird implementation of the switching IP, by
default it drops link-local frames that originate on the CPU port.
It needs to be told where to forward them to, through an SPI command
("management route") that is valid for only a single frame.
So when we're sending link-local traffic, we are using the
dsa_defer_xmit mechanism.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:52:42 -07:00
Vladimir Oltean
c362beb072 net: dsa: Add a private structure pointer to dsa_port
This is supposed to share information between the driver and the tagger,
or used by the tagger to keep some state. Its use is optional.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:52:42 -07:00
Vladimir Oltean
97a69a0dea net: dsa: Add support for deferred xmit
Some hardware needs to take work to get convinced to receive frames on
the CPU port (such as the sja1105 which takes temporary L2 forwarding
rules over SPI that last for a single frame). Such work needs a
sleepable context, and because the regular .ndo_start_xmit is atomic,
this cannot be done in the tagger. So introduce a generic DSA mechanism
that sets up a transmit skb queue and a workqueue for deferred
transmission.

The new driver callback (.port_deferred_xmit) is in dsa_switch and not
in the tagger because the operations that require sleeping typically
also involve interacting with the hardware, and not simply skb
manipulations. Therefore having it there simplifies the structure a bit
and makes it unnecessary to export functions from the driver to the
tagger.

The driver is responsible of calling dsa_enqueue_skb which transfers it
to the master netdevice. This is so that it has a chance of performing
some more work afterwards, such as cleanup or TX timestamping.

To tell DSA that skb xmit deferral is required, I have thought about
changing the return type of the tagger .xmit from struct sk_buff * into
a enum dsa_tx_t that could potentially encode a DSA_XMIT_DEFER value.

But the trailer tagger is reallocating every skb on xmit and therefore
making a valid use of the pointer return value. So instead of reworking
the API in complicated ways, right now a boolean property in the newly
introduced DSA_SKB_CB is set.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:52:42 -07:00
Vladimir Oltean
b68b0dd0fb net: dsa: Keep private info in the skb->cb
Map a DSA structure over the 48-byte control block that will hold
skb info on transmit and receive. This is only for use within the DSA
processing layer (e.g. communicating between DSA core and tagger) and
not for passing info around with other layers such as the master net
device.

Also add a DSA_SKB_CB_PRIV() macro which retrieves a pointer to the
space up to 48 bytes that the DSA structure does not use. This space can
be used for drivers to add their own private info.

One use is for the PTP timestamping code path. When cloning a skb,
annotate the original with a pointer to the clone, which the driver can
then find easily and place the timestamp to. This avoids the need of a
separate queue to hold clones and a way to match an original to a cloned
skb.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:52:42 -07:00
Vladimir Oltean
cc1939e4b3 net: dsa: Allow drivers to filter packets they can decode source port from
Frames get processed by DSA and redirected to switch port net devices
based on the ETH_P_XDSA multiplexed packet_type handler found by the
network stack when calling eth_type_trans().

The running assumption is that once the DSA .rcv function is called, DSA
is always able to decode the switch tag in order to change the skb->dev
from its master.

However there are tagging protocols (such as the new DSA_TAG_PROTO_SJA1105,
user of DSA_TAG_PROTO_8021Q) where this assumption is not completely
true, since switch tagging piggybacks on the absence of a vlan_filtering
bridge. Moreover, management traffic (BPDU, PTP) for this switch doesn't
rely on switch tagging, but on a different mechanism. So it would make
sense to at least be able to terminate that.

Having DSA receive traffic it can't decode would put it in an impossible
situation: the eth_type_trans() function would invoke the DSA .rcv(),
which could not change skb->dev, then eth_type_trans() would be invoked
again, which again would call the DSA .rcv, and the packet would never
be able to exit the DSA filter and would spiral in a loop until the
whole system dies.

This happens because eth_type_trans() doesn't actually look at the skb
(so as to identify a potential tag) when it deems it as being
ETH_P_XDSA. It just checks whether skb->dev has a DSA private pointer
installed (therefore it's a DSA master) and that there exists a .rcv
callback (everybody except DSA_TAG_PROTO_NONE has that). This is
understandable as there are many switch tags out there, and exhaustively
checking for all of them is far from ideal.

The solution lies in introducing a filtering function for each tagging
protocol. In the absence of a filtering function, all traffic is passed
to the .rcv DSA callback. The tagging protocol should see the filtering
function as a pre-validation that it can decode the incoming skb. The
traffic that doesn't match the filter will bypass the DSA .rcv callback
and be left on the master netdevice, which wasn't previously possible.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:52:42 -07:00
Vladimir Oltean
f9bbe4477c net: dsa: Optional VLAN-based port separation for switches without tagging
This patch provides generic DSA code for using VLAN (802.1Q) tags for
the same purpose as a dedicated switch tag for injection/extraction.
It is based on the discussions and interest that has been so far
expressed in https://www.spinics.net/lists/netdev/msg556125.html.

Unlike all other DSA-supported tagging protocols, CONFIG_NET_DSA_TAG_8021Q
does not offer a complete solution for drivers (nor can it). Instead, it
provides generic code that driver can opt into calling:
- dsa_8021q_xmit: Inserts a VLAN header with the specified contents.
  Can be called from another tagging protocol's xmit function.
  Currently the LAN9303 driver is inserting headers that are simply
  802.1Q with custom fields, so this is an opportunity for code reuse.
- dsa_8021q_rcv: Retrieves the TPID and TCI from a VLAN-tagged skb.
  Removing the VLAN header is left as a decision for the caller to make.
- dsa_port_setup_8021q_tagging: For each user port, installs an Rx VID
  and a Tx VID, for proper untagged traffic identification on ingress
  and steering on egress. Also sets up the VLAN trunk on the upstream
  (CPU or DSA) port. Drivers are intentionally left to call this
  function explicitly, depending on the context and hardware support.
  The expected switch behavior and VLAN semantics should not be violated
  under any conditions. That is, after calling
  dsa_port_setup_8021q_tagging, the hardware should still pass all
  ingress traffic, be it tagged or untagged.

For uniformity with the other tagging protocols, a module for the
dsa_8021q_netdev_ops structure is registered, but the typical usage is
to set up another tagging protocol which selects CONFIG_NET_DSA_TAG_8021Q,
and calls the API from tag_8021q.h. Null function definitions are also
provided so that a "depends on" is not forced in the Kconfig.

This tagging protocol only works when switch ports are standalone, or
when they are added to a VLAN-unaware bridge. It will probably remain
this way for the reasons below.

When added to a bridge that has vlan_filtering 1, the bridge core will
install its own VLANs and reset the pvids through switchdev. For the
bridge core, switchdev is a write-only pipe. All VLAN-related state is
kept in the bridge core and nothing is read from DSA/switchdev or from
the driver. So the bridge core will break this port separation because
it will install the vlan_default_pvid into all switchdev ports.

Even if we could teach the bridge driver about switchdev preference of a
certain vlan_default_pvid (task difficult in itself since the current
setting is per-bridge but we would need it per-port), there would still
exist many other challenges.

Firstly, in the DSA rcv callback, a driver would have to perform an
iterative reverse lookup to find the correct switch port. That is
because the port is a bridge slave, so its Rx VID (port PVID) is subject
to user configuration. How would we ensure that the user doesn't reset
the pvid to a different value (which would make an O(1) translation
impossible), or to a non-unique value within this DSA switch tree (which
would make any translation impossible)?

Finally, not all switch ports are equal in DSA, and that makes it
difficult for the bridge to be completely aware of this anyway.
The CPU port needs to transmit tagged packets (VLAN trunk) in order for
the DSA rcv code to be able to decode source information.
But the bridge code has absolutely no idea which switch port is the CPU
port, if nothing else then just because there is no netdevice registered
by DSA for the CPU port.
Also DSA does not currently allow the user to specify that they want the
CPU port to do VLAN trunking anyway. VLANs are added to the CPU port
using the same flags as they were added on the user port.

So the VLANs installed by dsa_port_setup_8021q_tagging per driver
request should remain private from the bridge's and user's perspective,
and should not alter the VLAN semantics observed by the user.

In the current implementation a VLAN range ending at 4095 (VLAN_N_VID)
is reserved for this purpose. Each port receives a unique Rx VLAN and a
unique Tx VLAN. Separate VLANs are needed for Rx and Tx because they
serve different purposes: on Rx the switch must process traffic as
untagged and process it with a port-based VLAN, but with care not to
hinder bridging. On the other hand, the Tx VLAN is where the
reachability restrictions are imposed, since by tagging frames in the
xmit callback we are telling the switch onto which port to steer the
frame.

Some general guidance on how this support might be employed for
real-life hardware (some comments made by Florian Fainelli):

- If the hardware supports VLAN tag stacking, it should somehow back
  up its private VLAN settings when the bridge tries to override them.
  Then the driver could re-apply them as outer tags. Dedicating an outer
  tag per bridge device would allow identical inner tag VID numbers to
  co-exist, yet preserve broadcast domain isolation.

- If the switch cannot handle VLAN tag stacking, it should disable this
  port separation when added as slave to a vlan_filtering bridge, in
  that case having reduced functionality.

- Drivers for old switches that don't support the entire VLAN_N_VID
  range will need to rework the current range selection mechanism.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:52:42 -07:00
Pieter Jansen van Vuuren
88c44a5200 net/sched: add block pointer to tc_cls_common_offload structure
Some actions like the police action are stateful and could share state
between devices. This is incompatible with offloading to multiple devices
and drivers might want to test for shared blocks when offloading.
Store a pointer to the tcf_block structure in the tc_cls_common_offload
structure to allow drivers to determine when offloads apply to a shared
block.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:24 -07:00
Pieter Jansen van Vuuren
b7fe4ab8a6 net/sched: extend matchall offload for hardware statistics
Introduce a new command for matchall classifiers that allows hardware
to update statistics.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:24 -07:00
Pieter Jansen van Vuuren
8c8cfc6ed2 net/sched: add police action to the hardware intermediate representation
Add police action to the hardware intermediate representation which
would subsequently allow it to be used by drivers for offload.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:24 -07:00
Pieter Jansen van Vuuren
fa762da94d net/sched: move police action structures to header
Move tcf_police_params, tcf_police and tc_police_compat structures to a
header. Making them usable to other code for example drivers that would
offload police actions to hardware.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:24 -07:00
Pieter Jansen van Vuuren
dfcb19f0fa net/sched: remove unused functions for matchall offload
Cleanup unused functions and variables after porting to the newer
intermediate representation.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:24 -07:00
Pieter Jansen van Vuuren
ab79af32b0 mlxsw: use intermediate representation for matchall offload
Updates the Mellanox spectrum driver to use the newer intermediate
representation for flow actions in matchall offloads.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:23 -07:00
Pieter Jansen van Vuuren
f00cbf1968 net/sched: use the hardware intermediate representation for matchall
Extends matchall offload to make use of the hardware intermediate
representation. More specifically, this patch moves the native TC
actions in cls_matchall offload to the newer flow_action
representation. This ultimately allows us to avoid a direct
dependency on native TC actions for matchall.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:23 -07:00
Pieter Jansen van Vuuren
a7a7be6087 net/sched: add sample action to the hardware intermediate representation
Add sample action to the hardware intermediate representation model which
would subsequently allow it to be used by drivers for offload.

Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:49:23 -07:00
David S. Miller
1ffad6d1af Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

===================
Netfilter updates for net-next

The following batch contains Netfilter updates for net-next, they are:

1) Move nft_expr_clone() to nft_dynset, from Paul Gortmaker.

2) Do not include module.h from net/netfilter/nf_tables.h,
   also from Paul.

3) Restrict conntrack sysctl entries to boolean, from Tonghao Zhang.

4) Several patches to add infrastructure to autoload NAT helper
   modules from their respective conntrack helper, this also includes
   the first client of this code in OVS, patches from Flavio Leitner.

5) Add support to match for conntrack ID, from Brett Mastbergen.

6) Spelling fix in connlabel, from Colin Ian King.

7) Use struct_size() from hashlimit, from Gustavo A. R. Silva.

8) Add optimized version of nf_inet_addr_mask(), from Li RongQing.
===================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 21:35:08 -07:00
David S. Miller
19ab5f4023 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2019-05-05

Here's one more bluetooth-next pull request for 5.2:

 - Fixed Command Complete event handling check for matching opcode
 - Added support for Qualcomm WCN3998 controller, along with DT bindings
 - Added default address for Broadcom BCM2076B1 controllers

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 13:10:36 -07:00
David Ahern
9b3040a6aa ipv4: Define __ipv4_neigh_lookup_noref when CONFIG_INET is disabled
Define __ipv4_neigh_lookup_noref to return NULL when CONFIG_INET is disabled.

Fixes: 4b2a2bfeb3 ("neighbor: Call __ipv4_neigh_lookup_noref in neigh_xmit")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 11:25:49 -07:00
Eric Dumazet
47d3d7fdb1 ip6: fix skb leak in ip6frag_expire_frag_queue()
Since ip6frag_expire_frag_queue() now pulls the head skb
from frag queue, we should no longer use skb_get(), since
this leads to an skb leak.

Stefan Bader initially reported a problem in 4.4.stable [1] caused
by the skb_get(), so this patch should also fix this issue.

296583.091021] kernel BUG at /build/linux-6VmqmP/linux-4.4.0/net/core/skbuff.c:1207!
[296583.091734] Call Trace:
[296583.091749]  [<ffffffff81740e50>] __pskb_pull_tail+0x50/0x350
[296583.091764]  [<ffffffff8183939a>] _decode_session6+0x26a/0x400
[296583.091779]  [<ffffffff817ec719>] __xfrm_decode_session+0x39/0x50
[296583.091795]  [<ffffffff818239d0>] icmpv6_route_lookup+0xf0/0x1c0
[296583.091809]  [<ffffffff81824421>] icmp6_send+0x5e1/0x940
[296583.091823]  [<ffffffff81753238>] ? __netif_receive_skb+0x18/0x60
[296583.091838]  [<ffffffff817532b2>] ? netif_receive_skb_internal+0x32/0xa0
[296583.091858]  [<ffffffffc0199f74>] ? ixgbe_clean_rx_irq+0x594/0xac0 [ixgbe]
[296583.091876]  [<ffffffffc04eb260>] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
[296583.091893]  [<ffffffff8183d431>] icmpv6_send+0x21/0x30
[296583.091906]  [<ffffffff8182b500>] ip6_expire_frag_queue+0xe0/0x120
[296583.091921]  [<ffffffffc04eb27f>] nf_ct_frag6_expire+0x1f/0x30 [nf_defrag_ipv6]
[296583.091938]  [<ffffffff810f3b57>] call_timer_fn+0x37/0x140
[296583.091951]  [<ffffffffc04eb260>] ? nf_ct_net_exit+0x50/0x50 [nf_defrag_ipv6]
[296583.091968]  [<ffffffff810f5464>] run_timer_softirq+0x234/0x330
[296583.091982]  [<ffffffff8108a339>] __do_softirq+0x109/0x2b0

Fixes: d4289fcc9b ("net: IP6 defrag: use rbtrees for IPv6 defrag")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Stefan Bader <stefan.bader@canonical.com>
Cc: Peter Oskolkov <posk@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 10:40:24 -07:00
João Paulo Rechi Vita
f80c5dad7b Bluetooth: Ignore CC events not matching the last HCI command
This commit makes the kernel not send the next queued HCI command until
a command complete arrives for the last HCI command sent to the
controller. This change avoids a problem with some buggy controllers
(seen on two SKUs of QCA9377) that send an extra command complete event
for the previous command after the kernel had already sent a new HCI
command to the controller.

The problem was reproduced when starting an active scanning procedure,
where an extra command complete event arrives for the LE_SET_RANDOM_ADDR
command. When this happends the kernel ends up not processing the
command complete for the following commmand, LE_SET_SCAN_PARAM, and
ultimately behaving as if a passive scanning procedure was being
performed, when in fact controller is performing an active scanning
procedure. This makes it impossible to discover BLE devices as no device
found events are sent to userspace.

This problem is reproducible on 100% of the attempts on the affected
controllers. The extra command complete event can be seen at timestamp
27.420131 on the btmon logs bellow.

Bluetooth monitor ver 5.50
= Note: Linux version 5.0.0+ (x86_64)                                  0.352340
= Note: Bluetooth subsystem version 2.22                               0.352343
= New Index: 80:C5:F2:8F:87:84 (Primary,USB,hci0)               [hci0] 0.352344
= Open Index: 80:C5:F2:8F:87:84                                 [hci0] 0.352345
= Index Info: 80:C5:F2:8F:87:84 (Qualcomm)                      [hci0] 0.352346
@ MGMT Open: bluetoothd (privileged) version 1.14             {0x0001} 0.352347
@ MGMT Open: btmon (privileged) version 1.14                  {0x0002} 0.352366
@ MGMT Open: btmgmt (privileged) version 1.14                {0x0003} 27.302164
@ MGMT Command: Start Discovery (0x0023) plen 1       {0x0003} [hci0] 27.302310
        Address type: 0x06
          LE Public
          LE Random
< HCI Command: LE Set Random Address (0x08|0x0005) plen 6   #1 [hci0] 27.302496
        Address: 15:60:F2:91:B2:24 (Non-Resolvable)
> HCI Event: Command Complete (0x0e) plen 4                 #2 [hci0] 27.419117
      LE Set Random Address (0x08|0x0005) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Set Scan Parameters (0x08|0x000b) plen 7  #3 [hci0] 27.419244
        Type: Active (0x01)
        Interval: 11.250 msec (0x0012)
        Window: 11.250 msec (0x0012)
        Own address type: Random (0x01)
        Filter policy: Accept all advertisement (0x00)
> HCI Event: Command Complete (0x0e) plen 4                 #4 [hci0] 27.420131
      LE Set Random Address (0x08|0x0005) ncmd 1
        Status: Success (0x00)
< HCI Command: LE Set Scan Enable (0x08|0x000c) plen 2      #5 [hci0] 27.420259
        Scanning: Enabled (0x01)
        Filter duplicates: Enabled (0x01)
> HCI Event: Command Complete (0x0e) plen 4                 #6 [hci0] 27.420969
      LE Set Scan Parameters (0x08|0x000b) ncmd 1
        Status: Success (0x00)
> HCI Event: Command Complete (0x0e) plen 4                 #7 [hci0] 27.421983
      LE Set Scan Enable (0x08|0x000c) ncmd 1
        Status: Success (0x00)
@ MGMT Event: Command Complete (0x0001) plen 4        {0x0003} [hci0] 27.422059
      Start Discovery (0x0023) plen 1
        Status: Success (0x00)
        Address type: 0x06
          LE Public
          LE Random
@ MGMT Event: Discovering (0x0013) plen 2             {0x0003} [hci0] 27.422067
        Address type: 0x06
          LE Public
          LE Random
        Discovery: Enabled (0x01)
@ MGMT Event: Discovering (0x0013) plen 2             {0x0002} [hci0] 27.422067
        Address type: 0x06
          LE Public
          LE Random
        Discovery: Enabled (0x01)
@ MGMT Event: Discovering (0x0013) plen 2             {0x0001} [hci0] 27.422067
        Address type: 0x06
          LE Public
          LE Random
        Discovery: Enabled (0x01)

Signed-off-by: João Paulo Rechi Vita <jprvita@endlessm.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2019-05-05 19:29:04 +02:00
David Ahern
a5995e7107 ipv4: Move exception bucket to nh_common
Similar to the cached routes, make IPv4 exceptions accessible when
using an IPv6 nexthop struct with IPv4 routes. Simplify the exception
functions by passing in fib_nh_common since that is all it needs,
and then cleanup the call sites that have extraneous fib_nh conversions.

As with the cached routes this is a change in location only, from fib_nh
up to fib_nh_common; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 00:47:16 -07:00
David Ahern
0f457a3662 ipv4: Move cached routes to fib_nh_common
While the cached routes, nh_pcpu_rth_output and nh_rth_input, are IPv4
specific, a later patch wants to make them accessible for IPv6 nexthops
with IPv4 routes using a fib6_nh. Move the cached routes from fib_nh to
fib_nh_common and update references.

Initialization of the cached entries is moved to fib_nh_common_init,
and free is moved to fib_nh_common_release.

Change in location only, from fib_nh up to fib_nh_common; no functional
change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-05 00:47:16 -07:00
Michal Kubecek
b424e432e7 netlink: add validation of NLA_F_NESTED flag
Add new validation flag NL_VALIDATE_NESTED which adds three consistency
checks of NLA_F_NESTED_FLAG:

  - the flag is set on attributes with NLA_NESTED{,_ARRAY} policy
  - the flag is not set on attributes with other policies except NLA_UNSPEC
  - the flag is set on attribute passed to nla_parse_nested()

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>

v2: change error messages to mention NLA_F_NESTED explicitly
Reviewed-by: Johannes Berg <johannes@sipsolutions.net>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-04 01:27:11 -04:00
David S. Miller
ff24e4980a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three trivial overlapping conflicts.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-02 22:14:21 -04:00
Moshe Shemesh
b587bdaf5f devlink: Change devlink health locking mechanism
The devlink health reporters create/destroy and user commands currently
use the devlink->lock as a locking mechanism. Different reporters have
different rules in the driver and are being created/destroyed during
different stages of driver load/unload/running. So during execution of a
reporter recover the flow can go through another reporter's destroy and
create. Such flow leads to deadlock trying to lock a mutex already
held.

With the new locking mechanism the different reporters share mutex lock
only to protect access to shared reporters list.
Added refcount per reporter, to protect the reporters from destroy while
being used.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 11:07:03 -04:00
Xin Long
fbd019737d sctp: avoid running the sctp state machine recursively
Ying triggered a call trace when doing an asconf testing:

  BUG: scheduling while atomic: swapper/12/0/0x10000100
  Call Trace:
   <IRQ>  [<ffffffffa4375904>] dump_stack+0x19/0x1b
   [<ffffffffa436fcaf>] __schedule_bug+0x64/0x72
   [<ffffffffa437b93a>] __schedule+0x9ba/0xa00
   [<ffffffffa3cd5326>] __cond_resched+0x26/0x30
   [<ffffffffa437bc4a>] _cond_resched+0x3a/0x50
   [<ffffffffa3e22be8>] kmem_cache_alloc_node+0x38/0x200
   [<ffffffffa423512d>] __alloc_skb+0x5d/0x2d0
   [<ffffffffc0995320>] sctp_packet_transmit+0x610/0xa20 [sctp]
   [<ffffffffc098510e>] sctp_outq_flush+0x2ce/0xc00 [sctp]
   [<ffffffffc098646c>] sctp_outq_uncork+0x1c/0x20 [sctp]
   [<ffffffffc0977338>] sctp_cmd_interpreter.isra.22+0xc8/0x1460 [sctp]
   [<ffffffffc0976ad1>] sctp_do_sm+0xe1/0x350 [sctp]
   [<ffffffffc099443d>] sctp_primitive_ASCONF+0x3d/0x50 [sctp]
   [<ffffffffc0977384>] sctp_cmd_interpreter.isra.22+0x114/0x1460 [sctp]
   [<ffffffffc0976ad1>] sctp_do_sm+0xe1/0x350 [sctp]
   [<ffffffffc097b3a4>] sctp_assoc_bh_rcv+0xf4/0x1b0 [sctp]
   [<ffffffffc09840f1>] sctp_inq_push+0x51/0x70 [sctp]
   [<ffffffffc099732b>] sctp_rcv+0xa8b/0xbd0 [sctp]

As it shows, the first sctp_do_sm() running under atomic context (NET_RX
softirq) invoked sctp_primitive_ASCONF() that uses GFP_KERNEL flag later,
and this flag is supposed to be used in non-atomic context only. Besides,
sctp_do_sm() was called recursively, which is not expected.

Vlad tried to fix this recursive call in Commit c078669340 ("sctp: Fix
oops when sending queued ASCONF chunks") by introducing a new command
SCTP_CMD_SEND_NEXT_ASCONF. But it didn't work as this command is still
used in the first sctp_do_sm() call, and sctp_primitive_ASCONF() will
be called in this command again.

To avoid calling sctp_do_sm() recursively, we send the next queued ASCONF
not by sctp_primitive_ASCONF(), but by sctp_sf_do_prm_asconf() in the 1st
sctp_do_sm() directly.

Reported-by: Ying Xu <yinxu@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-05-01 09:18:57 -04:00
Andrew Lunn
93e86b3bc8 net: dsa: Remove legacy probing support
Now that all drivers can be probed using more traditional methods,
remove the legacy probe code.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 23:15:35 -04:00
Vladimir Oltean
cf2d45f5ba net: dsa: Add helper function to retrieve VLAN awareness setting
Since different types of hardware may or may not support this setting
per-port, DSA keeps it either in dsa_switch or in dsa_port.

While drivers may know the characteristics of their hardware and
retrieve it from the correct place without the need of helpers, it is
cumbersone to find out an unambigous answer from generic DSA code.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 23:05:29 -04:00
Vladimir Oltean
145746765f net: dsa: Keep the vlan_filtering setting in dsa_switch if it's global
The current behavior is not as obvious as one would assume (which is
that, if the driver set vlan_filtering_is_global = 1, then checking any
dp->vlan_filtering would yield the same result). Only the ports which
are actively enslaved into a bridge would have vlan_filtering set.

This makes it tricky for drivers to check what the global state is.
So fix this and make the struct dsa_switch hold this global setting.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 23:05:29 -04:00
Vladimir Oltean
8f5d16f638 net: dsa: Be aware of switches where VLAN filtering is a global setting
On some switches, the action of whether to parse VLAN frame headers and use
that information for ingress admission is configurable, but not per
port. Such is the case for the Broadcom BCM53xx and the NXP SJA1105
families, for example. In that case, DSA can prevent the bridge core
from trying to apply different VLAN filtering settings on net devices
that belong to the same switch.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Suggested-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 23:05:28 -04:00
Vladimir Oltean
33162e9a05 net: dsa: Store vlan_filtering as a property of dsa_port
This allows drivers to query the VLAN setting imposed by the bridge
driver directly from DSA, instead of keeping their own state based on
the .port_vlan_filtering callback.

Signed-off-by: Vladimir Oltean <olteanv@gmail.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 23:05:28 -04:00
David S. Miller
a658a3f2ec Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2019-04-30

1) A lot of work to remove indirections from the xfrm code.
   From Florian Westphal.

2) Support ESP offload in combination with gso partial.
   From Boris Pismenny.

3) Remove some duplicated code from vti4.
   From Jeremy Sowden.

Please note that there is merge conflict

between commit:

8742dc86d0 ("xfrm4: Fix uninitialized memory read in _decode_session4")

from the ipsec tree and commit:

c53ac41e37 ("xfrm: remove decode_session indirection from afinfo_policy")

from the ipsec-next tree. The merge conflict will appear
when those trees get merged during the merge window.
The conflict can be solved as it is done in linux-next:

https://lkml.org/lkml/2019/4/25/1207

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 09:26:13 -04:00
David S. Miller
b145745fc8 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2019-04-30

1) Fix an out-of-bound array accesses in __xfrm_policy_unlink.
   From YueHaibing.

2) Reset the secpath on failure in the ESP GRO handlers
   to avoid dereferencing an invalid pointer on error.
   From Myungho Jung.

3) Add and revert a patch that tried to add rcu annotations
   to netns_xfrm. From Su Yanjun.

4) Wait for rcu callbacks before freeing xfrm6_tunnel_spi_kmem.
   From Su Yanjun.

5) Fix forgotten vti4 ipip tunnel deregistration.
   From Jeremy Sowden:

6) Remove some duplicated log messages in vti4.
   From Jeremy Sowden.

7) Don't use IPSEC_PROTO_ANY when flushing states because
   this will flush only IPsec portocol speciffic states.
   IPPROTO_ROUTING states may remain in the lists when
   doing net exit. Fix this by replacing IPSEC_PROTO_ANY
   with zero. From Cong Wang.

8) Add length check for UDP encapsulation to fix "Oversized IP packet"
   warnings on receive side. From Sabrina Dubroca.

9) Fix xfrm interface lookup when the interface is associated to
   a vrf layer 3 master device. From Martin Willi.

10) Reload header pointers after pskb_may_pull() in _decode_session4(),
    otherwise we may read from uninitialized memory.

11) Update the documentation about xfrm[46]_gc_thresh, it
    is not used anymore after the flowcache removal.
    From Nicolas Dichtel.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-30 09:11:10 -04:00
Flavio Leitner
08010a2160 netfilter: add API to manage NAT helpers.
The API allows a conntrack helper to indicate its corresponding
NAT helper which then can be loaded and reference counted.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 14:19:55 +02:00
Flavio Leitner
e1f172e162 netfilter: use macros to create module aliases.
Each NAT helper creates a module alias which follows a pattern.
Use macros for consistency.

Signed-off-by: Flavio Leitner <fbl@redhat.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 14:19:54 +02:00
Tonghao Zhang
8f14c99c7e netfilter: conntrack: limit sysctl setting for boolean options
We use the zero and one to limit the boolean options setting.
After this patch we only set 0 or 1 to boolean options for nf
conntrack sysctl.

Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 14:18:56 +02:00
Paul Gortmaker
a4cb98f32c netfilter: nf_tables: drop include of module.h from nf_tables.h
Ideally, header files under include/linux shouldn't be adding
includes of other headers, in anticipation of their consumers,
but just the headers needed for the header itself to pass
parsing with CPP.

The module.h is particularly bad in this sense, as it itself does
include a whole bunch of other headers, due to the complexity of
module support.

Since nf_tables.h is not going into a module struct looking for
specific fields, we can just let it know that module is a struct,
just like about 60 other include/linux headers already do.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 13:35:32 +02:00
Paul Gortmaker
f1f86d09ca netfilter: nf_tables: relocate header content to consumer
The nf_tables.h header is used in a lot of files, but it turns out
that there is only one actual user of nft_expr_clone().

Hence we relocate that function to be with the one consumer of it
and avoid having to process it with CPP for all the other files.

This will also enable a reduction in the other headers that the
nf_tables.h itself has to include just to be stand-alone, hence
a pending further significant reduction in the CPP content that
needs to get processed for each netfilter file.

Note that the explicit "inline" has been dropped as part of this
relocation.  In similar changes to this, I believe Dave has asked
this be done, so we free up gcc to make the choice of whether to
inline or not.

Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-30 13:35:17 +02:00
Andrew Lunn
f81a43e8da dsa: Cleanup unneeded table and make tag structures static
Now that tag drivers dynamically register, we don't need the static
table. Remove it. This also means the tag driver structures can be
made static.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:01 -04:00
Andrew Lunn
d3b8c04988 dsa: Add boilerplate helper to register DSA tag driver modules
A DSA tag driver module will need to register the tag protocols it
implements with the DSA core. Add macros containing this boiler plate.

The registration/unregistration code is currently just a stub. A Later
patch will add the real implementation.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

v2
Fix indent of #endif
Rewrite to move list pointer into a new structure
v3
Move kdoc next to macro
Fix THIS_MODULE indentation

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:01 -04:00
Andrew Lunn
056eed2fb0 dsa: Add TAG protocol to tag ops
In order that we can match the tagging protocol a switch driver
request to the tagger, we need to know what protocol the tagger
supports. Add this information to the ops structure.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

v2
More tag protocol to end of structure to keep hot members at the beginning.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:01 -04:00
Andrew Lunn
0b42f03363 dsa: Add MODULE_ALIAS to taggers in preparation to become modules
When the tag drivers become modules, we will need to dynamically load
them based on what the switch drivers need. Add aliases to map between
the TAG protocol and the driver.

In order to do this, we need the tag protocol number as something
which the C pre-processor can stringinfy. Only the compiler knows the
value of an enum, CPP cannot use them. So add #defines.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:01 -04:00
Andrew Lunn
875138f81d dsa: Move tagger name into its ops structure
Rather than keep a list to map a tagger ops to a name, place the name
into the ops structure. This removes the hard coded list, a step
towards making the taggers more dynamic.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>

v2:
Move name to end of structure, keeping the hot entries at the beginning.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 19:41:00 -04:00
David S. Miller
5f0d736e7f Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-04-28

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Introduce BPF socket local storage map so that BPF programs can store
   private data they associate with a socket (instead of e.g. separate hash
   table), from Martin.

2) Add support for bpftool to dump BTF types. This is done through a new
   `bpftool btf dump` sub-command, from Andrii.

3) Enable BPF-based flow dissector for skb-less eth_get_headlen() calls which
   was currently not supported since skb was used to lookup netns, from Stanislav.

4) Add an opt-in interface for tracepoints to expose a writable context
   for attached BPF programs, used here for NBD sockets, from Matt.

5) BPF xadd related arm64 JIT fixes and scalability improvements, from Daniel.

6) Change the skb->protocol for bpf_skb_adjust_room() helper in order to
   support tunnels such as sit. Add selftests as well, from Willem.

7) Various smaller misc fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-28 08:42:41 -04:00
Johannes Berg
ef6243acb4 genetlink: optionally validate strictly/dumps
Add options to strictly validate messages and dump messages,
sometimes perhaps validating dump messages non-strictly may
be required, so add an option for that as well.

Since none of this can really be applied to existing commands,
set the options everwhere using the following spatch:

    @@
    identifier ops;
    expression X;
    @@
    struct genl_ops ops[] = {
    ...,
     {
            .cmd = X,
    +       .validate = GENL_DONT_VALIDATE_STRICT | GENL_DONT_VALIDATE_DUMP,
            ...
     },
    ...
    };

For new commands one should just not copy the .validate 'opt-out'
flags and thus get strict validation.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:07:22 -04:00
Johannes Berg
56738f4608 netlink: add strict parsing for future attributes
Unfortunately, we cannot add strict parsing for all attributes, as
that would break existing userspace. We currently warn about it, but
that's about all we can do.

For new attributes, however, the story is better: nobody is using
them, so we can reject bad sizes.

Also, for new attributes, we need not accept them when the policy
doesn't declare their usage.

David Ahern and I went back and forth on how to best encode this, and
the best way we found was to have a "boundary type", from which point
on new attributes have all possible validation applied, and NLA_UNSPEC
is rejected.

As we didn't want to add another argument to all functions that get a
netlink policy, the workaround is to encode that boundary in the first
entry of the policy array (which is for type 0 and thus probably not
really valid anyway). I put it into the validation union for the rare
possibility that somebody is actually using attribute 0, which would
continue to work fine unless they tried to use the extended validation,
which isn't likely. We also didn't find any in-tree users with type 0.

The reason for setting the "start strict here" attribute is that we
never really need to start strict from 0, which is invalid anyway (or
in legacy families where that isn't true, it cannot be set to strict),
so we can thus reserve the value 0 for "don't do this check" and don't
have to add the tag to all policies right now.

Thus, policies can now opt in to this validation, which we should do
for all existing policies, at least when adding new attributes.

Note that entirely *new* policies won't need to set it, as the use
of that should be using nla_parse()/nlmsg_parse() etc. which anyway
do fully strict validation now, regardless of this.

So in effect, this patch only covers the "existing command with new
attribute" case.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:07:22 -04:00
Johannes Berg
3de6440354 netlink: re-add parse/validate functions in strict mode
This re-adds the parse and validate functions like nla_parse()
that are now actually strict after the previous rename and were
just split out to make sure everything is converted (and if not
compilation of the previous patch would fail.)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:07:22 -04:00
Johannes Berg
8cb081746c netlink: make validation more configurable for future strictness
We currently have two levels of strict validation:

 1) liberal (default)
     - undefined (type >= max) & NLA_UNSPEC attributes accepted
     - attribute length >= expected accepted
     - garbage at end of message accepted
 2) strict (opt-in)
     - NLA_UNSPEC attributes accepted
     - attribute length >= expected accepted

Split out parsing strictness into four different options:
 * TRAILING     - check that there's no trailing data after parsing
                  attributes (in message or nested)
 * MAXTYPE      - reject attrs > max known type
 * UNSPEC       - reject attributes with NLA_UNSPEC policy entries
 * STRICT_ATTRS - strictly validate attribute size

The default for future things should be *everything*.
The current *_strict() is a combination of TRAILING and MAXTYPE,
and is renamed to _deprecated_strict().
The current regular parsing has none of this, and is renamed to
*_parse_deprecated().

Additionally it allows us to selectively set one of the new flags
even on old policies. Notably, the UNSPEC flag could be useful in
this case, since it can be arranged (by filling in the policy) to
not be an incompatible userspace ABI change, but would then going
forward prevent forgetting attribute entries. Similar can apply
to the POLICY flag.

We end up with the following renames:
 * nla_parse           -> nla_parse_deprecated
 * nla_parse_strict    -> nla_parse_deprecated_strict
 * nlmsg_parse         -> nlmsg_parse_deprecated
 * nlmsg_parse_strict  -> nlmsg_parse_deprecated_strict
 * nla_parse_nested    -> nla_parse_nested_deprecated
 * nla_validate_nested -> nla_validate_nested_deprecated

Using spatch, of course:
    @@
    expression TB, MAX, HEAD, LEN, POL, EXT;
    @@
    -nla_parse(TB, MAX, HEAD, LEN, POL, EXT)
    +nla_parse_deprecated(TB, MAX, HEAD, LEN, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, TB, MAX, POL, EXT;
    @@
    -nlmsg_parse_strict(NLH, HDRLEN, TB, MAX, POL, EXT)
    +nlmsg_parse_deprecated_strict(NLH, HDRLEN, TB, MAX, POL, EXT)

    @@
    expression TB, MAX, NLA, POL, EXT;
    @@
    -nla_parse_nested(TB, MAX, NLA, POL, EXT)
    +nla_parse_nested_deprecated(TB, MAX, NLA, POL, EXT)

    @@
    expression START, MAX, POL, EXT;
    @@
    -nla_validate_nested(START, MAX, POL, EXT)
    +nla_validate_nested_deprecated(START, MAX, POL, EXT)

    @@
    expression NLH, HDRLEN, MAX, POL, EXT;
    @@
    -nlmsg_validate(NLH, HDRLEN, MAX, POL, EXT)
    +nlmsg_validate_deprecated(NLH, HDRLEN, MAX, POL, EXT)

For this patch, don't actually add the strict, non-renamed versions
yet so that it breaks compile if I get it wrong.

Also, while at it, make nla_validate and nla_parse go down to a
common __nla_validate_parse() function to avoid code duplication.

Ultimately, this allows us to have very strict validation for every
new caller of nla_parse()/nlmsg_parse() etc as re-introduced in the
next patch, while existing things will continue to work as is.

In effect then, this adds fully strict validation for any new command.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:07:21 -04:00
Johannes Berg
6f455f5f4e netlink: add NLA_MIN_LEN
Rather than using NLA_UNSPEC for this type of thing, use NLA_MIN_LEN
so we can make NLA_UNSPEC be NLA_REJECT under certain conditions for
future attributes.

While at it, also use NLA_EXACT_LEN for the struct example.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:07:21 -04:00
Michal Kubecek
ae0be8de9a netlink: make nla_nest_start() add NLA_F_NESTED flag
Even if the NLA_F_NESTED flag was introduced more than 11 years ago, most
netlink based interfaces (including recently added ones) are still not
setting it in kernel generated messages. Without the flag, message parsers
not aware of attribute semantics (e.g. wireshark dissector or libmnl's
mnl_nlmsg_fprintf()) cannot recognize nested attributes and won't display
the structure of their contents.

Unfortunately we cannot just add the flag everywhere as there may be
userspace applications which check nlattr::nla_type directly rather than
through a helper masking out the flags. Therefore the patch renames
nla_nest_start() to nla_nest_start_noflag() and introduces nla_nest_start()
as a wrapper adding NLA_F_NESTED. The calls which add NLA_F_NESTED manually
are rewritten to use nla_nest_start().

Except for changes in include/net/netlink.h, the patch was generated using
this semantic patch:

@@ expression E1, E2; @@
-nla_nest_start(E1, E2)
+nla_nest_start_noflag(E1, E2)

@@ expression E1, E2; @@
-nla_nest_start_noflag(E1, E2 | NLA_F_NESTED)
+nla_nest_start(E1, E2)

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 17:03:44 -04:00
Jakub Kicinski
63a1c95f3f net/tls: byte swap device req TCP seq no upon setting
To avoid a sparse warning byteswap the be32 sequence number
before it's stored in the atomic value.  While at it drop
unnecessary brackets and use kernel's u64 type.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 16:52:21 -04:00
Jakub Kicinski
da68b4ad02 net/tls: move definition of tls ops into net/tls.h
There seems to be no reason for tls_ops to be defined in netdevice.h
which is included in a lot of places.  Don't wrap the struct/enum
declaration in ifdefs, it trickles down unnecessary ifdefs into
driver code.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 16:52:21 -04:00
Jakub Kicinski
9e9957973c net/tls: remove old exports of sk_destruct functions
tls_device_sk_destruct being set on a socket used to indicate
that socket is a kTLS device one.  That is no longer true -
now we use sk_validate_xmit_skb pointer for that purpose.
Remove the export.  tls_device_attach() needs to be moved.

While at it, remove the dead declaration of tls_sk_destruct().

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-27 16:52:21 -04:00
Martin KaFai Lau
6ac99e8f23 bpf: Introduce bpf sk local storage
After allowing a bpf prog to
- directly read the skb->sk ptr
- get the fullsock bpf_sock by "bpf_sk_fullsock()"
- get the bpf_tcp_sock by "bpf_tcp_sock()"
- get the listener sock by "bpf_get_listener_sock()"
- avoid duplicating the fields of "(bpf_)sock" and "(bpf_)tcp_sock"
  into different bpf running context.

this patch is another effort to make bpf's network programming
more intuitive to do (together with memory and performance benefit).

When bpf prog needs to store data for a sk, the current practice is to
define a map with the usual 4-tuples (src/dst ip/port) as the key.
If multiple bpf progs require to store different sk data, multiple maps
have to be defined.  Hence, wasting memory to store the duplicated
keys (i.e. 4 tuples here) in each of the bpf map.
[ The smallest key could be the sk pointer itself which requires
  some enhancement in the verifier and it is a separate topic. ]

Also, the bpf prog needs to clean up the elem when sk is freed.
Otherwise, the bpf map will become full and un-usable quickly.
The sk-free tracking currently could be done during sk state
transition (e.g. BPF_SOCK_OPS_STATE_CB).

The size of the map needs to be predefined which then usually ended-up
with an over-provisioned map in production.  Even the map was re-sizable,
while the sk naturally come and go away already, this potential re-size
operation is arguably redundant if the data can be directly connected
to the sk itself instead of proxy-ing through a bpf map.

This patch introduces sk->sk_bpf_storage to provide local storage space
at sk for bpf prog to use.  The space will be allocated when the first bpf
prog has created data for this particular sk.

The design optimizes the bpf prog's lookup (and then optionally followed by
an inline update).  bpf_spin_lock should be used if the inline update needs
to be protected.

BPF_MAP_TYPE_SK_STORAGE:
-----------------------
To define a bpf "sk-local-storage", a BPF_MAP_TYPE_SK_STORAGE map (new in
this patch) needs to be created.  Multiple BPF_MAP_TYPE_SK_STORAGE maps can
be created to fit different bpf progs' needs.  The map enforces
BTF to allow printing the sk-local-storage during a system-wise
sk dump (e.g. "ss -ta") in the future.

The purpose of a BPF_MAP_TYPE_SK_STORAGE map is not for lookup/update/delete
a "sk-local-storage" data from a particular sk.
Think of the map as a meta-data (or "type") of a "sk-local-storage".  This
particular "type" of "sk-local-storage" data can then be stored in any sk.

The main purposes of this map are mostly:
1. Define the size of a "sk-local-storage" type.
2. Provide a similar syscall userspace API as the map (e.g. lookup/update,
   map-id, map-btf...etc.)
3. Keep track of all sk's storages of this "type" and clean them up
   when the map is freed.

sk->sk_bpf_storage:
------------------
The main lookup/update/delete is done on sk->sk_bpf_storage (which
is a "struct bpf_sk_storage").  When doing a lookup,
the "map" pointer is now used as the "key" to search on the
sk_storage->list.  The "map" pointer is actually serving
as the "type" of the "sk-local-storage" that is being
requested.

To allow very fast lookup, it should be as fast as looking up an
array at a stable-offset.  At the same time, it is not ideal to
set a hard limit on the number of sk-local-storage "type" that the
system can have.  Hence, this patch takes a cache approach.
The last search result from sk_storage->list is cached in
sk_storage->cache[] which is a stable sized array.  Each
"sk-local-storage" type has a stable offset to the cache[] array.
In the future, a map's flag could be introduced to do cache
opt-out/enforcement if it became necessary.

The cache size is 16 (i.e. 16 types of "sk-local-storage").
Programs can share map.  On the program side, having a few bpf_progs
running in the networking hotpath is already a lot.  The bpf_prog
should have already consolidated the existing sock-key-ed map usage
to minimize the map lookup penalty.  16 has enough runway to grow.

All sk-local-storage data will be removed from sk->sk_bpf_storage
during sk destruction.

bpf_sk_storage_get() and bpf_sk_storage_delete():
------------------------------------------------
Instead of using bpf_map_(lookup|update|delete)_elem(),
the bpf prog needs to use the new helper bpf_sk_storage_get() and
bpf_sk_storage_delete().  The verifier can then enforce the
ARG_PTR_TO_SOCKET argument.  The bpf_sk_storage_get() also allows to
"create" new elem if one does not exist in the sk.  It is done by
the new BPF_SK_STORAGE_GET_F_CREATE flag.  An optional value can also be
provided as the initial value during BPF_SK_STORAGE_GET_F_CREATE.
The BPF_MAP_TYPE_SK_STORAGE also supports bpf_spin_lock.  Together,
it has eliminated the potential use cases for an equivalent
bpf_map_update_elem() API (for bpf_prog) in this patch.

Misc notes:
----------
1. map_get_next_key is not supported.  From the userspace syscall
   perspective,  the map has the socket fd as the key while the map
   can be shared by pinned-file or map-id.

   Since btf is enforced, the existing "ss" could be enhanced to pretty
   print the local-storage.

   Supporting a kernel defined btf with 4 tuples as the return key could
   be explored later also.

2. The sk->sk_lock cannot be acquired.  Atomic operations is used instead.
   e.g. cmpxchg is done on the sk->sk_bpf_storage ptr.
   Please refer to the source code comments for the details in
   synchronization cases and considerations.

3. The mem is charged to the sk->sk_omem_alloc as the sk filter does.

Benchmark:
---------
Here is the benchmark data collected by turning on
the "kernel.bpf_stats_enabled" sysctl.
Two bpf progs are tested:

One bpf prog with the usual bpf hashmap (max_entries = 8192) with the
sk ptr as the key. (verifier is modified to support sk ptr as the key
That should have shortened the key lookup time.)

Another bpf prog is with the new BPF_MAP_TYPE_SK_STORAGE.

Both are storing a "u32 cnt", do a lookup on "egress_skb/cgroup" for
each egress skb and then bump the cnt.  netperf is used to drive
data with 4096 connected UDP sockets.

BPF_MAP_TYPE_HASH with a modifier verifier (152ns per bpf run)
27: cgroup_skb  name egress_sk_map  tag 74f56e832918070b run_time_ns 58280107540 run_cnt 381347633
    loaded_at 2019-04-15T13:46:39-0700  uid 0
    xlated 344B  jited 258B  memlock 4096B  map_ids 16
    btf_id 5

BPF_MAP_TYPE_SK_STORAGE in this patch (66ns per bpf run)
30: cgroup_skb  name egress_sk_stora  tag d4aa70984cc7bbf6 run_time_ns 25617093319 run_cnt 390989739
    loaded_at 2019-04-15T13:47:54-0700  uid 0
    xlated 168B  jited 156B  memlock 4096B  map_ids 17
    btf_id 6

Here is a high-level picture on how are the objects organized:

       sk
    ┌──────┐
    │      │
    │      │
    │      │
    │*sk_bpf_storage─────▶ bpf_sk_storage
    └──────┘                 ┌───────┐
                 ┌───────────┤ list  │
                 │           │       │
                 │           │       │
                 │           │       │
                 │           └───────┘
                 │
                 │     elem
                 │  ┌────────┐
                 ├─▶│ snode  │
                 │  ├────────┤
                 │  │  data  │          bpf_map
                 │  ├────────┤        ┌─────────┐
                 │  │map_node│◀─┬─────┤  list   │
                 │  └────────┘  │     │         │
                 │              │     │         │
                 │     elem     │     │         │
                 │  ┌────────┐  │     └─────────┘
                 └─▶│ snode  │  │
                    ├────────┤  │
   bpf_map          │  data  │  │
 ┌─────────┐        ├────────┤  │
 │  list   ├───────▶│map_node│  │
 │         │        └────────┘  │
 │         │                    │
 │         │           elem     │
 └─────────┘        ┌────────┐  │
                 ┌─▶│ snode  │  │
                 │  ├────────┤  │
                 │  │  data  │  │
                 │  ├────────┤  │
                 │  │map_node│◀─┘
                 │  └────────┘
                 │
                 │
                 │          ┌───────┐
     sk          └──────────│ list  │
  ┌──────┐                  │       │
  │      │                  │       │
  │      │                  │       │
  │      │                  └───────┘
  │*sk_bpf_storage───────▶bpf_sk_storage
  └──────┘

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-04-27 09:07:04 -07:00
David S. Miller
30e5a9a5ba Various updates, notably:
* extended key ID support (from 802.11-2016)
  * per-STA TX power control support
  * mac80211 TX performance improvements
  * HE (802.11ax) updates
  * mesh link probing support
  * enhancements of multi-BSSID support (also related to HE)
  * OWE userspace processing support
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlzC5YYACgkQB8qZga/f
 l8QDWg/+N7wm+l7bTMx4hjJzZZ60n9fBvyGJx0gsnPVH8wdOiPoh/epuI04I8I4m
 pGNbGvPB9Z4z2tD56XsIQnXf88ab3R27bRupSSW1vtzVSbDhg8wQ7jg0nABrdyDS
 PgoTmDMfVERLewXdntqRANzVYGfoWSOzo1u6A0Xhys8FqxxX/eD+Vdo4dKzmeN47
 +LDfuCpInVPn0TOpFp5IJ4+B4a0dhkz2/Q1BOE7NquXVvk4X77VJohV/BgQJ04Io
 yt7mn5rzYM6j4o1XLACxUEHkXvht6h34abG0yHRnuoAEp/sdPz2jAXT4OxYqs6x0
 XdLdr8gZgkMnnYaOQef74uJ2Ku+4A1ootjXSPazA7BWX0X5GqHnET/INk2S6cQPj
 C95LYfKC0ICD0qfioBmmHx8icDGoovcaswCju2ozfqWaD4Lwr3BcesnNDFtkHD9o
 aYaTTGGSxFyr2bZWTDpv4D4H5g3V4srRJsXs+SokL54nvlwd/smUJ4PVTLomP9y2
 XswRtLdoiUsCrJy967CXfhsxnE5SRhmBQE38Jq8/pzetlRk2spvJJC5MGYF0O/nT
 0UHbrjBCFUT2s8jv+gWWabOBUovsNJlgaxFwrZ/eNVIk0DK0ERoMV3V4MktU8uza
 Y339T14kxw4wlY2z5pOmEgkxmKZbPb55dBba04JEZzz9zDTawTk=
 =JQOx
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2019-04-26' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Various updates, notably:
 * extended key ID support (from 802.11-2016)
 * per-STA TX power control support
 * mac80211 TX performance improvements
 * HE (802.11ax) updates
 * mesh link probing support
 * enhancements of multi-BSSID support (also related to HE)
 * OWE userspace processing support
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26 16:05:52 -04:00
Rajkumar Manoharan
8828f81ad4 mac80211: probe unexercised mesh links
The requirement for mesh link metric refreshing, is that from one
mesh point we be able to send some data frames to other mesh points
which are not currently selected as a primary traffic path, but which
are only 1 hop away. The absence of the primary path to the chosen node
makes it necessary to apply some form of marking on a chosen packet
stream so that the packets can be properly steered to the selected node
for testing, and not by the regular mesh path lookup.

Tested-by: Pradeep Kumar Chitrapu <pradeepc@codeaurora.org>
Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Rajkumar Manoharan
5ab92e7fe4 cfg80211: add support to probe unexercised mesh link
Adding support to allow mesh HWMP to measure link metrics on unexercised
direct mesh path by sending some data frames to other mesh points which
are not currently selected as a primary traffic path but only 1 hop away.
The absence of the primary path to the chosen node makes it necessary to
apply some form of marking on a chosen packet stream so that the packets
can be properly steered to the selected node for testing, and not by the
regular mesh path lookup.

Tested-by: Pradeep Kumar Chitrapu <pradeepc@codeaurora.org>
Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Dan Carpenter
5809a5d54b cfg80211: don't pass pointer to pointer unnecessarily
The cfg80211_merge_profile() and ieee802_11_find_bssid_profile() are
a bit cleaner if we just pass the merged_ie pointer instead of a pointer
to the pointer.

This isn't a functional change, it's just a clean up.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Ashok Raj Nagarajan
ba905bf432 mac80211: store tx power value from user to station
This patch introduce a new driver callback drv_sta_set_txpwr. This API will
copy the transmit power value passed from user space and call the driver
callback to set the tx power for the station.

Co-developed-by: Balaji Pothunoori <bpothuno@codeaurora.org>
Signed-off-by: Ashok Raj Nagarajan <arnagara@codeaurora.org>
Signed-off-by: Balaji Pothunoori <bpothuno@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Ashok Raj Nagarajan
e96d1cd263 cfg80211: Add support to set tx power for a station associated
This patch adds support to set transmit power setting type and transmit
power level attributes to NL80211_CMD_SET_STATION in order to facilitate
adjusting the transmit power level of a station associated to the AP.

The added attributes allow selection of automatic and limited transmit
power level, with the level defined in dBm format.

Co-developed-by: Balaji Pothunoori <bpothuno@codeaurora.org>
Signed-off-by: Ashok Raj Nagarajan <arnagara@codeaurora.org>
Signed-off-by: Balaji Pothunoori <bpothuno@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Alexander Wetzel
96fc6efb9a mac80211: IEEE 802.11 Extended Key ID support
Add support for Extended Key ID as defined in IEEE 802.11-2016.

 - Implement the nl80211 API for Extended Key ID
 - Extend mac80211 API to allow drivers to support Extended Key ID
 - Enable Extended Key ID by default for drivers only supporting SW
   crypto (e.g. mac80211_hwsim)
 - Allow unicast Tx usage to be supressed (IEEE80211_KEY_FLAG_NO_AUTO_TX)
 - Select the decryption key based on the MPDU keyid
 - Enforce existing assumptions in the code that rekeys don't change the
   cipher

Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
[remove module parameter]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Alexander Wetzel
6cdd3979a2 nl80211/cfg80211: Extended Key ID support
Add support for IEEE 802.11-2016 "Extended Key ID for Individually
Addressed Frames".

Extend cfg80211 and nl80211 to allow pairwise keys to be installed for
Rx only, enable Tx separately and allow Key ID 1 for pairwise keys.

Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
[use NLA_POLICY_RANGE() for NL80211_KEY_MODE]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Felix Fietkau
f2af2df800 mac80211: calculate hash for fq without holding fq->lock in itxq enqueue
Reduces lock contention on enqueue/dequeue of iTXQ packets

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Sara Sharon
fe806e4992 cfg80211: support profile split between elements
Since an element is limited to 255 octets, a profile may be split
split to several elements. Support the split as defined in the 11ax
draft 3. Detect legacy split and print a net-rate limited warning,
since there is no ROI in supporting this probably non-existent
split.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
Sara Sharon
f7dacfb114 cfg80211: support non-inheritance element
Subelement profile may specify element IDs it doesn't inherit
from the management frame. Support it.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-26 13:02:11 +02:00
David S. Miller
89eb6e0967 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2019-04-25

Here's the main bluetooth-next pull request for the 5.2 kernel.

 - Added support for Mediatek SDIO controllers
 - Added support for Broadcom BCM2076B1 UART controller
 - Added support for Marvel SD8987 chipset
 - Fix buffer overflow bug in hidp protocol
 - Various other smaller fixes & improvements

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-26 01:39:38 -04:00
David S. Miller
8b44836583 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Two easy cases of overlapping changes.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-25 23:52:29 -04:00
Marcel Holtmann
d5bb334a8e Bluetooth: Align minimum encryption key size for LE and BR/EDR connections
The minimum encryption key size for LE connections is 56 bits and to
align LE with BR/EDR, enforce 56 bits of minimum encryption key size for
BR/EDR connections as well.

Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Cc: stable@vger.kernel.org
2019-04-25 06:18:07 -07:00
David Ahern
a65120bae4 ipv6: Use result arg in fib_lookup_arg consistently
arg.result is sometimes used as fib6_result and sometimes used to
hold the rt6_info. Add rt6_info to fib6_result and make the use
of arg.result consistent through ipv6 rules.

The rt6 entry is filled in for lookups returning a dst_entry, but not
for direct fib_lookups that just want a fib6_info.

Fixes: effda4dd97 ("ipv6: Pass fib6_result to fib lookups")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 21:53:11 -07:00
David Ahern
ecc5663cce net: Change nhc_flags to unsigned char
nhc_flags holds the RTNH_F flags for a given nexthop (fib{6}_nh).
All of the RTNH_F_ flags fit in an unsigned char, and since the API to
userspace (rtnh_flags and lower byte of rtm_flags) is 1 byte it can not
grow. Make nhc_flags in fib_nh_common an unsigned char and shrink the
size of the struct by 8, from 56 to 48 bytes.

Update the flags arguments for up netdevice events and fib_nexthop_info
which determines the RTNH_F flags to return on a dump/event. The RTNH_F
flags are passed in the lower byte of rtm_flags which is an unsigned int
so use a temp variable for the flags to fib_nexthop_info and combine
with rtm_flags in the caller.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 19:44:18 -07:00
David Ahern
ffa8ce54be lwtunnel: Pass encap and encap type attributes to lwtunnel_fill_encap
Currently, lwtunnel_fill_encap hardcodes the encap and encap type
attributes as RTA_ENCAP and RTA_ENCAP_TYPE, respectively. The nexthop
objects want to re-use this code but the encap attributes passed to
userspace as NHA_ENCAP and NHA_ENCAP_TYPE. Since that is the only
difference, change lwtunnel_fill_encap to take the attribute type as
an input.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 19:42:29 -07:00
Eric Dumazet
f05713e091 ipv6: convert fib6_ref to refcount_t
We suspect some issues involving fib6_ref 0 -> 1 transitions might
cause strange syzbot reports.

Lets convert fib6_ref to refcount_t to catch them earlier.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Wei Wang <weiwan@google.com>
Acked-by: Wei Wang <weiwan@google.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-23 17:19:48 -07:00
Stanislav Fomichev
089b19a920 flow_dissector: switch kernel context to struct bpf_flow_dissector
struct bpf_flow_dissector has a small subset of sk_buff fields that
flow dissector BPF program is allowed to access and an optional
pointer to real skb. Real skb is used only in bpf_skb_load_bytes
helper to read non-linear data.

The real motivation for this is to be able to call flow dissector
from eth_get_headlen context where we don't have an skb and need
to dissect raw bytes.

Signed-off-by: Stanislav Fomichev <sdf@google.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-04-23 18:36:33 +02:00
Florian Westphal
bb9cd077e2 xfrm: remove unneeded export_symbols
None of them have any external callers, make them static.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-23 07:42:20 +02:00
Florian Westphal
c53ac41e37 xfrm: remove decode_session indirection from afinfo_policy
No external dependencies, might as well handle this directly.
xfrm_afinfo_policy is now 40 bytes on x86_64.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-23 07:42:20 +02:00
Florian Westphal
2e8b4aa816 xfrm: remove init_path indirection from afinfo_policy
handle this directly, its only used by ipv6.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-23 07:42:20 +02:00
Florian Westphal
f24ea52873 xfrm: remove tos indirection from afinfo_policy
Only used by ipv4, we can read the fl4 tos value directly instead.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-23 07:42:20 +02:00
Ido Schimmel
f2ad1a522e net: devlink: Add extack to shared buffer operations
Add extack to shared buffer set operations, so that meaningful error
messages could be propagated to the user.

Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 22:09:32 -07:00
Paul Gortmaker
a130f9b275 net: tc_act: drop include of module.h from tc_ife.h
Ideally, header files under include/linux shouldn't be adding
includes of other headers, in anticipation of their consumers,
but just the headers needed for the header itself to pass
parsing with CPP.

The module.h is particularly bad in this sense, as it itself does
include a whole bunch of other headers, due to the complexity of
module support.

Since tc_ife.h is not going into a module struct looking for
specific fields, we can just let it know that module is a struct,
just like about 60 other include/linux headers already do.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:50:53 -07:00
Paul Gortmaker
113e632866 net: fib: drop include of module.h from fib_notifier.h
Ideally, header files under include/linux shouldn't be adding
includes of other headers, in anticipation of their consumers,
but just the headers needed for the header itself to pass
parsing with CPP.

The module.h is particularly bad in this sense, as it itself does
include a whole bunch of other headers, due to the complexity of
module support.

Since fib_notifier.h is not going into a module struct looking for
specific fields, we can just let it know that module is a struct,
just like about 60 other include/linux headers already do.

Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:50:53 -07:00
Paul Gortmaker
c517796ea9 net: ife: drop include of module.h from net/ife.h
Ideally, header files under include/linux shouldn't be adding
includes of other headers, in anticipation of their consumers,
but just the headers needed for the header itself to pass
parsing with CPP.

The module.h is particularly bad in this sense, as it itself does
include a whole bunch of other headers, due to the complexity of
module support.

There doesn't appear to be anything in net/ife.h that is module
related, and build coverage doesn't appear to show any other
files/drivers relying implicitly on getting it from here.

So it appears we are simply free to just remove it in this case.

Cc: Yotam Gigi <yotam.gi@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:50:53 -07:00
Paul Gortmaker
a79eda3aaf net: psample: drop include of module.h from psample.h
Ideally, header files under include/linux shouldn't be adding
includes of other headers, in anticipation of their consumers,
but just the headers needed for the header itself to pass
parsing with CPP.

The module.h is particularly bad in this sense, as it itself does
include a whole bunch of other headers, due to the complexity of
module support.

There doesn't appear to be anything in psample.h that is module
related, and build coverage doesn't appear to show any other
files/drivers relying implicitly on getting it from here.

So it appears we are simply free to just remove it in this case.

Cc: Yotam Gigi <yotam.gi@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:50:53 -07:00
David Ahern
3c618c1dbb net: Rename net/nexthop.h net/rtnh.h
The header contains rtnh_ macros so rename the file accordingly.
Allows a later patch to use the nexthop.h name for the new
nexthop code.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:47:25 -07:00
David Ahern
7e5f4cdb28 ipv6: Remove fib6_info_nh_lwt
fib6_info_nh_lwt is no longer used; remove it.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:46:42 -07:00
Daniel T. Lee
0b13c9bb96 include/net/tcp.h: whitespace cleanup at tcp_v4_check
This patch makes trivial whitespace fix to the function
tcp_v4_check at include/net/tcp.h file.

It has stylistic issue, which is "space required after that ','"
and it can be confirmed with ./scripts/checkpatch.pl tool.

    ERROR: space required after that ',' (ctx:VxV)
    #29: FILE: include/net/tcp.h:1317:
    +	        return csum_tcpudp_magic(saddr,daddr,len,IPPROTO_TCP,base);
         	                              ^

Signed-off-by: Daniel T. Lee <danieltimlee@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-22 21:45:58 -07:00
David Ahern
be659b8d3c ipv6: Restore RTF_ADDRCONF check in rt6_qualify_for_ecmp
The RTF_ADDRCONF flag filters out routes added by RA's in determining
which routes can be appended to an existing one to create a multipath
route. Restore the flag check and add a comment to document the RA piece.

Fixes: 4e54507ab1 ("ipv6: Simplify rt6_qualify_for_ecmp")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-21 19:44:16 -07:00
David Ahern
4e54507ab1 ipv6: Simplify rt6_qualify_for_ecmp
After commit c7a1ce397a ("ipv6: Change addrconf_f6i_alloc to use
ip6_route_info_create"), the gateway is no longer filled in for fib6_nh
structs in a prefix route. Accordingly, the RTF_ADDRCONF flag check can
be dropped from the 'rt6_qualify_for_ecmp'.

Further, RTF_DYNAMIC is only set in rt6_info instances, so it can be
removed from the check as well.

This reduces rt6_qualify_for_ecmp and the mlxsw version to just checking
if the nexthop has a gateway which is the real indication of whether
entries can be coalesced into a multipath route.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-21 10:39:52 -07:00
Arnd Bergmann
c7cbdbf29f net: rework SIOCGSTAMP ioctl handling
The SIOCGSTAMP/SIOCGSTAMPNS ioctl commands are implemented by many
socket protocol handlers, and all of those end up calling the same
sock_get_timestamp()/sock_get_timestampns() helper functions, which
results in a lot of duplicate code.

With the introduction of 64-bit time_t on 32-bit architectures, this
gets worse, as we then need four different ioctl commands in each
socket protocol implementation.

To simplify that, let's add a new .gettstamp() operation in
struct proto_ops, and move ioctl implementation into the common
sock_ioctl()/compat_sock_ioctl_trans() functions that these all go
through.

We can reuse the sock_get_timestamp() implementation, but generalize
it so it can deal with both native and compat mode, as well as
timeval and timespec structures.

Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marc Kleine-Budde <mkl@pengutronix.de>
Link: https://lore.kernel.org/lkml/CAK8P3a038aDQQotzua_QtKGhq8O9n+rdiz2=WDCp82ys8eUT+A@mail.gmail.com/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-19 14:07:40 -07:00
Stephen Suryaputra
0bc1998544 ipv6: Add rate limit mask for ICMPv6 messages
To make ICMPv6 closer to ICMPv4, add ratemask parameter. Since the ICMP
message types use larger numeric values, a simple bitmask doesn't fit.
I use large bitmap. The input and output are the in form of list of
ranges. Set the default to rate limit all error messages but Packet Too
Big. For Packet Too Big, use ratemask instead of hard-coded.

There are functions where icmpv6_xrlim_allow() and icmpv6_global_allow()
aren't called. This patch only adds them to icmpv6_echo_reply().

Rate limiting error messages is mandated by RFC 4443 but RFC 4890 says
that it is also acceptable to rate limit informational messages. Thus,
I removed the current hard-coded behavior of icmpv6_mask_allow() that
doesn't rate limit informational messages.

v2: Add dummy function proc_do_large_bitmap() if CONFIG_PROC_SYSCTL
    isn't defined, expand the description in ip-sysctl.txt and remove
    unnecessary conditional before kfree().
v3: Inline the bitmap instead of dynamically allocated. Still is a
    pointer to it is needed because of the way proc_do_large_bitmap work.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-18 16:58:37 -07:00
David Ahern
b8fb1ab461 net ipv6: Prevent neighbor add if protocol is disabled on device
Disabling IPv6 on an interface removes existing entries but nothing prevents
new entries from being manually added. To that end, add a new neigh_table
operation, allow_add, that is called on RTM_NEWNEIGH to see if neighbor
entries are allowed on a given device. If IPv6 is disabled on the device,
allow_add returns false and passes a message back to the user via extack.

  $ echo 1 > /proc/sys/net/ipv6/conf/eth1/disable_ipv6
  $ ip -6 neigh add fe80::4c88:bff:fe21:2704 dev eth1 lladdr de:ad:be:ef:01:01
  Error: IPv6 is disabled on this device.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:19:07 -07:00
David Ahern
7d21fec904 ipv6: Add fib6_type and fib6_flags to fib6_result
Add the fib6_flags and fib6_type to fib6_result. Update the lookup helpers
to set them and update post fib lookup users to use the version from the
result.

This allows nexthop objects to have blackhole nexthop.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:11:30 -07:00
David Ahern
effda4dd97 ipv6: Pass fib6_result to fib lookups
Change fib6_lookup and fib6_table_lookup to take a fib6_result and set
f6i and nh rather than returning a fib6_info. For now both always
return 0.

A later patch set can make these more like the IPv4 counterparts and
return EINVAL, EACCESS, etc based on fib6_type.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:10:47 -07:00
David Ahern
b748f26092 ipv6: Pass fib6_result to ip6_mtu_from_fib6 and fib6_mtu
Change ip6_mtu_from_fib6 and fib6_mtu to take a fib6_result over a
fib6_info. Update both to use the fib6_nh from fib6_result.

Since the signature of ip6_mtu_from_fib6 is already changing, add const
to daddr and saddr.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:10:46 -07:00
David Ahern
b1d4099150 ipv6: Rename fib6_multipath_select and pass fib6_result
Add 'struct fib6_result' to hold the fib entry and fib6_nh from a fib
lookup as separate entries, similar to what IPv4 now has with fib_result.

Rename fib6_multipath_select to fib6_select_path, pass fib6_result to
it, and set f6i and nh in the result once a path selection is done.
Call fib6_select_path unconditionally for path selection which means
moving the sibling and oif check to fib6_select_path. To handle the two
different call paths (2 only call multipath_select if flowi6_oif == 0 and
the other always calls it), add a new have_oif_match that controls the
sibling walk if relevant.

Update callers of fib6_multipath_select accordingly and have them use the
fib6_info and fib6_nh from the result.

This is needed for multipath nexthop objects where a single f6i can
point to multiple fib6_nh (similar to IPv4).

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 23:08:51 -07:00
David S. Miller
6b0a7f84ea Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflict resolution of af_smc.c from Stephen Rothwell.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-17 11:26:25 -07:00
Xin Long
9dde27de3e sctp: implement memory accounting on rx path
sk_forward_alloc's updating is also done on rx path, but to be consistent
we change to use sk_mem_charge() in sctp_skb_set_owner_r().

In sctp_eat_data(), it's not enough to check sctp_memory_pressure only,
which doesn't work for mem_cgroup_sockets_enabled, so we change to use
sk_under_memory_pressure().

When it's under memory pressure, sk_mem_reclaim() and sk_rmem_schedule()
should be called on both RENEGE or CHUNK DELIVERY path exit the memory
pressure status as soon as possible.

Note that sk_rmem_schedule() is using datalen to make things easy there.

Reported-by: Matteo Croce <mcroce@redhat.com>
Tested-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 13:36:51 -07:00
David S. Miller
95337b9821 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Remove the broute pseudo hook, implement this from the bridge
   prerouting hook instead. Now broute becomes real table in ebtables,
   from Florian Westphal. This also includes a size reduction patch for the
   bridge control buffer area via squashing boolean into bitfields and
   a selftest.

2) Add OS passive fingerprint version matching, from Fernando Fernandez.

3) Support for gue encapsulation for IPVS, from Jacky Hu.

4) Add support for NAT to the inet family, from Florian Westphal.
   This includes support for masquerade, redirect and nat extensions.

5) Skip interface lookup in flowtable, use device in the dst object.

6) Add jiffies64_to_msecs() and use it, from Li RongQing.

7) Remove unused parameter in nf_tables_set_desc_parse(), from Colin Ian King.

8) Statify several functions, patches from YueHaibing and Florian Westphal.

9) Add an optimized version of nf_inet_addr_cmp(), from Li RongQing.

10) Merge route extension to core, also from Florian.

11) Use IS_ENABLED(CONFIG_NF_NAT) instead of NF_NAT_NEEDED, from Florian.

12) Merge ip/ip6 masquerade extensions, from Florian. This includes
    netdevice notifier unification.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-15 12:07:35 -07:00
Florian Westphal
3c79107631 netfilter: ctnetlink: don't use conntrack/expect object addresses as id
else, we leak the addresses to userspace via ctnetlink events
and dumps.

Compute an ID on demand based on the immutable parts of nf_conn struct.

Another advantage compared to using an address is that there is no
immediate re-use of the same ID in case the conntrack entry is freed and
reallocated again immediately.

Fixes: 3583240249 ("[NETFILTER]: nf_conntrack_expect: kill unique ID")
Fixes: 7f85f91472 ("[NETFILTER]: nf_conntrack: kill unique ID")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-15 07:31:44 +02:00
Florian Westphal
1025ce7521 netfilter: conntrack: don't set related state for different outer address
Luca Moro says:
 ------
The issue lies in the filtering of ICMP and ICMPv6 errors that include an
inner IP datagram.
For these packets, icmp_error_message() extract the ICMP error and inner
layer to search of a known state.
If a state is found the packet is tagged as related (IP_CT_RELATED).

The problem is that there is no correlation check between the inner and
outer layer of the packet.
So one can encapsulate an error with an inner layer matching a known state,
while its outer layer is directed to a filtered host.
In this case the whole packet will be tagged as related.
This has various implications from a rule bypass (if a rule to related
trafic is allow), to a known state oracle.

Unfortunately, we could not find a real statement in a RFC on how this case
should be filtered.
The closest we found is RFC5927 (Section 4.3) but it is not very clear.

A possible fix would be to check that the inner IP source is the same than
the outer destination.

We believed this kind of attack was not documented yet, so we started to
write a blog post about it.
You can find it attached to this mail (sorry for the extract quality).
It contains more technical details, PoC and discussion about the identified
behavior.
We discovered later that
https://www.gont.com.ar/papers/filtering-of-icmp-error-messages.pdf
described a similar attack concept in 2004 but without the stateful
filtering in mind.
 -----

This implements above suggested fix:
In icmp(v6) error handler, take outer destination address, then pass
that into the common function that does the "related" association.

After obtaining the nf_conn of the matching inner-headers connection,
check that the destination address of the opposite direction tuple
is the same as the outer address and only set RELATED if thats the case.

Reported-by: Luca Moro <luca.moro@synacktiv.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-13 14:52:57 +02:00
Marc Dionne
4611da30d6 rxrpc: Make rxrpc_kernel_check_life() indicate if call completed
Make rxrpc_kernel_check_life() pass back the life counter through the
argument list and return true if the call has not yet completed.

Suggested-by: Marc Dionne <marc.dionne@auristor.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-12 16:57:23 -07:00
David Miller
013b96ec64 sctp: Pass sk_buff_head explicitly to sctp_ulpq_tail_event().
Now the SKB list implementation assumption can be removed.

And now that we know that the list head is always non-NULL
we can remove the code blocks dealing with that as well.

Signed-off-by: David S. Miller <davem@davemloft.net>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 21:33:31 -07:00
Florian Westphal
971502d77f bridge: netfilter: unroll NF_HOOK helper in bridge input path
Replace NF_HOOK() based invocation of the netfilter hooks with a private
copy of nf_hook_slow().

This copy has one difference: it can return the rx handler value expected
by the stack, i.e. RX_HANDLER_CONSUMED or RX_HANDLER_PASS.

This is needed by the next patch to invoke the ebtables
"broute" table via the standard netfilter hooks rather than the custom
"br_should_route_hook" indirection that is used now.

When the skb is to be "brouted", we must return RX_HANDLER_PASS from the
bridge rx input handler, but there is no way to indicate this via
NF_HOOK(), unless perhaps by some hack such as exposing bridge_cb in the
netfilter core or a percpu flag.

  text    data     bss     dec   filename
  3369      56       0    3425   net/bridge/br_input.o.before
  3458      40       0    3498   net/bridge/br_input.o.after

This allows removal of the "br_should_route_hook" in the next patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: David S. Miller <davem@davemloft.net>
Acked-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-12 01:47:39 +02:00
David Ahern
cc3a86c802 ipv6: Change rt6_probe to take a fib6_nh
rt6_probe sends probes for gateways in a nexthop. As such it really
depends on a fib6_nh, not a fib entry. Move last_probe to fib6_nh and
update rt6_probe to a fib6_nh struct.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 14:24:06 -07:00
YueHaibing
d3706566ae net: netrom: Fix error cleanup path of nr_proto_init
Syzkaller report this:

BUG: unable to handle kernel paging request at fffffbfff830524b
PGD 237fe8067 P4D 237fe8067 PUD 237e64067 PMD 1c9716067 PTE 0
Oops: 0000 [#1] SMP KASAN PTI
CPU: 1 PID: 4465 Comm: syz-executor.0 Not tainted 5.0.0+ #5
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014
RIP: 0010:__list_add_valid+0x21/0xe0 lib/list_debug.c:23
Code: 8b 0c 24 e9 17 fd ff ff 90 55 48 89 fd 48 8d 7a 08 53 48 89 d3 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 48 83 ec 08 <80> 3c 02 00 0f 85 8b 00 00 00 48 8b 53 08 48 39 f2 75 35 48 89 f2
RSP: 0018:ffff8881ea2278d0 EFLAGS: 00010282
RAX: dffffc0000000000 RBX: ffffffffc1829250 RCX: 1ffff1103d444ef4
RDX: 1ffffffff830524b RSI: ffffffff85659300 RDI: ffffffffc1829258
RBP: ffffffffc1879250 R08: fffffbfff0acb269 R09: fffffbfff0acb269
R10: ffff8881ea2278f0 R11: fffffbfff0acb268 R12: ffffffffc1829250
R13: dffffc0000000000 R14: 0000000000000008 R15: ffffffffc187c830
FS:  00007fe0361df700(0000) GS:ffff8881f7300000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: fffffbfff830524b CR3: 00000001eb39a001 CR4: 00000000007606e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
 __list_add include/linux/list.h:60 [inline]
 list_add include/linux/list.h:79 [inline]
 proto_register+0x444/0x8f0 net/core/sock.c:3375
 nr_proto_init+0x73/0x4b3 [netrom]
 ? 0xffffffffc1628000
 ? 0xffffffffc1628000
 do_one_initcall+0xbc/0x47d init/main.c:887
 do_init_module+0x1b5/0x547 kernel/module.c:3456
 load_module+0x6405/0x8c10 kernel/module.c:3804
 __do_sys_finit_module+0x162/0x190 kernel/module.c:3898
 do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x462e99
Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007fe0361dec58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99
RDX: 0000000000000000 RSI: 0000000020000100 RDI: 0000000000000003
RBP: 00007fe0361dec70 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe0361df6bc
R13: 00000000004bcefa R14: 00000000006f6fb0 R15: 0000000000000004
Modules linked in: netrom(+) ax25 fcrypt pcbc af_alg arizona_ldo1 v4l2_common videodev media v4l2_dv_timings hdlc ide_cd_mod snd_soc_sigmadsp_regmap snd_soc_sigmadsp intel_spi_platform intel_spi mtd spi_nor snd_usbmidi_lib usbcore lcd ti_ads7950 hi6421_regulator snd_soc_kbl_rt5663_max98927 snd_soc_hdac_hdmi snd_hda_ext_core snd_hda_core snd_soc_rt5663 snd_soc_core snd_pcm_dmaengine snd_compress snd_soc_rl6231 mac80211 rtc_rc5t583 spi_slave_time leds_pwm hid_gt683r hid industrialio_triggered_buffer kfifo_buf industrialio ir_kbd_i2c rc_core led_class_flash dwc_xlgmac snd_ymfpci gameport snd_mpu401_uart snd_rawmidi snd_ac97_codec snd_pcm ac97_bus snd_opl3_lib snd_timer snd_seq_device snd_hwdep snd soundcore iptable_security iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel hsr veth netdevsim vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon dummy team bonding vcan
 bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun joydev mousedev ppdev tpm kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ide_pci_generic piix aesni_intel aes_x86_64 crypto_simd cryptd glue_helper ide_core psmouse input_leds i2c_piix4 serio_raw intel_agp intel_gtt ata_generic agpgart pata_acpi parport_pc rtc_cmos parport floppy sch_fq_codel ip_tables x_tables sha1_ssse3 sha1_generic ipv6 [last unloaded: rxrpc]
Dumping ftrace buffer:
   (ftrace buffer empty)
CR2: fffffbfff830524b
---[ end trace 039ab24b305c4b19 ]---

If nr_proto_init failed, it may forget to call proto_unregister,
tiggering this issue.This patch rearrange code of nr_proto_init
to avoid such issues.

Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 13:59:49 -07:00
Florian Westphal
610a43149c netfilter: nf_nat_masquerade: unify ipv4/6 notifier registration
Only reason for having two different register functions was because of
ipt_MASQUERADE and ip6t_MASQUERADE being two different modules.

Previous patch merged those into xt_MASQUERADE, so we can merge this too.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-11 20:59:34 +02:00
Florian Westphal
bf8981a2aa netfilter: nf_nat: merge ip/ip6 masquerade headers
Both are now implemented by nf_nat_masquerade.c, so no need to keep
different headers.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-11 20:59:21 +02:00
Ursula Braun
07603b2308 net/smc: propagate file from SMC to TCP socket
fcntl(fd, F_SETOWN, getpid()) selects the recipient of SIGURG signals
that are delivered when out-of-band data arrives on socket fd.
If an SMC socket program makes use of such an fcntl() call, it fails
in case of fallback to TCP-mode. In case of fallback the traffic is
processed with the internal TCP socket. Propagating field "file" from the
SMC socket to the internal TCP socket fixes the issue.

Reviewed-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-11 11:04:08 -07:00
Jakub Kicinski
b4f47f3848 net/tls: prevent bad memory access in tls_is_sk_tx_device_offloaded()
Unlike '&&' operator, the '&' does not have short-circuit
evaluation semantics.  IOW both sides of the operator always
get evaluated.  Fix the wrong operator in
tls_is_sk_tx_device_offloaded(), which would lead to
out-of-bounds access for for non-full sockets.

Fixes: 4799ac81e5 ("tls: Add rx inline crypto offload")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 22:42:28 -07:00
Jakub Kicinski
35b71a34ad net/tls: don't leak partially sent record in device mode
David reports that tls triggers warnings related to
sk->sk_forward_alloc not being zero at destruction time:

WARNING: CPU: 5 PID: 6831 at net/core/stream.c:206 sk_stream_kill_queues+0x103/0x110
WARNING: CPU: 5 PID: 6831 at net/ipv4/af_inet.c:160 inet_sock_destruct+0x15b/0x170

When sender fills up the write buffer and dies from
SIGPIPE.  This is due to the device implementation
not cleaning up the partially_sent_record.

This is because commit a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
moved the partial record cleanup to the SW-only path.

Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Reported-by: David Beckett <david.beckett@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 13:07:02 -07:00
Paolo Abeni
73eb628ddf Revert: "net: sched: put back q.qlen into a single location"
This revert commit 46b1c18f9d ("net: sched: put back q.qlen into
a single location").
After the previous patch, when a NOLOCK qdisc is enslaved to a
locking qdisc it switches to global stats accounting. As a consequence,
when a classful qdisc accesses directly a child qdisc's qlen, such
qdisc is not doing per CPU accounting and qlen value is consistent.

In the control path nobody uses directly qlen since commit
e5f0e8f8e4 ("net: sched: introduce and use qdisc tree flush/purge
helpers"), so we can remove the contented atomic ops from the
datapath.

v1 -> v2:
 - complete the qdisc_qstats_atomic_qlen_dec() ->
   qdisc_qstats_cpu_qlen_dec() replacement, fix build issue
 - more descriptive commit message

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 12:20:46 -07:00
Paolo Abeni
8a53e616de net: sched: when clearing NOLOCK, clear TCQ_F_CPUSTATS, too
Since stats updating is always consistent with TCQ_F_CPUSTATS flag,
we can disable it at qdisc creation time flipping such bit.

In my experiments, if the NOLOCK flag is cleared, per CPU stats
accounting does not give any measurable performance gain, but it
waste some memory.

Let's clear TCQ_F_CPUSTATS together with NOLOCK, when enslaving
a NOLOCK qdisc to 'lock' one.

Use stats update helper inside pfifo_fast, to cope correctly with
TCQ_F_CPUSTATS flag change.

As a side effect, q.qlen value for any child qdiscs is always
consistent for all lock classfull qdiscs.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 12:20:46 -07:00
Paolo Abeni
9c01c9f1f2 net: sched: always do stats accounting according to TCQ_F_CPUSTATS
The core sched implementation checks independently for NOLOCK flag
to acquire/release the root spin lock and for qdisc_is_percpu_stats()
to account per CPU values in many places.

This change update the last few places checking the TCQ_F_NOLOCK to
do per CPU stats accounting according to qdisc_is_percpu_stats()
value.

The above allows to clean dev_requeue_skb() implementation a bit
and makes stats update always consistent with a single flag.

v1 -> v2:
 - do not move qdisc_is_empty definition, fix build issue

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 12:20:46 -07:00
Paolo Abeni
1f5e6fdd6a net: sched: prefer qdisc_is_empty() over direct qlen access
When checking for root qdisc queue length, do not access directly q.qlen.
In the following patches we will move back qlen accounting to per CPU
values for NOLOCK qdiscs.

Instead, prefer the qdisc_is_empty() helper usage.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-10 12:20:46 -07:00
David S. Miller
c03fd0171b Various fixes:
* iTXQ fixes from Felix
  * tracing fix - increase message length
  * fix SW_CRYPTO_CONTROL enforcement
  * WMM rule handling for regdomain intersection
  * max_interfaces in hwsim - reported by syzbot
  * clear private data in some more commands
  * a clang compiler warning fix
 
 I added a patch with two new (unused) macros for
 rate-limited printing to simplify getting the users
 into the tree.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlyshYoACgkQB8qZga/f
 l8Q1EQ//YO53GM06uvgdOlQkoF3D3uSLtPbJyxWcZCaJXjJ6jOOZsSrZZBLn7avf
 i80UziS34ep0zdRHyVmGe6Q/VqufWUnzd6KBNdINEDdepOw6jFlBe9Ue+7mGwBnl
 bqzJJ+rxQj1wVqMKEQMtznqCEX58ULiDKcPUPTG2Szw7moqMoTVaw7u2d3eX5mTS
 Ka/rdsA/V9v4VOkBi//JqYmlZjdq83el8r4rshI2YbNdxwljTchK5YDsJreDawzF
 X7pDjAou2EAskot2Pahz1qAEew0XUnSliTDRVTsgcsellnUHFsEGXytXQ7sp8Myy
 kpGuBdeBqJuPWwKTbHfxiQb1VudpBWuMB2x/rotszgzrqSoVWot5j6/UyDSvMPyC
 U/drsGJrN1iF/mYTn6G/r8sKPIaCxN7h02Z75IV7LBokGmz9Yt7srEjrOKt0uuK+
 l5nb/YQNcN37lkREa6zLzec0f3wDzoqw1GWL0IuQyb+cFC2+6q5iFQ8dyXnu3FdQ
 1cSJGJdnyG3bM0b42WrOcEOFt3EKmhb4UeuusoxkOd4aKuhBXRjVEzM+X6ADqTdR
 mWvk7bEfYD0UMsqhtvrYBdBfquPuy3S4QzA0IMzxjk8bhCug233KpOPcLgVFjlzh
 g7wCgfahl0TRhH7R1D3hnRUtcdsjifoa75XMAaDnWfnR9PiN6uI=
 =2EV9
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2019-04-09' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Various fixes:
 * iTXQ fixes from Felix
 * tracing fix - increase message length
 * fix SW_CRYPTO_CONTROL enforcement
 * WMM rule handling for regdomain intersection
 * max_interfaces in hwsim - reported by syzbot
 * clear private data in some more commands
 * a clang compiler warning fix

I added a patch with two new (unused) macros for
rate-limited printing to simplify getting the users
into the tree.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-09 10:57:10 -07:00
David S. Miller
310655b07a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-04-08 23:39:36 -07:00
David Ahern
d15662682d ipv4: Allow ipv6 gateway with ipv4 routes
Add support for RTA_VIA and allow an IPv6 nexthop for v4 routes:
   $ ip ro add 172.16.1.0/24 via inet6 2001:db8::1 dev eth0
   $ ip ro ls
   ...
   172.16.1.0/24 via inet6 2001:db8::1 dev eth0

For convenience and simplicity, userspace can use RTA_VIA to specify
AF_INET or AF_INET6 gateway.

The common fib_nexthop_info dump function compares the gateway address
family to the nh_common family to know if the gateway should be encoded
as RTA_VIA or RTA_GATEWAY.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:41 -07:00
David Ahern
19a9d136f1 ipv4: Flag fib_info with a fib_nh using IPv6 gateway
Until support is added to the offload drivers, they need to be able to
reject routes with an IPv6 gateway. To that end add a flag to fib_info
that indicates if any fib_nh has a v6 gateway. The flag allows the drivers
to efficiently know the use of a v6 gateway without walking all fib_nh
tied to a fib_info each time a route is added.

Update mlxsw and rocker to reject the routes with extack message as to why.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:41 -07:00
David Ahern
5c9f7c1dfc ipv4: Add helpers for neigh lookup for nexthop
A common theme in the output path is looking up a neigh entry for a
nexthop, either the gateway in an rtable or a fallback to the daddr
in the skb:

        nexthop = (__force u32)rt_nexthop(rt, ip_hdr(skb)->daddr);
        neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
        if (unlikely(!neigh))
                neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);

To allow the nexthop to be an IPv6 address we need to consider the
family of the nexthop and then call __ipv{4,6}_neigh_lookup_noref based
on it.

To make this simpler, add a ip_neigh_gw4 helper similar to ip_neigh_gw6
added in an earlier patch which handles:

        neigh = __ipv4_neigh_lookup_noref(dev, nexthop);
        if (unlikely(!neigh))
                neigh = __neigh_create(&arp_tbl, &nexthop, dev, false);

And then add a second one, ip_neigh_for_gw, that calls either
ip_neigh_gw4 or ip_neigh_gw6 based on the address family of the gateway.

Update the output paths in the VRF driver and core v4 code to use
ip_neigh_for_gw simplifying the family based lookup and making both
ready for a v6 nexthop.

ipv4_neigh_lookup has a different need - the potential to resolve a
passed in address in addition to any gateway in the rtable or skb. Since
this is a one-off, add ip_neigh_gw4 and ip_neigh_gw6 diectly. The
difference between __neigh_create used by the helpers and neigh_create
called by ipv4_neigh_lookup is taking a refcount, so add rcu_read_lock_bh
and bump the refcnt on the neigh entry.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:41 -07:00
David Ahern
0353f28231 neighbor: Add skip_cache argument to neigh_output
A later patch allows an IPv6 gateway with an IPv4 route. The neighbor
entry will exist in the v6 ndisc table and the cached header will contain
the ipv6 protocol which is wrong for an IPv4 packet. For an IPv4 packet to
use the v6 neighbor entry, neigh_output needs to skip the cached header
and just use the output callback for the neigh entry.

A future patchset can look at expanding the hh_cache to handle 2
protocols. For now, IPv6 gateways with an IPv4 route will take the
extra overhead of generating the header.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:41 -07:00
David Ahern
a4ea5d43c8 ipv4: Add support to fib_config for IPv6 gateway
Add support for an IPv6 gateway to fib_config. Since a gateway is either
IPv4 or IPv6, make it a union with fc_gw4 where fc_gw_family decides
which address is in use. Update current checks on family and gw4 to
handle ipv6 as well.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:40 -07:00
David Ahern
0f5f7d7bf6 ipv4: Add support to rtable for ipv6 gateway
Add support for an IPv6 gateway to rtable. Since a gateway is either
IPv4 or IPv6, make it a union with rt_gw4 where rt_gw_family decides
which address is in use.

When dumping the route data, encode an ipv6 nexthop using RTA_VIA.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:40 -07:00
David Ahern
f35b794b3b ipv4: Prepare fib_config for IPv6 gateway
Similar to rtable, fib_config needs to allow the gateway to be either an
IPv4 or an IPv6 address. To that end, rename fc_gw to fc_gw4 to mean an
IPv4 address and add fc_gw_family. Checks on 'is a gateway set' are changed
to see if fc_gw_family is set. In the process prepare the code for a
fc_gw_family == AF_INET6.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:40 -07:00
David Ahern
1550c17193 ipv4: Prepare rtable for IPv6 gateway
To allow the gateway to be either an IPv4 or IPv6 address, remove
rt_uses_gateway from rtable and replace with rt_gw_family. If
rt_gw_family is set it implies rt_uses_gateway. Rename rt_gateway
to rt_gw4 to represent the IPv4 version.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:40 -07:00
David Ahern
bdf0046771 net: Replace nhc_has_gw with nhc_gw_family
Allow the gateway in a fib_nh_common to be from a different address
family than the outer fib{6}_nh. To that end, replace nhc_has_gw with
nhc_gw_family and update users of nhc_has_gw to check nhc_gw_family.
Now nhc_family is used to know if the nh_common is part of a fib_nh
or fib6_nh (used for container_of to get to route family specific data),
and nhc_gw_family represents the address family for the gateway.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:40 -07:00
David Ahern
71df5777aa ipv6: Add neighbor helpers that use the ipv6 stub
Add ipv6 helpers to handle ndisc references via the stub. Update
bpf_ipv6_fib_lookup to use __ipv6_neigh_lookup_noref_stub instead of
the open code ___neigh_lookup_noref with the stub.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:40 -07:00
David Ahern
1aefd3de7b ipv6: Add fib6_nh_init and release to stubs
Add fib6_nh_init and fib6_nh_release to ipv6_stubs. If fib6_nh_init fails,
callers should not invoke fib6_nh_release, so there is no reason to have
a dummy stub for the IPv6 is not enabled case.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 15:22:40 -07:00
Florian Westphal
3b0a081db1 netfilter: make two functions static
They have no external callers anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:28:33 +02:00
Florian Westphal
4806e97572 netfilter: replace NF_NAT_NEEDED with IS_ENABLED(CONFIG_NF_NAT)
NF_NAT_NEEDED is true whenever nat support for either ipv4 or ipv6 is
enabled.  Now that the af-specific nat configuration switches have been
removed, IS_ENABLED(CONFIG_NF_NAT) has the same effect.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:02:52 +02:00
Florian Westphal
c1deb065cf netfilter: nf_tables: merge route type into core
very little code, so it really doesn't make sense to have extra
modules or even a kconfig knob for this.

Merge them and make functionality available unconditionally.
The merge makes inet family route support trivial, so add it
as well here.

Before:
   text	   data	    bss	    dec	    hex	filename
    835	    832	      0	   1667	    683 nft_chain_route_ipv4.ko
    870	    832	      0	   1702	    6a6	nft_chain_route_ipv6.ko
 111568	   2556	    529	 114653	  1bfdd	nf_tables.ko

After:
   text	   data	    bss	    dec	    hex	filename
 113133	   2556	    529	 116218	  1c5fa	nf_tables.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:01:42 +02:00
Florian Westphal
d164385ec5 netfilter: nat: add inet family nat support
We need minimal support from the nat core for this, as we do not
want to register additional base hooks.

When an inet hook is registered, interally register ipv4 and ipv6
hooks for them and unregister those when inet hooks are removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 23:01:39 +02:00
Jacky Hu
84c0d5e96f ipvs: allow tunneling with gue encapsulation
ipip packets are blocked in some public cloud environments, this patch
allows gue encapsulation with the tunneling method, which would make
tunneling working in those environments.

Signed-off-by: Jacky Hu <hengqing.hu@gmail.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-04-08 22:57:59 +02:00
Paolo Abeni
fd69c399c7 datagram: remove rendundant 'peeked' argument
After commit a297569fe0 ("net/udp: do not touch skb->peeked unless
really needed") the 'peeked' argument of __skb_try_recv_datagram()
and friends is always equal to !!'flags & MSG_PEEK'.

Since such argument is really a boolean info, and the callers have
already 'flags & MSG_PEEK' handy, we can remove it and clean-up the
code a bit.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-08 09:51:54 -07:00
Sunil Dutt
cb74e97758 cfg80211/nl80211: Offload OWE processing to user space in AP mode
This interface allows the host driver to offload OWE processing
to user space. This intends to support OWE (Opportunistic Wireless
Encryption) AKM by the drivers that implement SME but rely on the
user space for the cryptographic/OWE processing in AP mode. Such
drivers are not capable of processing/deriving the DH IE.

A new NL80211 command - NL80211_CMD_UPDATE_OWE_INFO is introduced
to send the request/event between the host driver and user space.

Driver shall provide the OWE info (MAC address and DH IE) of
the peer to user space for cryptographic processing of the DH IE
through the event. Accordingly, the user space shall update the
OWE info/DH IE to the driver.

Following is the sequence in AP mode for OWE authentication.

Driver passes the OWE info obtained from the peer in the
Association Request to the user space through the event
cfg80211_update_owe_info_event. User space shall process the
OWE info received and generate new OWE info. This OWE info is
passed to the driver through NL80211_CMD_UPDATE_OWE_INFO
request. Driver eventually uses this OWE info to send the
Association Response to the peer.

This OWE info in the command interface carries the IEs that include
PMKID of the peer if the PMKSA is still valid or an updated DH IE
for generating a new PMKSA with the peer.

Signed-off-by: Liangwei Dong <liangwei@codeaurora.org>
Signed-off-by: Sunil Dutt <usdutt@codeaurora.org>
Signed-off-by: Srinivas Dasari <dasaris@codeaurora.org>
[remove policy initialization - no longer exists]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-08 14:32:23 +02:00
Narayanraddi Masti
ab60633c71 mac80211: Add support for NL80211_STA_INFO_AIRTIME_LINK_METRIC
Add support for mesh airtime link metric attribute
NL80211_STA_INFO_AIRTIME_LINK_METRIC.

Signed-off-by: Narayanraddi Masti <team.nmasti@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-08 13:58:10 +02:00
vamsi krishna
1e1b11b6a1 nl80211/cfg80211: Specify band specific min RSSI thresholds with sched scan
This commit adds the support to specify the RSSI thresholds per
band for each match set. This enhances the current behavior which
specifies a single rssi_threshold across all the bands by
introducing the rssi_threshold_per_band. These per band rssi
thresholds are referred through NL80211_BAND_* (enum nl80211_band)
variables  as attribute types. Such attributes/values per each
band are nested through NL80211_ATTR_SCHED_SCAN_MIN_RSSI.
These band specific rssi thresholds shall take precedence over
the current rssi_thold per match set.
Drivers indicate this support through
%NL80211_EXT_FEATURE_SCHED_SCAN_BAND_SPECIFIC_RSSI_THOLD.
These per band rssi attributes/values does not specify
"default RSSI filter" as done by
NL80211_SCHED_SCAN_MATCH_ATTR_RSSI to stay backward compatible.
That said, these per band rssi values have to be specified for
the corresponding matchset.

Signed-off-by: vamsi krishna <vamsin@codeaurora.org>
Signed-off-by: Srinivas Dasari <dasaris@codeaurora.org>
[rebase on refactoring, add policy]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-08 13:44:40 +02:00
Felix Fietkau
2b4a669802 mac80211: make ieee80211_schedule_txq schedule empty TXQs
Currently there is no way for the driver to signal to mac80211 that it should
schedule a TXQ even if there are no packets on the mac80211 part of that queue.
This is problematic if the driver has an internal retry queue to deal with
software A-MPDU retry.

This patch changes the behavior of ieee80211_schedule_txq to always schedule
the queue, as its only user (ath9k) seems to expect such behavior already:
it calls this function on tx status and on powersave wakeup whenever its
internal retry queue is not empty.

Also add an extra argument to ieee80211_return_txq to get the same behavior.

This fixes an issue on ath9k where tx queues with packets to retry (and no
new packets in mac80211) would not get serviced.

Fixes: 89cea7493a ("ath9k: Switch to mac80211 TXQ scheduling and airtime APIs")
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-08 13:31:31 +02:00
Stanislaw Gruszka
a58d7525b8 cfg80211: add ratelimited variants of err and warn
wiphy_{err,warn}_ratelimited will be used by rt2x00

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-04-08 13:31:24 +02:00
Florian Westphal
c9500d7b7d xfrm: store xfrm_mode directly, not its address
This structure is now only 4 bytes, so its more efficient
to cache a copy rather than its address.

No significant size difference in allmodconfig vmlinux.

With non-modular kernel that has all XFRM options enabled, this
series reduces vmlinux image size by ~11kb. All xfrm_mode
indirections are gone and all modes are built-in.

before (ipsec-next master):
    text      data      bss         dec   filename
21071494   7233140 11104324    39408958   vmlinux.master

after this series:
21066448   7226772 11104324    39397544   vmlinux.patched

With allmodconfig kernel, the size increase is only 362 bytes,
even all the xfrm config options removed in this series are
modular.

before:
    text      data     bss      dec   filename
15731286   6936912 4046908 26715106   vmlinux.master

after this series:
15731492   6937068  4046908  26715468 vmlinux

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:15:28 +02:00
Florian Westphal
4c145dce26 xfrm: make xfrm modes builtin
after previous changes, xfrm_mode contains no function pointers anymore
and all modules defining such struct contain no code except an init/exit
functions to register the xfrm_mode struct with the xfrm core.

Just place the xfrm modes core and remove the modules,
the run-time xfrm_mode register/unregister functionality is removed.

Before:

    text    data     bss      dec filename
    7523     200    2364    10087 net/xfrm/xfrm_input.o
   40003     628     440    41071 net/xfrm/xfrm_state.o
15730338 6937080 4046908 26714326 vmlinux

    7389     200    2364    9953  net/xfrm/xfrm_input.o
   40574     656     440   41670  net/xfrm/xfrm_state.o
15730084 6937068 4046908 26714060 vmlinux

The xfrm*_mode_{transport,tunnel,beet} modules are gone.

v2: replace CONFIG_INET6_XFRM_MODE_* IS_ENABLED guards with CONFIG_IPV6
    ones rather than removing them.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:15:17 +02:00
Florian Westphal
733a5fac2f xfrm: remove afinfo pointer from xfrm_mode
Adds an EXPORT_SYMBOL for afinfo_get_rcu, as it will now be called from
ipv6 in case of CONFIG_IPV6=m.

This change has virtually no effect on vmlinux size, but it reduces
afinfo size and allows followup patch to make xfrm modes const.

v2: mark if (afinfo) tests as likely (Sabrina)
    re-fetch afinfo according to inner_mode in xfrm_prepare_input().

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:15:09 +02:00
Florian Westphal
1de7083006 xfrm: remove output2 indirection from xfrm_mode
similar to previous patch: no external module dependencies,
so we can avoid the indirection by placing this in the core.

This change removes the last indirection from xfrm_mode and the
xfrm4|6_mode_{beet,tunnel}.c modules contain (almost) no code anymore.

Before:
   text    data     bss     dec     hex filename
   3957     136       0    4093     ffd net/xfrm/xfrm_output.o
    587      44       0     631     277 net/ipv4/xfrm4_mode_beet.o
    649      32       0     681     2a9 net/ipv4/xfrm4_mode_tunnel.o
    625      44       0     669     29d net/ipv6/xfrm6_mode_beet.o
    599      32       0     631     277 net/ipv6/xfrm6_mode_tunnel.o
After:
   text    data     bss     dec     hex filename
   5359     184       0    5543    15a7 net/xfrm/xfrm_output.o
    171      24       0     195      c3 net/ipv4/xfrm4_mode_beet.o
    171      24       0     195      c3 net/ipv4/xfrm4_mode_tunnel.o
    172      24       0     196      c4 net/ipv6/xfrm6_mode_beet.o
    172      24       0     196      c4 net/ipv6/xfrm6_mode_tunnel.o

v2: fold the *encap_add functions into xfrm*_prepare_output
    preserve (move) output2 comment (Sabrina)
    use x->outer_mode->encap, not inner
    fix a build breakage on ppc (kbuild robot)

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:15:02 +02:00
Florian Westphal
b3284df1c8 xfrm: remove input2 indirection from xfrm_mode
No external dependencies on any module, place this in the core.
Increase is about 1800 byte for xfrm_input.o.

The beet helpers get added to internal header, as they can be reused
from xfrm_output.c in the next patch (kernel contains several
copies of them in the xfrm{4,6}_mode_beet.c files).

Before:
   text    data     bss     dec filename
   5578     176    2364    8118 net/xfrm/xfrm_input.o
   1180      64       0    1244 net/ipv4/xfrm4_mode_beet.o
    171      40       0     211 net/ipv4/xfrm4_mode_transport.o
   1163      40       0    1203 net/ipv4/xfrm4_mode_tunnel.o
   1083      52       0    1135 net/ipv6/xfrm6_mode_beet.o
    172      40       0     212 net/ipv6/xfrm6_mode_ro.o
    172      40       0     212 net/ipv6/xfrm6_mode_transport.o
   1056      40       0    1096 net/ipv6/xfrm6_mode_tunnel.o

After:
   text    data     bss     dec filename
   7373     200    2364    9937 net/xfrm/xfrm_input.o
    587      44       0     631 net/ipv4/xfrm4_mode_beet.o
    171      32       0     203 net/ipv4/xfrm4_mode_transport.o
    649      32       0     681 net/ipv4/xfrm4_mode_tunnel.o
    625      44       0     669 net/ipv6/xfrm6_mode_beet.o
    172      32       0     204 net/ipv6/xfrm6_mode_ro.o
    172      32       0     204 net/ipv6/xfrm6_mode_transport.o
    599      32       0     631 net/ipv6/xfrm6_mode_tunnel.o

v2: pass inner_mode to xfrm_inner_mode_encap_remove to fix
    AF_UNSPEC selector breakage (bisected by Benedict Wong)

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:14:55 +02:00
Florian Westphal
7613b92b1a xfrm: remove gso_segment indirection from xfrm_mode
These functions are small and we only have versions for tunnel
and transport mode for ipv4 and ipv6 respectively.

Just place the 'transport or tunnel' conditional in the protocol
specific function instead of using an indirection.

Before:
    3226       12       0     3238   net/ipv4/esp4_offload.o
    7004      492       0     7496   net/ipv4/ip_vti.o
    3339       12       0     3351   net/ipv6/esp6_offload.o
   11294      460       0    11754   net/ipv6/ip6_vti.o
    1180       72       0     1252   net/ipv4/xfrm4_mode_beet.o
     428       48       0      476   net/ipv4/xfrm4_mode_transport.o
    1271       48       0     1319   net/ipv4/xfrm4_mode_tunnel.o
    1083       60       0     1143   net/ipv6/xfrm6_mode_beet.o
     172       48       0      220   net/ipv6/xfrm6_mode_ro.o
     429       48       0      477   net/ipv6/xfrm6_mode_transport.o
    1164       48       0     1212   net/ipv6/xfrm6_mode_tunnel.o
15730428  6937008 4046908 26714344   vmlinux

After:
    3461       12       0     3473   net/ipv4/esp4_offload.o
    7000      492       0     7492   net/ipv4/ip_vti.o
    3574       12       0     3586   net/ipv6/esp6_offload.o
   11295      460       0    11755   net/ipv6/ip6_vti.o
    1180       64       0     1244   net/ipv4/xfrm4_mode_beet.o
     171       40       0      211   net/ipv4/xfrm4_mode_transport.o
    1163       40       0     1203   net/ipv4/xfrm4_mode_tunnel.o
    1083       52       0     1135   net/ipv6/xfrm6_mode_beet.o
     172       40       0      212   net/ipv6/xfrm6_mode_ro.o
     172       40       0      212   net/ipv6/xfrm6_mode_transport.o
    1056       40       0     1096   net/ipv6/xfrm6_mode_tunnel.o
15730424  6937008 4046908 26714340   vmlinux

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:14:47 +02:00
Florian Westphal
303c5fab12 xfrm: remove xmit indirection from xfrm_mode
There are only two versions (tunnel and transport). The ip/ipv6 versions
are only differ in sizeof(iphdr) vs ipv6hdr.

Place this in the core and use x->outer_mode->encap type to call the
correct adjustment helper.

Before:
   text   data    bss     dec      filename
15730311  6937008 4046908 26714227 vmlinux

After:
15730428  6937008 4046908 26714344 vmlinux

(about 117 byte increase)

v2: use family from x->outer_mode, not inner

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:14:34 +02:00
Florian Westphal
0c620e97b3 xfrm: remove output indirection from xfrm_mode
Same is input indirection.  Only exception: we need to export
xfrm_outer_mode_output for pktgen.

Increases size of vmlinux by about 163 byte:
Before:
   text    data     bss     dec      filename
15730208  6936948 4046908 26714064   vmlinux

After:
15730311  6937008 4046908 26714227   vmlinux

xfrm_inner_extract_output has no more external callers, make it static.

v2: add IS_ENABLED(IPV6) guard in xfrm6_prepare_output
    add two missing breaks in xfrm_outer_mode_output (Sabrina Dubroca)
    add WARN_ON_ONCE for 'call AF_INET6 related output function, but
    CONFIG_IPV6=n' case.
    make xfrm_inner_extract_output static

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:14:28 +02:00
Florian Westphal
c2d305e510 xfrm: remove input indirection from xfrm_mode
No need for any indirection or abstraction here, both functions
are pretty much the same and quite small, they also have no external
dependencies.

xfrm_prepare_input can then be made static.

With allmodconfig build, size increase of vmlinux is 25 byte:

Before:
   text   data     bss     dec      filename
15730207  6936924 4046908 26714039  vmlinux

After:
15730208  6936948 4046908 26714064 vmlinux

v2: Fix INET_XFRM_MODE_TRANSPORT name in is-enabled test (Sabrina Dubroca)
    change copied comment to refer to transport and network header,
    not skb->{h,nh}, which don't exist anymore. (Sabrina)
    make xfrm_prepare_input static (Eyal Birger)

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:14:21 +02:00
Florian Westphal
b262a69582 xfrm: place af number into xfrm_mode struct
This will be useful to know if we're supposed to decode ipv4 or ipv6.

While at it, make the unregister function return void, all module_exit
functions did just BUG(); there is never a point in doing error checks
if there is no way to handle such error.

Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-04-08 09:13:46 +02:00
Dan Carpenter
6491d69839 nfc: nci: Potential off by one in ->pipes[] array
This is similar to commit e285d5bfb7 ("NFC: Fix the number of pipes")
where we changed NFC_HCI_MAX_PIPES from 127 to 128.

As the comment next to the define explains, the pipe identifier is 7
bits long.  The highest possible pipe is 127, but the number of possible
pipes is 128.  As the code is now, then there is potential for an
out of bounds array access:

    net/nfc/nci/hci.c:297 nci_hci_cmd_received() warn: array off by one?
    'ndev->hci_dev->pipes[pipe]' '0-127 == 127'

Fixes: 11f54f2286 ("NFC: nci: Add HCI over NCI protocol support")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-06 15:05:07 -07:00
David S. Miller
f83f715195 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Minor comment merge conflict in mlx5.

Staging driver has a fixup due to the skb->xmit_more changes
in 'net-next', but was removed in 'net'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-05 14:14:19 -07:00
Jiri Pirko
7e1146e8c1 net: devlink: introduce devlink_compat_switch_id_get() helper
Introduce devlink_compat_switch_id_get() helper which fills up switch_id
according to passed netdev pointer. Call it directly from
dev_get_port_parent_id() as a fallback when ndo_get_port_parent_id
is not defined for given netdev.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 17:42:36 -07:00
Jiri Pirko
bec5267cde net: devlink: extend port attrs for switch ID
Extend devlink_port_attrs_set() to pass switch ID for ports which are
part of switch and store it in port attrs. For other ports, this is
NULL.

Note that this allows the driver to group devlink ports into one or more
switches according to the actual topology.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 17:42:36 -07:00
Jiri Pirko
407dd706fb net: devlink: convert devlink_port_attrs bools to bits
In order to save space in the struct, convert bools to bits.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-04 17:42:36 -07:00
David Ahern
c0a720770c ipv6: Flip to fib_nexthop_info
Export fib_nexthop_info and fib_add_nexthop for use by IPv6 code.
Remove rt6_nexthop_info and rt6_add_nexthop in favor of the IPv4
versions. Update fib_nexthop_info for IPv6 linkdown check and
RTA_GATEWAY for AF_INET6.

Signed-off-by: David Ahern <dsahern@gmail.com>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03 21:50:20 -07:00
David Ahern
eba618abac ipv4: Add fib_nh_common to fib_result
Most of the ipv4 code only needs data from fib_nh_common. Add
fib_nh_common selection to fib_result and update users to use it.

Right now, fib_nh_common in fib_result will point to a fib_nh struct
that is embedded within a fib_info:

        fib_info  --> fib_nh
                      fib_nh
                      ...
                      fib_nh
                        ^
    fib_result->nhc ----+

Later, nhc can point to a fib_nh within a nexthop struct:

        fib_info --> nexthop --> fib_nh
                                   ^
    fib_result->nhc ---------------+

or for a nexthop group:

        fib_info --> nexthop --> nexthop --> fib_nh
                                 nexthop --> fib_nh
                                 ...
                                 nexthop --> fib_nh
                                               ^
    fib_result->nhc ---------------------------+

In all cases nhsel within fib_result will point to which leg in the
multipath route is used.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-03 21:50:20 -07:00
Xiaofei Shen
a2c7023f70 net: dsa: read mac address from DT for slave device
Before creating a slave netdevice, get the mac address from DTS and
apply in case it is valid.

Signed-off-by: Xiaofei Shen <xiaofeis@codeaurora.org>
Signed-off-by: Vinod Koul <vkoul@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 14:57:32 -07:00
Paolo Abeni
e5f0e8f8e4 net: sched: introduce and use qdisc tree flush/purge helpers
The same code to flush qdisc tree and purge the qdisc queue
is duplicated in many places and in most cases it does not
respect NOLOCK qdisc: the global backlog len is used and the
per CPU values are ignored.

This change addresses the above, factoring-out the relevant
code and using the helpers introduced by the previous patch
to fetch the correct backlog len.

Fixes: c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 14:50:13 -07:00
Paolo Abeni
5dd431b6b9 net: sched: introduce and use qstats read helpers
Classful qdiscs can't access directly the child qdiscs backlog
length: if such qdisc is NOLOCK, per CPU values should be
accounted instead.

Most qdiscs no not respect the above. As a result, qstats fetching
for most classful qdisc is currently incorrect: if the child qdisc is
NOLOCK, it always reports 0 len backlog.

This change introduces a pair of helpers to safely fetch
both backlog and qlen and use them in stats class dumping
functions, fixing the above issue and cleaning a bit the code.

DRR needs also to access the child qdisc queue length, so it
needs custom handling.

Fixes: c5ad119fb6 ("net: sched: pfifo_fast use skb_array")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 14:50:13 -07:00
Stephen Suryaputra
8c83f2df9c vrf: check accept_source_route on the original netdevice
Configuration check to accept source route IP options should be made on
the incoming netdevice when the skb->dev is an l3mdev master. The route
lookup for the source route next hop also needs the incoming netdev.

v2->v3:
- Simplify by passing the original netdevice down the stack (per David
  Ahern).

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-04-01 10:44:58 -07:00
David Ahern
3616d08bcb ipv6: Move ipv6 stubs to a separate header file
The number of stubs is growing and has nothing to do with addrconf.
Move the definition of the stubs to a separate header file and update
users. In the move, drop the vxlan specific comment before ipv6_stub.

Code move only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:53:45 -07:00
David Ahern
979e276ebe net: Use common nexthop init and release helpers
With fib_nh_common in place, move common initialization and release
code into helpers used by both ipv4 and ipv6. For the moment, the init
is just the lwt encap and the release is both the netdev reference and
the the lwt state reference. More will be added later.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:04 -07:00
David Ahern
f1741730dd net: Add fib_nh_common and update fib_nh and fib6_nh
Add fib_nh_common struct with common nexthop attributes. Convert
fib_nh and fib6_nh to use it. Use macros to move existing
fib_nh_* references to the new nh_common.nhc_*.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:04 -07:00
David Ahern
ad1601ae02 ipv6: Rename fib6_nh entries
Rename fib6_nh entries that will be moved to a fib_nh_common struct.
Specifically, the device, gateway, flags, and lwtstate are common
with all nexthop definitions. In some places new temporary variables
are declared or local variables renamed to maintain line lengths.

Rename only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:04 -07:00
David Ahern
b75ed8b1aa ipv4: Rename fib_nh entries
Rename fib_nh entries that will be moved to a fib_nh_common struct.
Specifically, the device, oif, gateway, flags, scope, lwtstate,
nh_weight and nh_upper_bound are common with all nexthop definitions.
In the process shorten fib_nh_lwtstate to fib_nh_lws to avoid really
long lines.

Rename only; no functional change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:04 -07:00
David Ahern
6d3d07b45c ipv6: Refactor fib6_ignore_linkdown
fib6_ignore_linkdown takes a fib6_info but only looks at the net_device
and its IPv6 config. Change it to take a net_device over a fib6_info as
its input argument.

In addition, move it to a header file to make the check inline and usable
later with IPv4 code without going through the ipv6 stub, and rename to
ip6_ignore_linkdown since it is only checking the setting based on the
ipv6 struct on a device.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:04 -07:00
David Ahern
2b2450ca4a ipv6: Move gateway checks to a fib6_nh setting
The gateway setting is not per fib6_info entry but per-fib6_nh. Add a new
fib_nh_has_gw flag to fib6_nh and convert references to RTF_GATEWAY to
the new flag. For IPv6 address the flag is cheaper than checking that
nh_gw is non-0 like IPv4 does.

While this increases fib6_nh by 8-bytes, the effective allocation size of
a fib6_info is unchanged. The 8 bytes is recovered later with a
fib_nh_common change.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
David Ahern
dac7d0f270 ipv6: Create cleanup helper for fib6_nh
Move the fib6_nh cleanup code to a new helper, fib6_nh_release.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
David Ahern
83c4425159 ipv6: Create init helper for fib6_nh
Similar to IPv4, consolidate the fib6_nh initialization into a helper.
As a new standalone function, add a cleanup path to put lwtstate on
error.

To avoid modifying fib6_config flags, move the reject check to a helper
that is invoked once by fib6_nh_init to reset the device and then
again in ip6_route_info_create to set the fib6_flags.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
David Ahern
faa041a40b ipv4: Create cleanup helper for fib_nh
Move the fib_nh cleanup code from free_fib_info_rcu into a new helper,
fib_nh_release. Move classid accounting into fib_nh_release which is
called per fib_nh to make accounting symmetrical with fib_nh_init.
Export the helper to allow for use with nexthop objects in the
future.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
David Ahern
e4516ef654 ipv4: Create init helper for fib_nh
Consolidate the fib_nh initialization which is duplicated between
fib_create_info for single path and fib_get_nhs for multipath.
Export the helper to allow for use with nexthop objects in the
future.

Signed-off-by: David Ahern <dsahern@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-29 10:48:03 -07:00
Felix Fietkau
5b989c18da mac80211: rework locking for txq scheduling / airtime fairness
Holding the lock around the entire duration of tx scheduling can create
some nasty lock contention, especially when processing airtime information
from the tx status or the rx path.
Improve locking by only holding the active_txq_lock for lookups / scheduling
list modifications.

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Acked-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-03-29 11:27:44 +01:00
Eric Dumazet
355b985537 netns: provide pure entropy for net_hash_mix()
net_hash_mix() currently uses kernel address of a struct net,
and is used in many places that could be used to reveal this
address to a patient attacker, thus defeating KASLR, for
the typical case (initial net namespace, &init_net is
not dynamically allocated)

I believe the original implementation tried to avoid spending
too many cycles in this function, but security comes first.

Also provide entropy regardless of CONFIG_NET_NS.

Fixes: 0b4419162a ("netns: introduce the net_hash_mix "salt" for hashes")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reported-by: Benny Pinkas <benny@pinkas.net>
Cc: Pavel Emelyanov <xemul@openvz.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 17:00:45 -07:00
Yi-Hung Wei
717700d183 netfilter: Export nf_ct_{set,destroy}_timeout()
This patch exports nf_ct_set_timeout() and nf_ct_destroy_timeout().
The two functions are derived from xt_ct_destroy_timeout() and
xt_ct_set_timeout() in xt_CT.c, and moved to nf_conntrack_timeout.c
without any functional change.
It would be useful for other users (i.e. OVS) that utilizes the
finer-grain conntrack timeout feature.

CC: Pablo Neira Ayuso <pablo@netfilter.org>
CC: Pravin Shelar <pshelar@ovn.org>
Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 16:53:29 -07:00
Jiri Pirko
14c03ac4c1 net: devlink: remove unused devlink_port_get_phys_port_name() function
Now it is unused, remove it.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 12:55:30 -07:00
Jiri Pirko
af3836df9a net: devlink: introduce devlink_compat_phys_port_name_get()
Introduce devlink_compat_phys_port_name_get() helper that
gets the physical port name for specified netdevice
according to devlink port attributes.
Call this helper from dev_get_phys_port_name()
in case ndo_get_phys_port_name is not defined.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 12:55:30 -07:00
Jiri Pirko
5dc37bb9b0 net: replace ndo_get_devlink with ndo_get_devlink_port
Follow-up patch is going to need a devlink port instance according to
a netdev. Devlink port instance should be always available when devlink
is used. So change the recently introduced ndo_get_devlink to
ndo_get_devlink_port. With that, adjust the wrapper for the only
user to get devlink pointer.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-28 12:55:30 -07:00
David S. Miller
356d71e00d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-03-27 17:37:58 -07:00
Eric Dumazet
df453700e8 inet: switch IP ID generator to siphash
According to Amit Klein and Benny Pinkas, IP ID generation is too weak
and might be used by attackers.

Even with recent net_hash_mix() fix (netns: provide pure entropy for net_hash_mix())
having 64bit key and Jenkins hash is risky.

It is time to switch to siphash and its 128bit keys.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Amit Klein <aksecurity@gmail.com>
Reported-by: Benny Pinkas <benny@pinkas.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-27 14:29:26 -07:00
Eric Dumazet
4f661542a4 tcp: fix zerocopy and notsent_lowat issues
My recent patch had at least three problems :

1) TX zerocopy wants notification when skb is acknowledged,
   thus we need to call skb_zcopy_clear() if the skb is
   cached into sk->sk_tx_skb_cache

2) Some applications might expect precise EPOLLOUT
   notifications, so we need to update sk->sk_wmem_queued
   and call sk_mem_uncharge() from sk_wmem_free_skb()
   in all cases. The SOCK_QUEUE_SHRUNK flag must also be set.

3) Reuse of saved skb should have used skb_cloned() instead
  of simply checking if the fast clone has been freed.

Fixes: 472c2e07ee ("tcp: add one skb cache for tx")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Willem de Bruijn <willemb@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Tested-by: Holger Hoffstätte <holger@applied-asynchrony.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-27 13:59:02 -07:00
Martin Willi
025c65e119 xfrm: Honor original L3 slave device in xfrmi policy lookup
If an xfrmi is associated to a vrf layer 3 master device,
xfrm_policy_check() fails after traffic decapsulation. The input
interface is replaced by the layer 3 master device, and hence
xfrmi_decode_session() can't match the xfrmi anymore to satisfy
policy checking.

Extend ingress xfrmi lookup to honor the original layer 3 slave
device, allowing xfrm interfaces to operate within a vrf domain.

Fixes: f203b76d78 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Martin Willi <martin@strongswan.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-03-27 16:14:05 +01:00
Cong Wang
dbb2483b2a xfrm: clean up xfrm protocol checks
In commit 6a53b75932 ("xfrm: check id proto in validate_tmpl()")
I introduced a check for xfrm protocol, but according to Herbert
IPSEC_PROTO_ANY should only be used as a wildcard for lookup, so
it should be removed from validate_tmpl().

And, IPSEC_PROTO_ANY is expected to only match 3 IPSec-specific
protocols, this is why xfrm_state_flush() could still miss
IPPROTO_ROUTING, which leads that those entries are left in
net->xfrm.state_all before exit net. Fix this by replacing
IPSEC_PROTO_ANY with zero.

This patch also extracts the check from validate_tmpl() to
xfrm_id_proto_valid() and uses it in parse_ipsecrequest().
With this, no other protocols should be added into xfrm.

Fixes: 6a53b75932 ("xfrm: check id proto in validate_tmpl()")
Reported-by: syzbot+0bf0519d6e0de15914fe@syzkaller.appspotmail.com
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-03-26 08:35:36 +01:00
Jiri Pirko
f6b19b354d net: devlink: select NET_DEVLINK from drivers
Some drivers are becoming more dependent on NET_DEVLINK being selected
in configuration. With upcoming compat functions, the behavior would be
wrong in case devlink was not compiled in. So make the drivers select
NET_DEVLINK and rely on the functions being there, not just stubs.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:31 -04:00
Jiri Pirko
b8f975545c net: devlink: add port type spinlock
Add spinlock to protect port type and type_dev pointer consistency.
Without that, userspace may see inconsistent type and type_dev
combinations.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
v1->v2:
- rebased
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-24 14:55:31 -04:00
David S. Miller
d64fee0a03 mlx5-updates-2019-03-20
This series includes updates to mlx5 driver,
 
 1) Compiler warnings cleanup from Saeed Mahameed
 2) Parav Pandit simplifies sriov enable/disables
 3) Gustavo A. R. Silva, Removes a redundant assignment
 4) Moshe Shemesh, Adds Geneve tunnel stateless offload support
 5) Eli Britstein, Adds the Support for VLAN modify action and
    Replaces TC VLAN pop and push actions with VLAN modify
 
 Note: This series includes two simple non-mlx5 patches,
 
 1) Declare IANA_VXLAN_UDP_PORT definition in include/net/vxlan.h,
 and use it in some drivers.
 2) Declare GENEVE_UDP_PORT definition in include/net/geneve.h,
 and use it in mlx5 and nfp drivers.
 -----BEGIN PGP SIGNATURE-----
 
 iQEcBAABAgAGBQJclTLsAAoJEEg/ir3gV/o+7bEH/1sz4oKP2mfhKSbG/I/g7Q3D
 ifnccYq2EyXd1HzeglXpzLndO8wPve9qr/ANKrrKIYYCxc8FpCdb4aJD1Ucuylbb
 XHHdfbTIPMa3vjhKtR/Fydht4RkY5IBBsgXywBcNL3ofxmnleNt9JRSr76Yhr2sy
 Q3H30X+UvwAAQJBY1X+P8RiJcSklLu0UPG2KtTXcCz8YRgOWK0JtEiQyQu6yET4u
 zbVxYixwKgsR9uhwNXqLxVMsaWFue9cYmVSMLigDx7fRZvj6Ao9REEUflt1hCEoR
 jOXm1Avnsg9TKnwmgiBjrWQQQ4h+IMfZLK8EtuxVcraBUjtQRVnPak5JjZMjDuc=
 =7t4R
 -----END PGP SIGNATURE-----

Merge tag 'mlx5-updates-2019-03-20' of git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux

Saeed Mahameed says:

====================
mlx5-updates-2019-03-20

This series includes updates to mlx5 driver,

1) Compiler warnings cleanup from Saeed Mahameed
2) Parav Pandit simplifies sriov enable/disables
3) Gustavo A. R. Silva, Removes a redundant assignment
4) Moshe Shemesh, Adds Geneve tunnel stateless offload support
5) Eli Britstein, Adds the Support for VLAN modify action and
   Replaces TC VLAN pop and push actions with VLAN modify

Note: This series includes two simple non-mlx5 patches,

1) Declare IANA_VXLAN_UDP_PORT definition in include/net/vxlan.h,
and use it in some drivers.
2) Declare GENEVE_UDP_PORT definition in include/net/geneve.h,
and use it in mlx5 and nfp drivers.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 22:03:44 -04:00
Eric Dumazet
8b27dae5a2 tcp: add one skb cache for rx
Often times, recvmsg() system calls and BH handling for a particular
TCP socket are done on different cpus.

This means the incoming skb had to be allocated on a cpu,
but freed on another.

This incurs a high spinlock contention in slab layer for small rpc,
but also a high number of cache line ping pongs for larger packets.

A full size GRO packet might use 45 page fragments, meaning
that up to 45 put_page() can be involved.

More over performing the __kfree_skb() in the recvmsg() context
adds a latency for user applications, and increase probability
of trapping them in backlog processing, since the BH handler
might found the socket owned by the user.

This patch, combined with the prior one increases the rpc
performance by about 10 % on servers with large number of cores.

(tcp_rr workload with 10,000 flows and 112 threads reach 9 Mpps
 instead of 8 Mpps)

This also increases single bulk flow performance on 40Gbit+ links,
since in this case there are often two cpus working in tandem :

 - CPU handling the NIC rx interrupts, feeding the receive queue,
  and (after this patch) freeing the skbs that were consumed.

 - CPU in recvmsg() system call, essentially 100 % busy copying out
  data to user space.

Having at most one skb in a per-socket cache has very little risk
of memory exhaustion, and since it is protected by socket lock,
its management is essentially free.

Note that if rps/rfs is used, we do not enable this feature, because
there is high chance that the same cpu is handling both the recvmsg()
system call and the TCP rx path, but that another cpu did the skb
allocations in the device driver right before the RPS/RFS logic.

To properly handle this case, it seems we would need to record
on which cpu skb was allocated, and use a different channel
to give skbs back to this cpu.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 21:57:38 -04:00
Eric Dumazet
472c2e07ee tcp: add one skb cache for tx
On hosts with a lot of cores, RPC workloads suffer from heavy contention on slab spinlocks.

    20.69%  [kernel]       [k] queued_spin_lock_slowpath
     5.64%  [kernel]       [k] _raw_spin_lock
     3.83%  [kernel]       [k] syscall_return_via_sysret
     3.48%  [kernel]       [k] __entry_text_start
     1.76%  [kernel]       [k] __netif_receive_skb_core
     1.64%  [kernel]       [k] __fget

For each sendmsg(), we allocate one skb, and free it at the time ACK packet comes.

In many cases, ACK packets are handled by another cpus, and this unfortunately
incurs heavy costs for slab layer.

This patch uses an extra pointer in socket structure, so that we try to reuse
the same skb and avoid these expensive costs.

We cache at most one skb per socket so this should be safe as far as
memory pressure is concerned.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 21:57:38 -04:00
Eric Dumazet
dc05360fee net: convert rps_needed and rfs_needed to new static branch api
We prefer static_branch_unlikely() over static_key_false() these days.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 21:57:38 -04:00
Paolo Abeni
28cff537ef net: sched: add empty status flag for NOLOCK qdisc
The queue is marked not empty after acquiring the seqlock,
and it's up to the NOLOCK qdisc clearing such flag on dequeue.
Since the empty status lays on the same cache-line of the
seqlock, it's always hot on cache during the updates.

This makes the empty flag update a little bit loosy. Given
the lack of synchronization between enqueue and dequeue, this
is unavoidable.

v2 -> v3:
 - qdisc_is_empty() has a const argument (Eric)

v1 -> v2:
 - use really an 'empty' flag instead of 'not_empty', as
   suggested by Eric

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Ivan Vecera <ivecera@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-23 21:52:36 -04:00
Moshe Shemesh
bea964107f net: Add IANA_VXLAN_UDP_PORT definition to vxlan header file
Added IANA_VXLAN_UDP_PORT (4789) definition to vxlan header file so it
can be used by drivers instead of local definition.
Updated drivers which locally defined it as 4789 to use it.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Cc: John Hurley <john.hurley@netronome.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Cc: Yunsheng Lin <linyunsheng@huawei.com>
Cc: Peng Li <lipeng321@huawei.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-22 12:09:31 -07:00
Moshe Shemesh
974eff2b57 net: Move the definition of the default Geneve udp port to public header file
Move the definition of the default Geneve udp port from the geneve
source to the header file, so we can re-use it from drivers.
Modify existing drivers to use it.

Signed-off-by: Moshe Shemesh <moshe@mellanox.com>
Reviewed-by: Or Gerlitz <ogerlitz@mellanox.com>
Cc: John Hurley <john.hurley@netronome.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2019-03-22 12:09:31 -07:00
Johannes Berg
3b0f31f2b8 genetlink: make policy common to family
Since maxattr is common, the policy can't really differ sanely,
so make it common as well.

The only user that did in fact manage to make a non-common policy
is taskstats, which has to be really careful about it (since it's
still using a common maxattr!). This is no longer supported, but
we can fake it using pre_doit.

This reduces the size of e.g. nl80211.o (which has lots of commands):

   text	   data	    bss	    dec	    hex	filename
 398745	  14323	   2240	 415308	  6564c	net/wireless/nl80211.o (before)
 397913	  14331	   2240	 414484	  65314	net/wireless/nl80211.o (after)
--------------------------------
   -832      +8       0    -824

Which is obviously just 8 bytes for each command, and an added 8
bytes for the new policy pointer. I'm not sure why the ops list is
counted as .text though.

Most of the code transformations were done using the following spatch:
    @ops@
    identifier OPS;
    expression POLICY;
    @@
    struct genl_ops OPS[] = {
    ...,
     {
    -	.policy = POLICY,
     },
    ...
    };

    @@
    identifier ops.OPS;
    expression ops.POLICY;
    identifier fam;
    expression M;
    @@
    struct genl_family fam = {
            .ops = OPS,
            .maxattr = M,
    +       .policy = POLICY,
            ...
    };

This also gets rid of devlink_nl_cmd_region_read_dumpit() accessing
the cb->data as ops, which we want to change in a later genl patch.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-22 10:38:23 -04:00
Thomas Gleixner
671422b220 xfrm: Replace hrtimer tasklet with softirq hrtimer
Switch the timer to HRTIMER_MODE_SOFT, which executed the timer
callback in softirq context and remove the hrtimer_tasklet.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Signed-off-by: Anna-Maria Gleixner <anna-maria@linutronix.de>
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: David S. Miller <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Link: https://lkml.kernel.org/r/20190301224821.29843-3-bigeasy@linutronix.de
2019-03-22 14:35:31 +01:00
Julian Wiedmann
02afc7ad45 net: dst: remove gc leftovers
Get rid of some obsolete gc-related documentation and macros that were
missed in commit 5b7c9a8ff8 ("net: remove dst gc related code").

CC: Wei Wang <weiwan@google.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21 13:39:25 -07:00
David Ahern
9ab948a91b ipv4: Allow amount of dirty memory from fib resizing to be controllable
fib_trie implementation calls synchronize_rcu when a certain amount of
pages are dirty from freed entries. The number of pages was determined
experimentally in 2009 (commit c3059477fc).

At the current setting, synchronize_rcu is called often -- 51 times in a
second in one test with an average of an 8 msec delay adding a fib entry.
The total impact is a lot of slow down modifying the fib. This is seen
in the output of 'time' - the difference between real time and sys+user.
For example, using 720,022 single path routes and 'ip -batch'[1]:

    $ time ./ip -batch ipv4/routes-1-hops
    real    0m14.214s
    user    0m2.513s
    sys     0m6.783s

So roughly 35% of the actual time to install the routes is from the ip
command getting scheduled out, most notably due to synchronize_rcu (this
is observed using 'perf sched timehist').

This patch makes the amount of dirty memory configurable between 64k where
the synchronize_rcu is called often (small, low end systems that are memory
sensitive) to 64M where synchronize_rcu is called rarely during a large
FIB change (for high end systems with lots of memory). The default is 512kB
which corresponds to the current setting of 128 pages with a 4kB page size.

As an example, at 16MB the worst interval shows 4 calls to synchronize_rcu
in a second blocking for up to 30 msec in a single instance, and a total
of almost 100 msec across the 4 calls in the second. The trade off is
allowing FIB entries to consume more memory in a given time window but
but with much better fib insertion rates (~30% increase in prefixes/sec).
With this patch and net.ipv4.fib_sync_mem set to 16MB, the same batch
file runs in:

    $ time ./ip -batch ipv4/routes-1-hops
    real    0m9.692s
    user    0m2.491s
    sys     0m6.769s

So the dead time is reduced to about 1/2 second or <5% of the real time.

[1] 'ip' modified to not request ACK messages which improves route
    insertion times by about 20%

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21 13:29:53 -07:00
Davide Caratti
ee3bbfe806 net/sched: let actions use RCU to access 'goto_chain'
use RCU when accessing the action chain, to avoid use after free in the
traffic path when 'goto chain' is replaced on existing TC actions (see
script below). Since the control action is read in the traffic path
without holding the action spinlock, we need to explicitly ensure that
a->goto_chain is not NULL before dereferencing (i.e it's not sufficient
to rely on the value of TC_ACT_GOTO_CHAIN bits). Not doing so caused NULL
dereferences in tcf_action_goto_chain_exec() when the following script:

 # tc chain add dev dd0 chain 42 ingress protocol ip flower \
 > ip_proto udp action pass index 4
 # tc filter add dev dd0 ingress protocol ip flower \
 > ip_proto udp action csum udp goto chain 42 index 66
 # tc chain del dev dd0 chain 42 ingress
 (start UDP traffic towards dd0)
 # tc action replace action csum udp pass index 66

was run repeatedly for several hours.

Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Suggested-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21 13:26:42 -07:00
Davide Caratti
fe384e2fa3 net/sched: don't dereference a->goto_chain to read the chain index
callers of tcf_gact_goto_chain_index() can potentially read an old value
of the chain index, or even dereference a NULL 'goto_chain' pointer,
because 'goto_chain' and 'tcfa_action' are read in the traffic path
without caring of concurrent write in the control path. The most recent
value of chain index can be read also from a->tcfa_action (it's encoded
there together with TC_ACT_GOTO_CHAIN bits), so we don't really need to
dereference 'goto_chain': just read the chain id from the control action.

Fixes: e457d86ada ("net: sched: add couple of goto_chain helpers")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21 13:26:42 -07:00
Davide Caratti
85d0966fa5 net/sched: prepare TC actions to properly validate the control action
- pass a pointer to struct tcf_proto in each actions's init() handler,
  to allow validating the control action, checking whether the chain
  exists and (eventually) refcounting it.
- remove code that validates the control action after a successful call
  to the action's init() handler, and replace it with a test that forbids
  addition of actions having 'goto_chain' and NULL goto_chain pointer at
  the same time.
- add tcf_action_check_ctrlact(), that will validate the control action
  and eventually allocate the action 'goto_chain' within the init()
  handler.
- add tcf_action_set_ctrlact(), that will assign the control action and
  swap the current 'goto_chain' pointer with the new given one.

This disallows 'goto_chain' on actions that don't initialize it properly
in their init() handler, i.e. calling tcf_action_check_ctrlact() after
successful IDR reservation and then calling tcf_action_set_ctrlact()
to assign 'goto_chain' and 'tcf_action' consistently.

By doing this, the kernel does not leak anymore refcounts when a valid
'goto chain' handle is replaced in TC actions, causing kmemleak splats
like the following one:

 # tc chain add dev dd0 chain 42 ingress protocol ip flower \
 > ip_proto tcp action drop
 # tc chain add dev dd0 chain 43 ingress protocol ip flower \
 > ip_proto udp action drop
 # tc filter add dev dd0 ingress matchall \
 > action gact goto chain 42 index 66
 # tc filter replace dev dd0 ingress matchall \
 > action gact goto chain 43 index 66
 # echo scan >/sys/kernel/debug/kmemleak
 <...>
 unreferenced object 0xffff93c0ee09f000 (size 1024):
 comm "tc", pid 2565, jiffies 4295339808 (age 65.426s)
 hex dump (first 32 bytes):
   00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
   00 00 00 00 08 00 06 00 00 00 00 00 00 00 00 00  ................
 backtrace:
   [<000000009b63f92d>] tc_ctl_chain+0x3d2/0x4c0
   [<00000000683a8d72>] rtnetlink_rcv_msg+0x263/0x2d0
   [<00000000ddd88f8e>] netlink_rcv_skb+0x4a/0x110
   [<000000006126a348>] netlink_unicast+0x1a0/0x250
   [<00000000b3340877>] netlink_sendmsg+0x2c1/0x3c0
   [<00000000a25a2171>] sock_sendmsg+0x36/0x40
   [<00000000f19ee1ec>] ___sys_sendmsg+0x280/0x2f0
   [<00000000d0422042>] __sys_sendmsg+0x5e/0xa0
   [<000000007a6c61f9>] do_syscall_64+0x5b/0x180
   [<00000000ccd07542>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
   [<0000000013eaa334>] 0xffffffffffffffff

Fixes: db50514f9a ("net: sched: add termination action to allow goto chain")
Fixes: 97763dc0f4 ("net_sched: reject unknown tcfa_action values")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21 13:26:41 -07:00
David Ahern
c7a1ce397a ipv6: Change addrconf_f6i_alloc to use ip6_route_info_create
Change addrconf_f6i_alloc to generate a fib6_config and call
ip6_route_info_create. addrconf_f6i_alloc is the last caller to
fib6_info_alloc besides ip6_route_info_create, and there is no
reason for it to do its own initialization on a fib6_info.

Host routes need to be created even if the device is down, so add a
new flag, fc_ignore_dev_down, to fib6_config and update fib6_nh_init
to not error out if device is not up.

Notes on the conversion:
- ip_fib_metrics_init is the same as fib6_config has fc_mx set to NULL
  and fc_mx_len set to 0
- dst_nocount is handled by the RTF_ADDRCONF flag
- dst_host is handled by fc_dst_len = 128

nh_gw does not get set after the conversion to ip6_route_info_create
but it should not be set in addrconf_f6i_alloc since this is a host
route not a gateway route.

Everything else is a straight forward map between fib6_info and
fib6_config.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-21 10:16:54 -07:00
Stephen Suryaputra
0b03a5ca8b ipv6: Add icmp_echo_ignore_anycast for ICMPv6
In addition to icmp_echo_ignore_multicast, there is a need to also
prevent responding to pings to anycast addresses for security.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-20 16:29:37 -07:00
Vakul Garg
f295b3ae9f net/tls: Add support of AES128-CCM based ciphers
Added support for AES128-CCM based record encryption. AES128-CCM is
similar to AES128-GCM. Both of them have same salt/iv/mac size. The
notable difference between the two is that while invoking AES128-CCM
operation, the salt||nonce (which is passed as IV) has to be prefixed
with a hardcoded value '2'. Further, CCM implementation in kernel
requires IV passed in crypto_aead_request() to be full '16' bytes.
Therefore, the record structure 'struct tls_rec' has been modified to
reserve '16' bytes for IV. This works for both GCM and CCM based cipher.

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-20 11:02:05 -07:00
Steffen Klassert
bfc01ddff2 Revert "net: xfrm: Add '_rcu' tag for rcu protected pointer in netns_xfrm"
This reverts commit f10e0010fa.

This commit was just wrong. It caused a lot of
syzbot warnings, so just revert it.

Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-03-20 17:56:13 +01:00
Stephen Suryaputra
03f1eccc7a ipv6: Add icmp_echo_ignore_multicast support for ICMPv6
IPv4 has icmp_echo_ignore_broadcast to prevent responding to broadcast pings.
IPv6 needs a similar mechanism.

v1->v2:
- Remove NET_IPV6_ICMP_ECHO_IGNORE_MULTICAST.

Signed-off-by: Stephen Suryaputra <ssuryaextr@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19 14:29:51 -07:00
Guillaume Nault
9403cf2302 tcp: free request sock directly upon TFO or syncookies error
Since the request socket is created locally, it'd make more sense to
use reqsk_free() instead of reqsk_put() in TFO and syncookies' error
path.

However, tcp_get_cookie_sock() may set ->rsk_refcnt before freeing the
socket; tcp_conn_request() may also have non-null ->rsk_refcnt because
of tcp_try_fastopen(). In both cases 'req' hasn't been exposed
to the outside world and is safe to free immediately, but that'd
trigger the WARN_ON_ONCE in reqsk_free().

Define __reqsk_free() for these situations where we know nobody's
referencing the socket, even though ->rsk_refcnt might be non-null.
Now we can consolidate the error path of tcp_get_cookie_sock() and
tcp_conn_request().

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-19 14:13:01 -07:00
Xin Long
273160ffc6 sctp: get sctphdr by offset in sctp_compute_cksum
sctp_hdr(skb) only works when skb->transport_header is set properly.

But in Netfilter, skb->transport_header for ipv6 is not guaranteed
to be right value for sctphdr. It would cause to fail to check the
checksum for sctp packets.

So fix it by using offset, which is always right in all places.

v1->v2:
  - Fix the changelog.

Fixes: e6d8b64b34 ("net: sctp: fix and consolidate SCTP checksumming code")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18 18:16:12 -07:00
Maxime Chevallier
a4dc6a4915 packets: Always register packet sk in the same order
When using fanouts with AF_PACKET, the demux functions such as
fanout_demux_cpu will return an index in the fanout socket array, which
corresponds to the selected socket.

The ordering of this array depends on the order the sockets were added
to a given fanout group, so for FANOUT_CPU this means sockets are bound
to cpus in the order they are configured, which is OK.

However, when stopping then restarting the interface these sockets are
bound to, the sockets are reassigned to the fanout group in the reverse
order, due to the fact that they were inserted at the head of the
interface's AF_PACKET socket list.

This means that traffic that was directed to the first socket in the
fanout group is now directed to the last one after an interface restart.

In the case of FANOUT_CPU, traffic from CPU0 will be directed to the
socket that used to receive traffic from the last CPU after an interface
restart.

This commit introduces a helper to add a socket at the tail of a list,
then uses it to register AF_PACKET sockets.

Note that this changes the order in which sockets are listed in /proc and
with sock_diag.

Fixes: dc99f60069 ("packet: Add fanout support")
Signed-off-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-18 17:58:39 -07:00
David S. Miller
0aedadcf6b Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2019-03-16

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a umem memory leak on cleanup in AF_XDP, from Björn.

2) Fix BTF to properly resolve forward-declared enums into their corresponding
   full enum definition types during deduplication, from Andrii.

3) Fix libbpf to reject invalid flags in xsk_socket__create(), from Magnus.

4) Fix accessing invalid pointer returned from bpf_tcp_sock() and
   bpf_sk_fullsock() after bpf_sk_release() was called, from Martin.

5) Fix generation of load/store DW instructions in PPC JIT, from Naveen.

6) Various fixes in BPF helper function documentation in bpf.h UAPI header
   used to bpf-helpers(7) man page, from Quentin.

7) Fix segfault in BPF test_progs when prog loading failed, from Yonghong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-16 12:20:08 -07:00
Björn Töpel
044175a067 xsk: fix umem memory leak on cleanup
When the umem is cleaned up, the task that created it might already be
gone. If the task was gone, the xdp_umem_release function did not free
the pages member of struct xdp_umem.

It turned out that the task lookup was not needed at all; The code was
a left-over when we moved from task accounting to user accounting [1].

This patch fixes the memory leak by removing the task lookup logic
completely.

[1] https://lore.kernel.org/netdev/20180131135356.19134-3-bjorn.topel@gmail.com/

Link: https://lore.kernel.org/netdev/c1cb2ca8-6a14-3980-8672-f3de0bb38dfd@suse.cz/
Fixes: c0c77d8fb7 ("xsk: add user memory registration support sockopt")
Reported-by: Jiri Slaby <jslaby@suse.cz>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-03-16 01:27:51 +01:00
Linus Torvalds
f3ca4c55a6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "More fixes in the queue:

  1) Netfilter nat can erroneously register the device notifier twice,
     fix from Florian Westphal.

  2) Use after free in nf_tables, from Pablo Neira Ayuso.

  3) Parallel update of steering rule fix in mlx5 river, from Eli
     Britstein.

  4) RX processing panic in lan743x, fix from Bryan Whitehead.

  5) Use before initialization of TCP_SKB_CB, fix from Christoph Paasch.

  6) Fix locking in SRIOV mode of mlx4 driver, from Jack Morgenstein.

  7) Fix TX stalls in lan743x due to mishandling of interrupt ACKing
     modes, from Bryan Whitehead.

  8) Fix infoleak in l2tp_ip6_recvmsg(), from Eric Dumazet"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (43 commits)
  pptp: dst_release sk_dst_cache in pptp_sock_destruct
  MAINTAINERS: GENET & SYSTEMPORT: Add internal Broadcom list
  l2tp: fix infoleak in l2tp_ip6_recvmsg()
  net/tls: Inform user space about send buffer availability
  net_sched: return correct value for *notify* functions
  lan743x: Fix TX Stall Issue
  net/mlx4_core: Fix qp mtt size calculation
  net/mlx4_core: Fix locking in SRIOV mode when switching between events and polling
  net/mlx4_core: Fix reset flow when in command polling mode
  mlxsw: minimal: Initialize base_mac
  mlxsw: core: Prevent duplication during QSFP module initialization
  net: dwmac-sun8i: fix a missing check of of_get_phy_mode
  net: sh_eth: fix a missing check of of_get_phy_mode
  net: 8390: fix potential NULL pointer dereferences
  net: fujitsu: fix a potential NULL pointer dereference
  net: qlogic: fix a potential NULL pointer dereference
  isdn: hfcpci: fix potential NULL pointer dereference
  Documentation: devicetree: add a new optional property for port mac address
  net: rocker: fix a potential NULL pointer dereference
  net: qlge: fix a potential NULL pointer dereference
  ...
2019-03-14 09:28:12 -07:00
Kent Overstreet
2075e50caf sctp: convert to genradix
This also makes sctp_stream_alloc_(out|in) saner, in that they no longer
allocate new flex_arrays/genradixes, they just preallocate more
elements.

This code does however have a suspicious lack of locking.

Link: http://lkml.kernel.org/r/20181217131929.11727-7-kent.overstreet@gmail.com
Signed-off-by: Kent Overstreet <kent.overstreet@gmail.com>
Cc: Vlad Yasevich <vyasevich@gmail.com>
Cc: Neil Horman <nhorman@tuxdriver.com>
Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Cc: Alexey Dobriyan <adobriyan@gmail.com>
Cc: Al Viro <viro@zeniv.linux.org.uk>
Cc: Dave Hansen <dave.hansen@intel.com>
Cc: Eric Paris <eparis@parisplace.org>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Paul Moore <paul@paul-moore.com>
Cc: Pravin B Shelar <pshelar@ovn.org>
Cc: Shaohua Li <shli@kernel.org>
Cc: Stephen Smalley <sds@tycho.nsa.gov>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-03-12 10:04:02 -07:00
David S. Miller
ab0891f5d8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for your net tree:

1) Fix list corruption in device notifier in the masquerade
   infrastructure, from Florian Westphal.

2) Fix double-free of sets and use-after-free when deleting elements.

3) Don't bogusly return EBUSY when removing a set after flush command.

4) Use-after-free in dynamically allocate operations.

5) Don't report a new ruleset generation to userspace if transaction
   list is empty, this invalidates the userspace cache innecessarily.
   From Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-11 16:14:14 -07:00
Linus Torvalds
8f49a658b4 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "First batch of fixes in the new merge window:

   1) Double dst_cache free in act_tunnel_key, from Wenxu.

   2) Avoid NULL deref in IN_DEV_MFORWARD() by failing early in the
      ip_route_input_rcu() path, from Paolo Abeni.

   3) Fix appletalk compile regression, from Arnd Bergmann.

   4) If SLAB objects reach the TCP sendpage method we are in serious
      trouble, so put a debugging check there. From Vasily Averin.

   5) Memory leak in hsr layer, from Mao Wenan.

   6) Only test GSO type on GSO packets, from Willem de Bruijn.

   7) Fix crash in xsk_diag_put_umem(), from Eric Dumazet.

   8) Fix VNIC mailbox length in nfp, from Dirk van der Merwe.

   9) Fix race in ipv4 route exception handling, from Xin Long.

  10) Missing DMA memory barrier in hns3 driver, from Jian Shen.

  11) Use after free in __tcf_chain_put(), from Vlad Buslov.

  12) Handle inet_csk_reqsk_queue_add() failures, from Guillaume Nault.

  13) Return value correction when ip_mc_may_pull() fails, from Eric
      Dumazet.

  14) Use after free in x25_device_event(), also from Eric"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (72 commits)
  gro_cells: make sure device is up in gro_cells_receive()
  vxlan: test dev->flags & IFF_UP before calling gro_cells_receive()
  net/x25: fix use-after-free in x25_device_event()
  isdn: mISDNinfineon: fix potential NULL pointer dereference
  net: hns3: fix to stop multiple HNS reset due to the AER changes
  ip: fix ip_mc_may_pull() return value
  net: keep refcount warning in reqsk_free()
  net: stmmac: Avoid one more sometimes uninitialized Clang warning
  net: dsa: mv88e6xxx: Set correct interface mode for CPU/DSA ports
  rxrpc: Fix client call queueing, waiting for channel
  tcp: handle inet_csk_reqsk_queue_add() failures
  net: ethernet: sun: Zero initialize class in default case in niu_add_ethtool_tcam_entry
  8139too : Add support for U.S. Robotics USR997901A 10/100 Cardbus NIC
  fou, fou6: avoid uninit-value in gue_err() and gue6_err()
  net: sched: fix potential use-after-free in __tcf_chain_put()
  vhost: silence an unused-variable warning
  vsock/virtio: fix kernel panic from virtio_transport_reset_no_sock
  connector: fix unsafe usage of ->real_parent
  vxlan: do not need BH again in vxlan_cleanup()
  net: hns3: add dma_rmb() for rx description
  ...
2019-03-11 08:54:01 -07:00
Pablo Neira Ayuso
273fe3f100 netfilter: nf_tables: bogus EBUSY when deleting set after flush
Set deletion after flush coming in the same batch results in EBUSY. Add
set use counter to track the number of references to this set from
rules. We cannot rely on the list of bindings for this since such list
is still populated from the preparation phase.

Reported-by: Václav Zindulka <vaclav.zindulka@tlapnet.cz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-11 13:19:24 +01:00
Guillaume Nault
1039c6e193 net: keep refcount warning in reqsk_free()
As Eric Dumazet said, "We do not have a way to tell if the req was ever
inserted in a hash table, so better play safe.".
Let's remove this comment, so that nobody will be tempted to drop the
WARN_ON_ONCE() line.

Signed-off-by: Guillaume Nault <gnault@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-09 19:51:44 -08:00
Linus Torvalds
38e7571c07 io_uring-2019-03-06
-----BEGIN PGP SIGNATURE-----
 
 iQJEBAABCAAuFiEEwPw5LcreJtl1+l5K99NY+ylx4KYFAlyAJvAQHGF4Ym9lQGtl
 cm5lbC5kawAKCRD301j7KXHgphb+EACFaKI2HIdjExQ5T7Cxebzwky+Qiro3FV55
 ziW00FZrkJ5g0h4ItBzh/5SDlcNQYZDMlA3s4xzWIMadWl5PjMPq1uJul0cITbSl
 WIJO5hpgNMXeUEhvcXUl6+f/WzpgYUxN40uW8N5V7EKlooaFVfudDqJGlvEv+UgB
 g8NWQYThSG+/e7r9OGwK0xDRVKfpjxVvmqmnDH3DrxKaDgSOwTf4xn1u41wKwfQ3
 3uPfQ+GBeTqt4a2AhOi7K6KQFNnj5Jz5CXYMiOZI2JGtLPcL6dmyBVD7K0a0HUr+
 rs4ghNdd1+puvPGNK4TX8qV0uiNrMctoRNVA/JDd1ZTYEKTmNLxeFf+olfYHlwuK
 K5FRs60/lgNzNkzcUpFvJHitPwYtxYJdB36PyswE1FZP1YviEeVoKNt9W8aIhEoA
 549uj90brfA74eCINGhq98pJqj9CNyCPw3bfi76f5Ej2utwYDb9S5Cp2gfSa853X
 qc/qNda9efEq7ikwCbPzhekRMXZo6TSXtaSmC2C+Vs5+mD1Scc4kdAvdCKGQrtr9
 aoy0iQMYO2NDZ/G5fppvXtMVuEPAZWbsGftyOe15IlMysjRze2ycJV8cFahKEVM9
 uBeXLyH1pqGU/j7ABP4+XRZ/sbHJTwjKJbnXhTgBsdU8XO/CR3U+kRQFTsidKMfH
 Wlo3uH2h2A==
 =p78E
 -----END PGP SIGNATURE-----

Merge tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block

Pull io_uring IO interface from Jens Axboe:
 "Second attempt at adding the io_uring interface.

  Since the first one, we've added basic unit testing of the three
  system calls, that resides in liburing like the other unit tests that
  we have so far. It'll take a while to get full coverage of it, but
  we're working towards it. I've also added two basic test programs to
  tools/io_uring. One uses the raw interface and has support for all the
  various features that io_uring supports outside of standard IO, like
  fixed files, fixed IO buffers, and polled IO. The other uses the
  liburing API, and is a simplified version of cp(1).

  This adds support for a new IO interface, io_uring.

  io_uring allows an application to communicate with the kernel through
  two rings, the submission queue (SQ) and completion queue (CQ) ring.
  This allows for very efficient handling of IOs, see the v5 posting for
  some basic numbers:

    https://lore.kernel.org/linux-block/20190116175003.17880-1-axboe@kernel.dk/

  Outside of just efficiency, the interface is also flexible and
  extendable, and allows for future use cases like the upcoming NVMe
  key-value store API, networked IO, and so on. It also supports async
  buffered IO, something that we've always failed to support in the
  kernel.

  Outside of basic IO features, it supports async polled IO as well.
  This particular feature has already been tested at Facebook months ago
  for flash storage boxes, with 25-33% improvements. It makes polled IO
  actually useful for real world use cases, where even basic flash sees
  a nice win in terms of efficiency, latency, and performance. These
  boxes were IOPS bound before, now they are not.

  This series adds three new system calls. One for setting up an
  io_uring instance (io_uring_setup(2)), one for submitting/completing
  IO (io_uring_enter(2)), and one for aux functions like registrating
  file sets, buffers, etc (io_uring_register(2)). Through the help of
  Arnd, I've coordinated the syscall numbers so merge on that front
  should be painless.

  Jon did a writeup of the interface a while back, which (except for
  minor details that have been tweaked) is still accurate. Find that
  here:

    https://lwn.net/Articles/776703/

  Huge thanks to Al Viro for helping getting the reference cycle code
  correct, and to Jann Horn for his extensive reviews focused on both
  security and bugs in general.

  There's a userspace library that provides basic functionality for
  applications that don't need or want to care about how to fiddle with
  the rings directly. It has helpers to allow applications to easily set
  up an io_uring instance, and submit/complete IO through it without
  knowing about the intricacies of the rings. It also includes man pages
  (thanks to Jeff Moyer), and will continue to grow support helper
  functions and features as time progresses. Find it here:

    git://git.kernel.dk/liburing

  Fio has full support for the raw interface, both in the form of an IO
  engine (io_uring), but also with a small test application (t/io_uring)
  that can exercise and benchmark the interface"

* tag 'io_uring-2019-03-06' of git://git.kernel.dk/linux-block:
  io_uring: add a few test tools
  io_uring: allow workqueue item to handle multiple buffered requests
  io_uring: add support for IORING_OP_POLL
  io_uring: add io_kiocb ref count
  io_uring: add submission polling
  io_uring: add file set registration
  net: split out functions related to registering inflight socket files
  io_uring: add support for pre-mapped user IO buffers
  block: implement bio helper to add iter bvec pages to bio
  io_uring: batch io_kiocb allocation
  io_uring: use fget/fput_many() for file references
  fs: add fget_many() and fput_many()
  io_uring: support for IO polling
  io_uring: add fsync support
  Add io_uring IO interface
2019-03-08 14:48:40 -08:00
Pablo Neira Ayuso
40ba1d9b4d netfilter: nf_tables: fix set double-free in abort path
The abort path can cause a double-free of an anonymous set.
Added-and-to-be-aborted rule looks like this:

udp dport { 137, 138 } drop

The to-be-aborted transaction list looks like this:

newset
newsetelem
newsetelem
rule

This gets walked in reverse order, so first pass disables the rule, the
set elements, then the set.

After synchronize_rcu(), we then destroy those in same order: rule, set
element, set element, newset.

Problem is that the anonymous set has already been bound to the rule, so
the rule (lookup expression destructor) already frees the set, when then
cause use-after-free when trying to delete the elements from this set,
then try to free the set again when handling the newset expression.

Rule releases the bound set in first place from the abort path, this
causes the use-after-free on set element removal when undoing the new
element transactions. To handle this, skip new element transaction if
set is bound from the abort path.

This is still causes the use-after-free on set element removal.  To
handle this, remove transaction from the list when the set is already
bound.

Joint work with Florian Westphal.

Fixes: f6ac858589 ("netfilter: nf_tables: unbind set in rule from commit path")
Bugzilla: https://bugzilla.netfilter.org/show_bug.cgi?id=1325
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-08 16:41:18 +01:00
Su Yanjun
f10e0010fa net: xfrm: Add '_rcu' tag for rcu protected pointer in netns_xfrm
For rcu protected pointers, we'd better add '__rcu' for them.

Once added '__rcu' tag for rcu protected pointer, the sparse tool reports
warnings.

net/xfrm/xfrm_user.c:1198:39: sparse:    expected struct sock *sk
net/xfrm/xfrm_user.c:1198:39: sparse:    got struct sock [noderef] <asn:4> *nlsk
[...]

So introduce a new wrapper function of nlmsg_unicast  to handle type
conversions.

This patch also fixes a direct access of a rcu protected socket.

Fixes: be33690d8fcf("[XFRM]: Fix aevent related crash")
Signed-off-by: Su Yanjun <suyj.fnst@cn.fujitsu.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-03-08 13:17:31 +01:00
David S. Miller
18a4d8bf25 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-03-04 13:26:15 -08:00
Eran Ben Elisha
3167b27a7d devlink: Add support for direct reporter health state update
It is possible that a reporter state will be updated due to a recover flow
which is not triggered by a devlink health related operation, but as a side
effect of some other operation in the system.

Expose devlink health API for a direct update of a reporter status.

Move devlink_health_reporter_state enum definition to devlink.h so it could
be used from drivers as a parameter of devlink_health_reporter_state_update.

In addition, add trace_devlink_health_reporter_state_update to provide user
notification for reporter state change.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-04 11:00:43 -08:00
Boris Pismenny
7463d3a2db tls: Fix write space handling
TLS device cannot use the sw context. This patch returns the original
tls device write space handler and moves the sw/device specific portions
to the relevant files.

Also, we remove the write_space call for the tls_sw flow, because it
handles partial records in its delayed tx work handler.

Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Reviewed-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-03 22:10:16 -08:00
Boris Pismenny
94850257cf tls: Fix tls_device handling of partial records
Cleanup the handling of partial records while fixing a bug where the
tls_push_pending_closed_record function is using the software tls
context instead of the hardware context.

The bug resulted in the following crash:
[   88.791229] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[   88.793271] #PF error: [normal kernel read fault]
[   88.794449] PGD 800000022a426067 P4D 800000022a426067 PUD 22a156067 PMD 0
[   88.795958] Oops: 0000 [#1] SMP PTI
[   88.796884] CPU: 2 PID: 4973 Comm: openssl Not tainted 5.0.0-rc4+ #3
[   88.798314] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[   88.800067] RIP: 0010:tls_tx_records+0xef/0x1d0 [tls]
[   88.801256] Code: 00 02 48 89 43 08 e8 a0 0b 96 d9 48 89 df e8 48 dd
4d d9 4c 89 f8 4d 8b bf 98 00 00 00 48 05 98 00 00 00 48 89 04 24 49 39
c7 <49> 8b 1f 4d 89 fd 0f 84 af 00 00 00 41 8b 47 10 85 c0 0f 85 8d 00
[   88.805179] RSP: 0018:ffffbd888186fca8 EFLAGS: 00010213
[   88.806458] RAX: ffff9af1ed657c98 RBX: ffff9af1e88a1980 RCX: 0000000000000000
[   88.808050] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff9af1e88a1980
[   88.809724] RBP: ffff9af1e88a1980 R08: 0000000000000017 R09: ffff9af1ebeeb700
[   88.811294] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[   88.812917] R13: ffff9af1e88a1980 R14: ffff9af1ec13f800 R15: 0000000000000000
[   88.814506] FS:  00007fcad2240740(0000) GS:ffff9af1f7880000(0000) knlGS:0000000000000000
[   88.816337] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   88.817717] CR2: 0000000000000000 CR3: 0000000228b3e000 CR4: 00000000001406e0
[   88.819328] Call Trace:
[   88.820123]  tls_push_data+0x628/0x6a0 [tls]
[   88.821283]  ? remove_wait_queue+0x20/0x60
[   88.822383]  ? n_tty_read+0x683/0x910
[   88.823363]  tls_device_sendmsg+0x53/0xa0 [tls]
[   88.824505]  sock_sendmsg+0x36/0x50
[   88.825492]  sock_write_iter+0x87/0x100
[   88.826521]  __vfs_write+0x127/0x1b0
[   88.827499]  vfs_write+0xad/0x1b0
[   88.828454]  ksys_write+0x52/0xc0
[   88.829378]  do_syscall_64+0x5b/0x180
[   88.830369]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   88.831603] RIP: 0033:0x7fcad1451680

[ 1248.470626] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 1248.472564] #PF error: [normal kernel read fault]
[ 1248.473790] PGD 0 P4D 0
[ 1248.474642] Oops: 0000 [#1] SMP PTI
[ 1248.475651] CPU: 3 PID: 7197 Comm: openssl Tainted: G           OE 5.0.0-rc4+ #3
[ 1248.477426] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[ 1248.479310] RIP: 0010:tls_tx_records+0x110/0x1f0 [tls]
[ 1248.480644] Code: 00 02 48 89 43 08 e8 4f cb 63 d7 48 89 df e8 f7 9c
1b d7 4c 89 f8 4d 8b bf 98 00 00 00 48 05 98 00 00 00 48 89 04 24 49 39
c7 <49> 8b 1f 4d 89 fd 0f 84 af 00 00 00 41 8b 47 10 85 c0 0f 85 8d 00
[ 1248.484825] RSP: 0018:ffffaa0a41543c08 EFLAGS: 00010213
[ 1248.486154] RAX: ffff955a2755dc98 RBX: ffff955a36031980 RCX: 0000000000000006
[ 1248.487855] RDX: 0000000000000000 RSI: 000000000000002b RDI: 0000000000000286
[ 1248.489524] RBP: ffff955a36031980 R08: 0000000000000000 R09: 00000000000002b1
[ 1248.491394] R10: 0000000000000003 R11: 00000000ad55ad55 R12: 0000000000000000
[ 1248.493162] R13: 0000000000000000 R14: ffff955a2abe6c00 R15: 0000000000000000
[ 1248.494923] FS:  0000000000000000(0000) GS:ffff955a378c0000(0000) knlGS:0000000000000000
[ 1248.496847] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1248.498357] CR2: 0000000000000000 CR3: 000000020c40e000 CR4: 00000000001406e0
[ 1248.500136] Call Trace:
[ 1248.500998]  ? tcp_check_oom+0xd0/0xd0
[ 1248.502106]  tls_sk_proto_close+0x127/0x1e0 [tls]
[ 1248.503411]  inet_release+0x3c/0x60
[ 1248.504530]  __sock_release+0x3d/0xb0
[ 1248.505611]  sock_close+0x11/0x20
[ 1248.506612]  __fput+0xb4/0x220
[ 1248.507559]  task_work_run+0x88/0xa0
[ 1248.508617]  do_exit+0x2cb/0xbc0
[ 1248.509597]  ? core_sys_select+0x17a/0x280
[ 1248.510740]  do_group_exit+0x39/0xb0
[ 1248.511789]  get_signal+0x1d0/0x630
[ 1248.512823]  do_signal+0x36/0x620
[ 1248.513822]  exit_to_usermode_loop+0x5c/0xc6
[ 1248.515003]  do_syscall_64+0x157/0x180
[ 1248.516094]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[ 1248.517456] RIP: 0033:0x7fb398bd3f53
[ 1248.518537] Code: Bad RIP value.

Fixes: a42055e8d2 ("net/tls: Add support for async encryption of records for performance")
Signed-off-by: Boris Pismenny <borisp@mellanox.com>
Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-03 22:10:16 -08:00
Tristram Ha
88b573af91 net: dsa: add KSZ9893 switch tagging support
KSZ9893 switch is similar to KSZ9477 switch except the ingress tail tag
has 1 byte instead of 2 bytes.  The size of the portmap is smaller and
so the override and lookup bits are also moved.

Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-03 13:48:49 -08:00
Eric Dumazet
46b1c18f9d net: sched: put back q.qlen into a single location
In the series fc8b81a598 ("Merge branch 'lockless-qdisc-series'")
John made the assumption that the data path had no need to read
the qdisc qlen (number of packets in the qdisc).

It is true when pfifo_fast is used as the root qdisc, or as direct MQ/MQPRIO
children.

But pfifo_fast can be used as leaf in class full qdiscs, and existing
logic needs to access the child qlen in an efficient way.

HTB breaks badly, since it uses cl->leaf.q->q.qlen in :
  htb_activate() -> WARN_ON()
  htb_dequeue_tree() to decide if a class can be htb_deactivated
  when it has no more packets.

HFSC, DRR, CBQ, QFQ have similar issues, and some calls to
qdisc_tree_reduce_backlog() also read q.qlen directly.

Using qdisc_qlen_sum() (which iterates over all possible cpus)
in the data path is a non starter.

It seems we have to put back qlen in a central location,
at least for stable kernels.

For all qdisc but pfifo_fast, qlen is guarded by the qdisc lock,
so the existing q.qlen{++|--} are correct.

For 'lockless' qdisc (pfifo_fast so far), we need to use atomic_{inc|dec}()
because the spinlock might be not held (for example from
pfifo_fast_enqueue() and pfifo_fast_dequeue())

This patch adds atomic_qlen (in the same location than qlen)
and renames the following helpers, since we want to express
they can be used without qdisc lock, and that qlen is no longer percpu.

- qdisc_qstats_cpu_qlen_dec -> qdisc_qstats_atomic_qlen_dec()
- qdisc_qstats_cpu_qlen_inc -> qdisc_qstats_atomic_qlen_inc()

Later (net-next) we might revert this patch by tracking all these
qlen uses and replace them by a more efficient method (not having
to access a precise qlen, but an empty/non_empty status that might
be less expensive to maintain/track).

Another possibility is to have a legacy pfifo_fast version that would
be used when used a a child qdisc, since the parent qdisc needs
a spinlock anyway. But then, future lockless qdiscs would also
have the same problem.

Fixes: 7e66016f2c ("net: sched: helpers to sum qlen and qlen for per cpu logic")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: John Fastabend <john.fastabend@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-02 14:10:18 -08:00
David S. Miller
4e7df119d9 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for net-next:

1) Add .release_ops to properly unroll .select_ops, use it from nft_compat.
   After this change, we can remove list of extensions too to simplify this
   codebase.

2) Update amanda conntrack helper to support v3.4, from Florian Tham.

3) Get rid of the obsolete BUGPRINT macro in ebtables, from
   Florian Westphal.

4) Merge IPv4 and IPv6 masquerading infrastructure into one single module.
   From Florian Westphal.

5) Patchset to remove nf_nat_l3proto structure to get rid of
   indirections, from Florian Westphal.

6) Skip unnecessary conntrack timeout updates in case the value is
   still the same, also from Florian Westphal.

7) Remove unnecessary 'fall through' comments in empty switch cases,
   from Li RongQing.

8) Fix lookup to fixed size hashtable sets on big endian with 32-bit keys.

9) Incorrect logic to deactivate path of fixed size hashtable sets,
   element was being tested to self.

10) Remove nft_hash_key(), the bitmap set is always selected for 16-bit
    keys.

11) Use boolean whenever possible in IPVS codebase, from Andrea Claudi.

12) Enter close state in conntrack if RST matches exact sequence number,
    from Florian Westphal.

13) Initialize dst_cache in tunnel extension, from wenxu.

14) Pass protocol as u16 to xt_check_match and xt_check_target, from
    Li RongQing.

15) SCTP header is granted to be in a linear area from IPVS NAT handler,
    from Xin Long.

16) Don't steal packets coming from slave VRF device from the
    ip_sabotage_in() path, from David Ahern.

17) Fix unsafe update of basechain stats, from Li RongQing.

18) Make sure CONNTRACK_LOCKS is power of 2 to let compiler optimize
    modulo operation as bitwise AND, from Li RongQing.

19) Use device_attribute instead of internal definition in the IDLETIMER
    target, from Sami Tolvanen.

20) Merge redir, masq and IPv4/IPv6 NAT chain types, from Florian Westphal.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-02 14:01:04 -08:00
David S. Miller
2369afb669 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2019-03-02

Here's one more bluetooth-next pull request for the 5.1 kernel:

 - Added support for MediaTek MT7663U and MT7668U UART devices
 - Cleanups & fixes to the hci_qca driver
 - Fixed wakeup pin behavior for QCA6174A controller

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-02 13:55:36 -08:00
David S. Miller
9eb359140c Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-03-02 12:54:35 -08:00
Florian Fainelli
91cf8eceff switchdev: Remove unused transaction item queue
There are no more in tree users of the
switchdev_trans_item_{dequeue,enqueue} or switchdev_trans_item structure
in the kernel since commit 00fc0c51e3 ("rocker: Change world_ops API
and implementation to be switchdev independant").

Remove this unused code and update the documentation accordingly since.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01 21:35:19 -08:00
Hangbin Liu
5e1a99eae8 ipv4: Add ICMPv6 support when parse route ipproto
For ip rules, we need to use 'ipproto ipv6-icmp' to match ICMPv6 headers.
But for ip -6 route, currently we only support tcp, udp and icmp.

Add ICMPv6 support so we can match ipv6-icmp rules for route lookup.

v2: As David Ahern and Sabrina Dubroca suggested, Add an argument to
rtm_getroute_parse_ip_proto() to handle ICMP/ICMPv6 with different family.

Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: eacb9384a3 ("ipv6: support sport, dport and ip_proto in RTM_GETROUTE")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-03-01 16:41:27 -08:00
Florian Westphal
a9ce849e78 netfilter: nf_tables: nat: merge nft_masq protocol specific modules
The family specific masq modules are way too small to warrant
an extra module, just place all of them in nft_masq.

before:
  text	   data	    bss	    dec	    hex	filename
   1001	    832	      0	   1833	    729	nft_masq.ko
    766	    896	      0	   1662	    67e	nft_masq_ipv4.ko
    764	    896	      0	   1660	    67c	nft_masq_ipv6.ko

after:
   2010	    960	      0	   2970	    b9a	nft_masq.ko

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:36:59 +01:00
Florian Westphal
c78efc99c7 netfilter: nf_tables: nat: merge nft_redir protocol specific modules
before:
 text	   data	    bss	    dec	    hex	filename
 990	    832	      0	   1822	    71e nft_redir.ko
 697	    896	      0	   1593	    639 nft_redir_ipv4.ko
 713	    896	      0	   1609	    649	nft_redir_ipv6.ko

after:
 text	   data	    bss	    dec	    hex	filename
 1910	    960	      0	   2870	    b36	nft_redir.ko

size is reduced, all helpers from nft_redir.ko can be made static.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-03-01 14:36:58 +01:00
Jens Axboe
f4e65870e5 net: split out functions related to registering inflight socket files
We need this functionality for the io_uring file registration, but
we cannot rely on it since CONFIG_UNIX can be modular. Move the helpers
to a separate file, that's always builtin to the kernel if CONFIG_UNIX is
m/y.

No functional changes in this patch, just moving code around.

Reviewed-by: Hannes Reinecke <hare@suse.com>
Acked-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
2019-02-28 08:24:23 -07:00
wenxu
24ba14406c route: Add multipath_hash in flowi_common to make user-define hash
Current fib_multipath_hash_policy can make hash based on the L3 or
L4. But it only work on the outer IP. So a specific tunnel always
has the same hash value. But a specific tunnel may contain so many
inner connections.

This patch provide a generic multipath_hash in floi_common. It can
make a user-define hash which can mix with L3 or L4 hash.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-27 12:50:17 -08:00
Florian Fainelli
3d705f07d1 net: Remove switchdev_ops
Now that we have converted all possible callers to using a switchdev
notifier for attributes we do not have a need for implementing
switchdev_ops anymore, and this can be removed from all drivers the
net_device structure.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-27 12:39:56 -08:00
Florian Fainelli
1cb33af1f7 switchdev: Add SWITCHDEV_PORT_ATTR_SET
In preparation for allowing switchdev enabled drivers to veto specific
attribute settings from within the context of the caller, introduce a
new switchdev notifier type for port attributes.

Suggested-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-27 12:39:55 -08:00
Florian Westphal
cc16921351 netfilter: conntrack: avoid same-timeout update
No need to dirty a cache line if timeout is unchanged.
Also, WARN() is useless here: we crash on 'skb->len' access
if skb is NULL.

Last, ct->timeout is u32, not 'unsigned long' so adapt the
function prototype accordingly.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:58:21 +01:00
Florian Westphal
d2c5c103b1 netfilter: nat: remove nf_nat_l3proto.h and nf_nat_core.h
The l3proto name is gone, its header file is the last trace.
While at it, also remove nf_nat_core.h, its very small and all users
include nf_nat.h too.

before:
   text    data     bss     dec     hex filename
  22948    1612    4136   28696    7018 nf_nat.ko

after removal of l3proto register/unregister functions:
   text	   data	    bss	    dec	    hex	filename
  22196	   1516	   4136	  27848	   6cc8 nf_nat.ko

checkpatch complains about overly long lines, but line breaks
do not make things more readable and the line length gets smaller
here, not larger.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:54:08 +01:00
Florian Westphal
d6c4c8ffb5 netfilter: nat: remove l3proto struct
All l3proto function pointers have been removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:53:57 +01:00
Florian Westphal
dac3fe7259 netfilter: nat: remove csum_recalc hook
We can now use direct calls.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:53:47 +01:00
Florian Westphal
03fe5efc4c netfilter: nat: remove csum_update hook
We can now use direct calls.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:53:35 +01:00
Florian Westphal
2e666b229d netfilter: nat: remove l3 manip_pkt hook
We can now use direct calls.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:53:05 +01:00
Florian Westphal
14cb1a6e29 netfilter: nat: remove nf_nat_l4proto.h
after ipv4/6 nat tracker merge, there are no external callers, so
make last function static and remove the header.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:52:47 +01:00
Florian Westphal
096d09067a netfilter: nat: move nlattr parse and xfrm session decode to core
None of these functions calls any external functions, moving them allows
to avoid both the indirection and a need to export these symbols.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:49:42 +01:00
Florian Westphal
d1aca8ab31 netfilter: nat: merge ipv4 and ipv6 masquerade functionality
Before:
   text	   data	    bss	    dec	    hex	filename
  13916	   1412	   4128	  19456	   4c00	nf_nat.ko
   4510	    968	      4	   5482	   156a	nf_nat_ipv4.ko
   5146	    944	      8	   6098	   17d2	nf_nat_ipv6.ko

After:
   text	   data	    bss	    dec	    hex	filename
  16566	   1576	   4136	  22278	   5706	nf_nat.ko
   3187	    844	      0	   4031	    fbf	nf_nat_ipv4.ko
   3598	    844	      0	   4442	   115a	nf_nat_ipv6.ko

... so no drastic changes in combined size.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:49:24 +01:00
Pablo Neira Ayuso
b8e2040063 netfilter: nft_compat: use .release_ops and remove list of extension
Add .release_ops, that is called in case of error at a later stage in
the expression initialization path, ie. .select_ops() has been already
set up operations and that needs to be undone. This allows us to unwind
.select_ops from the error path, ie. release the dynamic operations for
this extension.

Moreover, allocate one single operation instead of recycling them, this
comes at the cost of consuming a bit more memory per rule, but it
simplifies the infrastructure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-27 10:41:24 +01:00
Eric Dumazet
921f9a0f2e tcp: convert tcp_md5_needed to static_branch API
We prefer static_branch_unlikely() over static_key_false() these days.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26 13:16:03 -08:00
Eric Dumazet
a43e052bea tcp: get rid of __tcp_add_write_queue_tail()
This helper is only used from tcp_add_write_queue_tail(), and does
not make the code more readable.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26 13:16:03 -08:00
Eric Dumazet
6c7b4ee7f9 tcp: get rid of tcp_check_send_head()
This helper is used only once, and its name is no longer relevant.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26 13:16:02 -08:00
Roopa Prabhu
70fb082880 vxlan: add extack support for create and changelink
This patch adds extack coverage in vxlan link
create and changelink paths. Introduces a new helper
vxlan_nl2flags to consolidate flag attribute validation.

thanks to Johannes Berg for some tips to construct the
generic vxlan flag extack strings.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26 08:54:37 -08:00
Jakub Kicinski
b473b0d235 devlink: create a special NDO for getting the devlink instance
Instead of iterating over all devlink ports add a NDO which
will return the devlink instance from the driver.

v2: add the netdev_to_devlink() helper (Michal)
v3: check that devlink has ops (Florian)
v4: hold devlink_mutex (Jiri)

Suggested-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26 08:49:05 -08:00
Jakub Kicinski
f4b6bcc700 net: devlink: turn devlink into a built-in
Being able to build devlink as a module causes growing pains.
First all drivers had to add a meta dependency to make sure
they are not built in when devlink is built as a module.  Now
we are struggling to invoke ethtool compat code reliably.

Make devlink code built-in, users can still not build it at
all but the dynamically loadable module option is removed.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26 08:49:05 -08:00
Peter Oskolkov
d8cf757fbd net: remove unused struct inet_frag_queue.fragments field
Now that all users of struct inet_frag_queue have been converted
to use 'rb_fragments', remove the unused 'fragments' field.

Build with `make allyesconfig` succeeded. ip_defrag selftest passed.

Signed-off-by: Peter Oskolkov <posk@google.com>
Acked-by: Stefan Schmidt <stefan@datenfreihafen.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-26 08:27:05 -08:00
Matthias Kaehlcke
7a0e5b15ca Bluetooth: Add quirk for reading BD_ADDR from fwnode property
Add HCI_QUIRK_USE_BDADDR_PROPERTY to allow controllers to retrieve
the public Bluetooth address from the firmware node property
'local-bd-address'. If quirk is set and the property does not exist
or is invalid the controller is marked as unconfigured.

Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Balakrishna Godavarthi <bgodavar@codeaurora.org>
Tested-by: Balakrishna Godavarthi <bgodavar@codeaurora.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2019-02-26 10:08:26 +01:00
Nazarov Sergey
3da1ed7ac3 net: avoid use IPCB in cipso_v4_error
Extract IP options in cipso_v4_error and use __icmp_send.

Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-25 14:32:35 -08:00
Nazarov Sergey
9ef6b42ad6 net: Add __icmp_send helper.
Add __icmp_send function having ip_options struct parameter

Signed-off-by: Sergey Nazarov <s-nazarov@yandex.ru>
Reviewed-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-25 14:32:35 -08:00
Vlad Buslov
6676d5e416 net: sched: set dedicated tcf_walker flag when tp is empty
Using tcf_walker->stop flag to determine when tcf_walker->fn() was called
at least once is unreliable. Some classifiers set 'stop' flag on error
before calling walker callback, other classifiers used to call it with NULL
filter pointer when empty. In order to prevent further regressions, extend
tcf_walker structure with dedicated 'nonempty' flag. Set this flag in
tcf_walker->fn() implementation that is used to check if classifier has
filters configured.

Fixes: 8b64678e0a ("net: sched: refactor tp insert/delete for concurrent execution")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-25 10:18:17 -08:00
Florian Fainelli
72636db5ba switchdev: Complete removal of switchdev_port_attr_get()
We have no more in tree users of switchdev_port_attr_get() after
d0e698d57a ("Merge branch 'net-Get-rid-of-switchdev_port_attr_get'")
so completely remove the function signature and body.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24 22:31:41 -08:00
Andrew Lunn
75104db0cb dsa: Remove phydev parameter from disable_port call
No current DSA driver makes use of the phydev parameter passed to the
disable_port call. Remove it.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24 22:30:34 -08:00
David S. Miller
e8b47b53a1 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
Here's the main bluetooth-next pull request for the 5.1 kernel.

 - Fixes & improvements to mediatek, hci_qca, btrtl, and btmrvl HCI drivers
 - Fixes to parsing invalid L2CAP config option sizes
 - Locking fix to bt_accept_enqueue()
 - Add support for new Marvel sd8977 chipset
 - Various other smaller fixes & cleanups
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24 22:27:19 -08:00
Vakul Garg
2b794c4098 tls: Return type of non-data records retrieved using MSG_PEEK in recvmsg
The patch enables returning 'type' in msghdr for records that are
retrieved with MSG_PEEK in recvmsg. Further it prevents records peeked
from socket from getting clubbed with any other record of different
type when records are subsequently dequeued from strparser.

For each record, we now retain its type in sk_buff's control buffer
cb[]. Inside control buffer, record's full length and offset are already
stored by strparser in 'struct strp_msg'. We store record type after
'struct strp_msg' inside 'struct tls_msg'. For tls1.2, the type is
stored just after record dequeue. For tls1.3, the type is stored after
record has been decrypted.

Inside process_rx_list(), before processing a non-data record, we check
that we must be able to return back the record type to the user
application. If not, the decrypted records in tls context's rx_list is
left there without consuming any data.

Fixes: 692d7b5d1f ("tls: Fix recvmsg() to be able to peek across multiple records")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24 21:58:38 -08:00
Kefeng Wang
75efc250d2 ipv6: icmp: use percpu allocation
Use percpu allocation for the ipv6.icmp_sk.

Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24 21:57:26 -08:00
David S. Miller
70f3522614 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Three conflicts, one of which, for marvell10g.c is non-trivial and
requires some follow-up from Heiner or someone else.

The issue is that Heiner converted the marvell10g driver over to
use the generic c45 code as much as possible.

However, in 'net' a bug fix appeared which makes sure that a new
local mask (MDIO_AN_10GBT_CTRL_ADV_NBT_MASK) with value 0x01e0
is cleared.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-24 12:06:19 -08:00
Cong Wang
14215108a1 net_sched: initialize net pointer inside tcf_exts_init()
For tcindex filter, it is too late to initialize the
net pointer in tcf_exts_validate(), as tcf_exts_get_net()
requires a non-NULL net pointer. We can just move its
initialization into tcf_exts_init(), which just requires
an additional parameter.

This makes the code in tcindex_alloc_perfect_hash()
prettier.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-22 15:26:51 -08:00
David S. Miller
1a25660856 wireless-drivers-next patches for 5.1
Most likely the last set of patches for 5.1. WPA3 support to ath10k
 and qtnfmac. FTM support to iwlwifi and ath10k. And of course other
 new features and bugfixes.
 
 wireless-drivers was merged due to dependency in mt76.
 
 Major changes:
 
 iwlwifi
 
 * HE radiotap
 
 * FTM (Fine Timing Measurement) initiator and responder implementation
 
 * bump supported firmware API to 46
 
 * VHT extended NSS support
 
 * new PCI IDs for 9260 and 22000 series
 
 ath10k
 
 * change QMI interface to support the new (and backwards incompatible)
   interface from HL3.1 and used in recent HL2.0 branch firmware
   releases
 
 * support WPA3 with WCN3990
 
 * support for mac80211 airtime fairness based on transmit rate
   estimation, the firmware needs to support WMI_SERVICE_PEER_STATS to
   enable this
 
 * report transmit airtime to mac80211 with firmwares having
   WMI_SERVICE_REPORT_AIRTIME feature, this to have more accurate
   airtime fairness based on real transmit time (instead of just
   estimated from transmit rate)
 
 * support Fine Timing Measurement (FTM) responder role
 
 * add dynamic VLAN support with firmware having WMI_SERVICE_PER_PACKET_SW_ENCRYPT
 
 * switch to use SPDX license identifiers
 
 ath
 
 * add new country codes for US
 
 brcmfmac
 
 * support monitor frames with the hardware/ucode header
 
 qtnfmac
 
 * enable WPA3 SAE and OWE support
 
 mt76
 
 * beacon support for USB devices (mesh+ad-hoc only)
 
 rtlwifi
 
 * convert to use SPDX license identifiers
 
 libertas_tf
 
 * get the MAC address before registering the device
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQEcBAABAgAGBQJccAnlAAoJEG4XJFUm622bxxMH/2cz2GEuDQXOwfYLnCUwFES8
 vqTdMWnlfqGQvsTyvtSp91KqsL3hE2FB1Cu3n6/vzND3C7afT0V2/QlU7Pmgr8sR
 1gnJnu0NJT5SZfzcTsqwt8rFw9lQZ3HcvEcfNo6T5KUyyY7FGCNGR5H27nvoKOP3
 ea1h7U80loPV40sGTI8jj963wa9LNHUrjHTK4wY+KMWECgZuD/fHeE9YPB9MTdff
 lJ5bWCoQmuJddvTtC+X0tOAsCkxgdbbw6ieKqzep8H4gtUjerZD+V8uUdbtpr7IL
 sWPoaqrrADMv9cZtDVj+pEtNIJ3mpfuHYFKSON9jy/uxygeFTfVzLu43RFreDp8=
 =D15/
 -----END PGP SIGNATURE-----

Merge tag 'wireless-drivers-next-for-davem-2019-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/kvalo/wireless-drivers-next

Kalle Valo says:

====================
wireless-drivers-next patches for 5.1

Most likely the last set of patches for 5.1. WPA3 support to ath10k
and qtnfmac. FTM support to iwlwifi and ath10k. And of course other
new features and bugfixes.

wireless-drivers was merged due to dependency in mt76.

Major changes:

iwlwifi

* HE radiotap

* FTM (Fine Timing Measurement) initiator and responder implementation

* bump supported firmware API to 46

* VHT extended NSS support

* new PCI IDs for 9260 and 22000 series

ath10k

* change QMI interface to support the new (and backwards incompatible)
  interface from HL3.1 and used in recent HL2.0 branch firmware
  releases

* support WPA3 with WCN3990

* support for mac80211 airtime fairness based on transmit rate
  estimation, the firmware needs to support WMI_SERVICE_PEER_STATS to
  enable this

* report transmit airtime to mac80211 with firmwares having
  WMI_SERVICE_REPORT_AIRTIME feature, this to have more accurate
  airtime fairness based on real transmit time (instead of just
  estimated from transmit rate)

* support Fine Timing Measurement (FTM) responder role

* add dynamic VLAN support with firmware having WMI_SERVICE_PER_PACKET_SW_ENCRYPT

* switch to use SPDX license identifiers

ath

* add new country codes for US

brcmfmac

* support monitor frames with the hardware/ucode header

qtnfmac

* enable WPA3 SAE and OWE support

mt76

* beacon support for USB devices (mesh+ad-hoc only)

rtlwifi

* convert to use SPDX license identifiers

libertas_tf

* get the MAC address before registering the device
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-22 12:56:24 -08:00
Johannes Berg
b7b14ec1eb Merge remote-tracking branch 'net-next/master' into mac80211-next
Merge net-next to resolve a conflict and to get the mac80211
rhashtable fixes so further patches can be applied on top.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-22 13:48:13 +01:00
Johannes Berg
55c1fdf0d6 cfg80211: allow sending vendor events unicast
Sometimes, we may want to transport higher bandwidth data
through vendor events, and in that case sending it multicast
is a bad idea. Allow vendor events to be unicast.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-22 13:46:43 +01:00
Sara Sharon
fafd2bce5a mac80211: notify driver on subsequent CSA beacons
Some drivers may want to track further the CSA beacons, for example
to compensate for buggy APs that change the beacon count or quiet
mode during CSA flow.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-22 13:45:51 +01:00
Johannes Berg
2ff5e52e78 radiotap: add 0-length PSDU "not captured" type
This type was defined in radiotap but we didn't add it to the
header file, add it now.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-22 13:45:32 +01:00
Sara Sharon
b9cc81d827 mac80211: abort CSA if beacon does not include CSA IEs
In case we receive a beacon without CSA IE while we are in
the middle of channel switch - abort the operation.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-22 13:43:53 +01:00
Sara Sharon
ee145775c1 mac80211: support max channel switch time element
2018 REVmd of the spec introduces the max channel switch time
element which is optionally included in beacons/probes when there
is a channel switch / extended channel switch element.
The value represents the maximum delay between the time the AP
transmitted the last beacon in current channel and the expected
time of the first beacon in the new channel, in TU.

Parse the value and pass it to the driver.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-22 13:42:54 +01:00
Jouni Malinen
4d9ec73d2b cfg80211: Report Association Request frame IEs in association events
This extends the NL80211_CMD_ASSOCIATE event case to report
NL80211_ATTR_REQ_IE similarly to what is already done with the
NL80211_CMD_CONNECT events if the driver provides this information. In
practice, this adds (Re)Association Request frame information element
reporting to mac80211 drivers for the cases where user space SME is
used.

This provides more information for user space to figure out which
capabilities were negotiated for the association. For example, this can
be used to determine whether HT, VHT, or HE is used.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-22 13:35:09 +01:00
Arnd Bergmann
6321aa1975 phonet: fix building with clang
clang warns about overflowing the data[] member in the struct pnpipehdr:

net/phonet/pep.c:295:8: warning: array index 4 is past the end of the array (which contains 1 element) [-Warray-bounds]
                        if (hdr->data[4] == PEP_IND_READY)
                            ^         ~
include/net/phonet/pep.h:66:3: note: array 'data' declared here
                u8              data[1];

Using a flexible array member at the end of the struct avoids the
warning, but since we cannot have a flexible array member inside
of the union, each index now has to be moved back by one, which
makes it a little uglier.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Rémi Denis-Courmont <remi@remlab.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-21 16:23:56 -08:00
David S. Miller
b35560e485 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2019-02-21

1) Don't do TX bytes accounting for the esp trailer when sending
   from a request socket as this will result in an out of bounds
   memory write. From Martin Willi.

2) Destroy xfrm_state synchronously on net exit path to
   avoid nested gc flush callbacks that may trigger a
   warning in xfrm6_tunnel_net_exit(). From Cong Wang.

3) Do an unconditionally clone in pfkey_broadcast_one()
   to avoid a race when freeing the skb.
   From Sean Tranchetti.

4) Fix inbound traffic via XFRM interfaces across network
   namespaces. We did the lookup for interfaces and policies
   in the wrong namespace. From Tobias Brunner.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-21 16:08:52 -08:00
Florian Fainelli
010c8f01aa net: Get rid of switchdev_port_attr_get()
With the bridge no longer calling switchdev_port_attr_get() to obtain
the supported bridge port flags from a driver but instead trying to set
the bridge port flags directly and relying on driver to reject
unsupported configurations, we can effectively get rid of
switchdev_port_attr_get() entirely since this was the only place where
it was called.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-21 14:55:14 -08:00
Florian Fainelli
cc0c207a5d net: Remove SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS_SUPPORT
Now that we have converted the bridge code and the drivers to check for
bridge port(s) flags at the time we try to set them, there is no need
for a get() -> set() sequence anymore and
SWITCHDEV_ATTR_ID_PORT_BRIDGE_FLAGS_SUPPORT therefore becomes unused.

Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-21 14:55:14 -08:00
Florian Fainelli
746dc184ba net: switchdev: Add PORT_PRE_BRIDGE_FLAGS
In preparation for removing switchdev_port_attr_get(), introduce
PORT_PRE_BRIDGE_FLAGS which will be called through
switchdev_port_attr_set(), in the caller's context (possibly atomic) and
which must be checked by the switchdev driver in order to return whether
the operation is supported or not.

This is entirely analoguous to how the BRIDGE_FLAGS_SUPPORT works,
except it goes through a set() instead of get().

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-21 14:55:13 -08:00
Russell King
57652796aa net: dsa: add support for bridge flags
The Linux bridge implementation allows various properties of the bridge
to be controlled, such as flooding unknown unicast and multicast frames.
This patch adds the necessary DSA infrastructure to allow the Linux
bridge support to control these properties for DSA switches.

Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk>
[florian: Add missing dp and ds variables declaration to fix build]
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-21 14:53:07 -08:00
Hans Wippel
f3d74b2245 net/smc: add smcd support to the pnet table
Currently, users can only set pnetids for netdevs and ib devices in the
pnet table. This patch adds support for smcd devices to the pnet table.

Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-21 10:34:37 -08:00
Vakul Garg
4509de1468 net/tls: Move protocol constants from cipher context to tls context
Each tls context maintains two cipher contexts (one each for tx and rx
directions). For each tls session, the constants such as protocol
version, ciphersuite, iv size, associated data size etc are same for
both the directions and need to be stored only once per tls context.
Hence these are moved from 'struct cipher_context' to 'struct
tls_prot_info' and stored only once in 'struct tls_context'.

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-19 10:40:36 -08:00
David S. Miller
8bbed40f10 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for you net-next
tree:

1) Missing NFTA_RULE_POSITION_ID netlink attribute validation,
   from Phil Sutter.

2) Restrict matching on tunnel metadata to rx/tx path, from wenxu.

3) Avoid indirect calls for IPV6=y, from Florian Westphal.

4) Add two indirections to prepare merger of IPV4 and IPV6 nat
   modules, from Florian Westphal.

5) Broken indentation in ctnetlink, from Colin Ian King.

6) Patches to use struct_size() from netfilter and IPVS,
   from Gustavo A. R. Silva.

7) Display kernel splat only once in case of racing to confirm
   conntrack from bridge plus nfqueue setups, from Chieh-Min Wang.

8) Skip checksum validation for layer 4 protocols that don't need it,
   patch from Alin Nastac.

9) Sparse warning due to symbol that should be static in CLUSTERIP,
   from Wei Yongjun.

10) Add new toggle to disable SDP payload translation when media
    endpoint is reachable though the same interface as the signalling
    peer, from Alin Nastac.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-18 11:38:30 -08:00
Jakub Kicinski
4eceba1720 ethtool: add compat for flash update
If driver does not support ethtool flash update operation
call into devlink.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17 15:27:38 -08:00
Jakub Kicinski
76726ccb7f devlink: add flash update command
Add devlink flash update command. Advanced NICs have firmware
stored in flash and often cryptographically secured. Updating
that flash is handled by management firmware. Ethtool has a
flash update command which served us well, however, it has two
shortcomings:
 - it takes rtnl_lock unnecessarily - really flash update has
   nothing to do with networking, so using a networking device
   as a handle is suboptimal, which leads us to the second one:
 - it requires a functioning netdev - in case device enters an
   error state and can't spawn a netdev (e.g. communication
   with the device fails) there is no netdev to use as a handle
   for flashing.

Devlink already has the ability to report the firmware versions,
now with the ability to update the firmware/flash we will be
able to recover devices in bad state.

To enable updates of sub-components of the FW allow passing
component name.  This name should correspond to one of the
versions reported in devlink info.

v1: - replace target id with component name (Jiri).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-17 15:27:38 -08:00
David S. Miller
885e631959 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-02-16

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) numerous libbpf API improvements, from Andrii, Andrey, Yonghong.

2) test all bpf progs in alu32 mode, from Jiong.

3) skb->sk access and bpf_sk_fullsock(), bpf_tcp_sock() helpers, from Martin.

4) support for IP encap in lwt bpf progs, from Peter.

5) remove XDP_QUERY_XSK_UMEM dead code, from Jan.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-16 22:56:34 -08:00
David S. Miller
3313da8188 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
The netfilter conflicts were rather simple overlapping
changes.

However, the cls_tcindex.c stuff was a bit more complex.

On the 'net' side, Cong is fixing several races and memory
leaks.  Whilst on the 'net-next' side we have Vlad adding
the rtnl-ness support.

What I've decided to do, in order to resolve this, is revert the
conversion over to using a workqueue that Cong did, bringing us back
to pure RCU.  I did it this way because I believe that either Cong's
races don't apply with have Vlad did things, or Cong will have to
implement the race fix slightly differently.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-15 12:38:38 -08:00
John Hurley
9f9dc493f7 flow_offload: fix block stats
With the introduction of flow_stats_update(), drivers now update the stats
fields of the passed tc_cls_flower_offload struct, rather than call
tcf_exts_stats_update() directly to update the stats of offloaded TC
flower rules. However, if multiple qdiscs are registered to a TC shared
block and a flower rule is applied, then, when getting stats for the rule,
multiple callbacks may be made.

Take this into consideration by modifying flow_stats_update to gather the
stats from all callbacks. Currently, the values in tc_cls_flower_offload
only account for the last stats callback in the list.

Fixes: 3b1903ef97 ("flow_offload: add statistics retrieval infrastructure and use it")
Signed-off-by: John Hurley <john.hurley@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13 21:07:47 -08:00
Eric Dumazet
5bf325a532 net: fix possible overflow in __sk_mem_raise_allocated()
With many active TCP sockets, fat TCP sockets could fool
__sk_mem_raise_allocated() thanks to an overflow.

They would increase their share of the memory, instead
of decreasing it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-13 21:05:18 -08:00
Peter Oskolkov
9b0a6a9dba ipv6_stub: add ipv6_route_input stub/proxy.
Proxy ip6_route_input via ipv6_stub, for later use by lwt bpf ip encap
(see the next patch in the patchset).

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-13 18:27:55 -08:00
Peter Oskolkov
52f278774e bpf: implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap
Implement BPF_LWT_ENCAP_IP mode in bpf_lwt_push_encap BPF helper.
It enables BPF programs (specifically, BPF_PROG_TYPE_LWT_IN and
BPF_PROG_TYPE_LWT_XMIT prog types) to add IP encapsulation headers
to packets (e.g. IP/GRE, GUE, IPIP).

This is useful when thousands of different short-lived flows should be
encapped, each with different and dynamically determined destination.
Although lwtunnels can be used in some of these scenarios, the ability
to dynamically generate encap headers adds more flexibility, e.g.
when routing depends on the state of the host (reflected in global bpf
maps).

v7 changes:
 - added a call skb_clear_hash();
 - removed calls to skb_set_transport_header();
 - refuse to encap GSO-enabled packets.

v8 changes:
 - fix build errors when LWT is not enabled.

Note: the next patch in the patchset with deal with GSO-enabled packets,
which are currently rejected at encapping attempt.

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-13 18:27:55 -08:00
Alin Nastac
7fc3822536 netfilter: reject: skip csum verification for protocols that don't support it
Some protocols have other means to verify the payload integrity
(AH, ESP, SCTP) while others are incompatible with nf_ip(6)_checksum
implementation because checksum is either optional or might be
partial (UDPLITE, DCCP, GRE). Because nf_ip(6)_checksum was used
to validate the packets, ip(6)tables REJECT rules were not capable
to generate ICMP(v6) errors for the protocols mentioned above.

This commit also fixes the incorrect pseudo-header protocol used
for IPv4 packets that carry other transport protocols than TCP or
UDP (pseudo-header used protocol 0 iso the proper value).

Signed-off-by: Alin Nastac <alin.nastac@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-13 10:03:53 +01:00
Vlad Buslov
dfcd2a2b22 net: sched: add flags to Qdisc class ops struct
Extend Qdisc_class_ops with flags. Create enum to hold possible class ops
flag values. Add first class ops flags value QDISC_CLASS_OPS_DOIT_UNLOCKED
to indicate that class ops functions can be called without taking rtnl
lock.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 13:41:33 -05:00
Vlad Buslov
12db03b65c net: sched: extend proto ops to support unlocked classifiers
Add 'rtnl_held' flag to tcf proto change, delete, destroy, dump, walk
functions to track rtnl lock status. Extend users of these function in cls
API to propagate rtnl lock status to them. This allows classifiers to
obtain rtnl lock when necessary and to pass rtnl lock status to extensions
and driver offload callbacks.

Add flags field to tcf proto ops. Add flag value to indicate that
classifier doesn't require rtnl lock.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 13:41:33 -05:00
Vlad Buslov
7d5509fa0d net: sched: extend proto ops with 'put' callback
Add optional tp->ops->put() API to be implemented for filter reference
counting. This new function is called by cls API to release filter
reference for filters returned by tp->ops->change() or tp->ops->get()
functions. Implement tfilter_put() helper to call tp->ops->put() only for
classifiers that implement it.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 13:41:33 -05:00
Vlad Buslov
ec6743a109 net: sched: track rtnl lock status when validating extensions
Actions API is already updated to not rely on rtnl lock for
synchronization. However, it need to be provided with rtnl status when
called from classifiers API in order to be able to correctly release the
lock when loading kernel module.

Extend extension validation function with 'rtnl_held' flag which is passed
to actions API. Add new 'rtnl_held' parameter to tcf_exts_validate() in cls
API. No classifier is currently updated to support unlocked execution, so
pass hardcoded 'true' flag parameter value.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 13:41:33 -05:00
Vlad Buslov
726d061286 net: sched: prevent insertion of new classifiers during chain flush
Extend tcf_chain with 'flushing' flag. Use the flag to prevent insertion of
new classifier instances when chain flushing is in progress in order to
prevent resource leak when tcf_proto is created by unlocked users
concurrently.

Return EAGAIN error from tcf_chain_tp_insert_unique() to restart
tc_new_tfilter() and lookup the chain/proto again.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 13:41:33 -05:00
Vlad Buslov
8b64678e0a net: sched: refactor tp insert/delete for concurrent execution
Implement unique insertion function to atomically attach tcf_proto to chain
after verifying that no other tcf proto with specified priority exists.
Implement delete function that verifies that tp is actually empty before
deleting it. Use these functions to refactor cls API to account for
concurrent tp and rule update instead of relying on rtnl lock. Add new
'deleting' flag to tcf proto. Use it to restart search when iterating over
tp's on chain to prevent accessing potentially inval tp->next pointer.

Extend tcf proto with spinlock that is intended to be used to protect its
data from concurrent modification instead of relying on rtnl mutex. Use it
to protect 'deleting' flag. Add lockdep macros to validate that lock is
held when accessing protected fields.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 13:41:33 -05:00
Vlad Buslov
fe2923afc1 net: sched: traverse classifiers in chain with tcf_get_next_proto()
All users of chain->filters_chain rely on rtnl lock and assume that no new
classifier instances are added when traversing the list. Use
tcf_get_next_proto() to traverse filters list without relying on rtnl
mutex. This function iterates over classifiers by taking reference to
current iterator classifier only and doesn't assume external
synchronization of filters list.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 13:41:33 -05:00
Vlad Buslov
4dbfa76644 net: sched: introduce reference counting for tcf_proto
In order to remove dependency on rtnl lock and allow concurrent tcf_proto
modification, extend tcf_proto with reference counter. Implement helper
get/put functions for tcf proto and use them to modify cls API to always
take reference to tcf_proto while using it. Only release reference to
parent chain after releasing last reference to tp.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 13:41:32 -05:00
Vlad Buslov
ed76f5edcc net: sched: protect filter_chain list with filter_chain_lock mutex
Extend tcf_chain with new filter_chain_lock mutex. Always lock the chain
when accessing filter_chain list, instead of relying on rtnl lock.
Dereference filter_chain with tcf_chain_dereference() lockdep macro to
verify that all users of chain_list have the lock taken.

Rearrange tp insert/remove code in tc_new_tfilter/tc_del_tfilter to execute
all necessary code while holding chain lock in order to prevent
invalidation of chain_info structure by potential concurrent change. This
also serializes calls to tcf_chain0_head_change(), which allows head change
callbacks to rely on filter_chain_lock for synchronization instead of rtnl
mutex.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 13:41:32 -05:00
Vlad Buslov
bbf73830cd net: sched: traverse chains in block with tcf_get_next_chain()
All users of block->chain_list rely on rtnl lock and assume that no new
chains are added when traversing the list. Use tcf_get_next_chain() to
traverse chain list without relying on rtnl mutex. This function iterates
over chains by taking reference to current iterator chain only and doesn't
assume external synchronization of chain list.

Don't take reference to all chains in block when flushing and use
tcf_get_next_chain() to safely iterate over chain list instead. Remove
tcf_block_put_all_chains() that is no longer used.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 13:41:32 -05:00
Vlad Buslov
c266f64dbf net: sched: protect block state with mutex
Currently, tcf_block doesn't use any synchronization mechanisms to protect
critical sections that manage lifetime of its chains. block->chain_list and
multiple variables in tcf_chain that control its lifetime assume external
synchronization provided by global rtnl lock. Converting chain reference
counting to atomic reference counters is not possible because cls API uses
multiple counters and flags to control chain lifetime, so all of them must
be synchronized in chain get/put code.

Use single per-block lock to protect block data and manage lifetime of all
chains on the block. Always take block->lock when accessing chain_list.
Chain get and put modify chain lifetime-management data and parent block's
chain_list, so take the lock in these functions. Verify block->lock state
with assertions in functions that expect to be called with the lock taken
and are called from multiple places. Take block->lock when accessing
filter_chain_list.

In order to allow parallel update of rules on single block, move all calls
to classifiers outside of critical sections protected by new block->lock.
Rearrange chain get and put functions code to only access protected chain
data while holding block lock:
- Rearrange code to only access chain reference counter and chain action
  reference counter while holding block lock.
- Extract code that requires block->lock from tcf_chain_destroy() into
  standalone tcf_chain_destroy() function that is called by
  __tcf_chain_put() in same critical section that changes chain reference
  counters.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 13:41:32 -05:00
Vasundhara Volam
da203dfa89 Revert "devlink: Add a generic wake_on_lan port parameter"
This reverts commit b639583f9e.

As per discussion with Jakub Kicinski and Michal Kubecek,
this will be better addressed by soon-too-come ethtool netlink
API with additional indication that given configuration request
is supposed to be persisted.

Also, remove the parameter support from bnxt_en driver.

Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Michal Kubecek <mkubecek@suse.cz>
Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 12:13:22 -05:00
Eli Britstein
6663cf821c flow_offload: Fix flow action infrastructure
Implementation of macro "flow_action_for_each" introduced in
commit e3ab786b42 ("flow_offload: add flow action infrastructure")
and used in commit 7386788175 ("drivers: net: use flow action
infrastructure") iterated the first item twice and did not reach the
last one. Fix it.

Fixes: e3ab786b42 ("flow_offload: add flow action infrastructure")
Fixes: 7386788175 ("drivers: net: use flow action infrastructure")
Signed-off-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11 21:30:14 -08:00
Jakub Kicinski
14fd1901e7 devlink: add a generic board.manufacture version name
At Jiri's suggestion add a generic "board.manufacture"
version identifier.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11 20:39:56 -08:00
Johannes Berg
5d4071abd9 cfg80211: fix and clean up cfg80211_gen_new_bssid()
Fix cfg80211_gen_new_bssid() to not rely on u64 modulo arithmetic,
which isn't needed since we really just want to mask there. Also,
clean it up to calculate the mask only once and use GENMASK_ULL()
instead of open-coding the mask calculation.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-11 16:54:08 +01:00
Eli Cohen
eddd2cf195 net: Change TCA_ACT_* to TCA_ID_* to match that of TCA_ID_POLICE
Modify the kernel users of the TCA_ACT_* macros to use TCA_ID_*. For
example, use TCA_ID_GACT instead of TCA_ACT_GACT. This will align with
TCA_ID_POLICE and also differentiates these identifier, used in struct
tc_action_ops type field, from other macros starting with TCA_ACT_.

To make things clearer, we name the enum defining the TCA_ID_*
identifiers and also change the "type" field of struct tc_action to
id.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-10 09:28:43 -08:00
Lorenzo Bianconi
c09551c6ff net: ipv4: use a dedicated counter for icmp_v4 redirect packets
According to the algorithm described in the comment block at the
beginning of ip_rt_send_redirect, the host should try to send
'ip_rt_redirect_number' ICMP redirect packets with an exponential
backoff and then stop sending them at all assuming that the destination
ignores redirects.
If the device has previously sent some ICMP error packets that are
rate-limited (e.g TTL expired) and continues to receive traffic,
the redirect packets will never be transmitted. This happens since
peer->rate_tokens will be typically greater than 'ip_rt_redirect_number'
and so it will never be reset even if the redirect silence timeout
(ip_rt_redirect_silence) has elapsed without receiving any packet
requiring redirects.

Fix it by using a dedicated counter for the number of ICMP redirect
packets that has been sent by the host

I have not been able to identify a given commit that introduced the
issue since ip_rt_send_redirect implements the same rate-limiting
algorithm from commit 1da177e4c3 ("Linux-2.6.12-rc2")

Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 21:50:15 -08:00
Jiri Pirko
7c62cfb8c5 devlink: publish params only after driver init is done
Currently, user can do dump or get of param values right after the
devlink params are registered. However the driver may not be initialized
which is an issue. The same problem happens during notification
upon param registration. Allow driver to publish devlink params
whenever it is ready to handle get() ops. Note that this cannot
be resolved by init reordering, as the "driverinit" params have
to be available before the driver is initialized (it needs the param
values there).

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 15:02:49 -08:00
David S. Miller
a655fe9f19 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
An ipvlan bug fix in 'net' conflicted with the abstraction away
of the IPV6 specific support in 'net-next'.

Similarly, a bug fix for mlx5 in 'net' conflicted with the flow
action conversion in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 15:00:17 -08:00
Johannes Berg
851ae31d34 cfg80211: add missing kernel-doc for multi-BSSID fields
Add the missing kernel-doc for the new multi-BSSID fields
in struct cfg80211_bss.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-08 14:12:25 +01:00
Sara Sharon
caf56338c2 mac80211: indicate support for multiple BSSID
Set multi-bssid support flags according to driver support.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-08 13:56:37 +01:00
Sara Sharon
78ac51f815 mac80211: support multi-bssid
Add support for multi-bssid.

This includes:
- Parsing multi-bssid element
- Overriding DTIM values
- Taking into account in various places the inner BSSID instead of
  transmitter BSSID
- Save aside some multi-bssid properties needed by drivers

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-08 13:56:37 +01:00
Sara Sharon
0cd01efb03 cfg80211: save multi-bssid properties
When the new IEs are generated, the multiple BSSID elements
are not saved. Save aside properties that are needed later
for PS.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-08 13:51:50 +01:00
Sara Sharon
7ece9c372b cfg80211: make BSSID generation function inline
This will enable reuse by mac80211.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-08 13:51:50 +01:00
Sara Sharon
213ed579d3 cfg80211: parse multi-bssid only if HW supports it
Parsing and exposing nontransmitted APs is problematic
when underlying HW doesn't support it. Do it only if
driver indicated support. Allow HE restriction as well,
since the HE spec defined the exact manner that Multiple
BSSID set should behave. APs that not support the HE
spec will have less predictable Multiple BSSID set
support/behavior

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-08 13:51:50 +01:00
Sara Sharon
7011ba583f cfg80211: Move Multiple BSS info to struct cfg80211_bss to be visible
Previously the transmitted BSS and the non-trasmitted BSS list were
defined in struct cfg80211_internal_bss. Move them to struct cfg80211_bss
since mac80211 needs this info.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-08 13:51:50 +01:00
Johannes Berg
49a68e0d88 cfg80211: add various struct element finding helpers
We currently have a number of helpers to find elements that just
return a u8 *, change those to return a struct element and add
inlines to deal with the u8 * compatibility.

Note that the match behaviour is changed to start the natch at
the data, so conversion from _ie_match to _elem_match need to
be done carefully.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-08 13:51:50 +01:00
Eran Ben Elisha
c8e1da0bf9 devlink: Add health report functionality
Upon error discover, every driver can report it to the devlink health
mechanism via devlink_health_report function, using the appropriate
reporter registered to it. Driver can pass error specific context which
will be delivered to it as part of the dump / recovery callbacks.

Once an error is reported, devlink health will do the following actions:
* A log is being send to the kernel trace events buffer
* Health status and statistics are being updated for the reporter instance
* Object dump is being taken and stored at the reporter instance (as long
  as there is no other dump which is already stored)
* Auto recovery attempt is being done. Depends on:
  - Auto Recovery configuration
  - Grace period vs. Time since last recover

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 10:34:28 -08:00
Eran Ben Elisha
a0bdcc59d1 devlink: Add health reporter create/destroy functionality
Devlink health reporter is an instance for reporting, diagnosing and
recovering from run time errors discovered by the reporters.
Define it's data structure and supported operations.
In addition, expose devlink API to create and destroy a reporter.
Each devlink instance will hold it's own reporters list.

As part of the allocation, driver shall provide a set of callbacks which
will be used by devlink in order to handle health reports and user
commands related to this reporter. In addition, driver is entitled to
provide some priv pointer, which can be fetched from the reporter by
devlink_health_reporter_priv function.

For each reporter, devlink will hold a metadata of statistics,
dump msg and status.

For passing dumps and diagnose data to the user-space, it will use devlink
fmsg API.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 10:34:28 -08:00
Eran Ben Elisha
1db64e8733 devlink: Add devlink formatted message (fmsg) API
Devlink fmsg is a mechanism to pass descriptors between drivers and
devlink, in json-like format. The API allows the driver to add nested
attributes such as object, object pair and value array, in addition to
attributes such as name and value.

Driver can use this API to fill the fmsg context in a format which will be
translated by the devlink to the netlink message later.
There is no memory allocation in advance (other than the initial list
head), and it dynamically allocates messages descriptors and add them to
the list on the fly.

When it needs to send the data using SKBs to the netlink layer, it
fragments the data between different SKBs. In order to do this
fragmentation, it uses virtual nests attributes, to avoid actual
nesting use which cannot be divided between different SKBs.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 10:34:28 -08:00
Kalle Valo
3479f74ee4 Third batch of iwlwifi patches intended for v5.1
* Work on the new debugging infrastructure continues;
 * HE radiotap;
 * Support for new FW version 44;
 * A couple of new FW API changes;
 * A bunch of fixes for static analyzer reported issues;
 * General bugfixes;
 * Other cleanups and small fixes;
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEF3LNfgb2BPWm68smoUecoho8xfoFAlxYFbcACgkQoUecoho8
 xfoFhg//eJLoosJx5BIX7vJ0b4uUJ7gjTj67qMja7RBVUXxMfYcn7Yrztlenm+H7
 yIsZe7I0Jap88WH3HKYU/G6ASFiyXZo6TrUt4rzY3Xuy3SIgSG5gmnt4XcQRSSBd
 mYp+hjmz5PJPx2lzRGccQ167oOQZ/DHLn7JwuuLmgtLfz4RMHpUtitOQf9WGlKx4
 nure2JLFZ4yV+lng6XBPma/lelgi9q8L8bu7izOhJkh0saSDlSWQUcn5kWoWG5av
 syQsrxb3FH6wfZijZXW4USolpThgCcXxTzd0IFPPXIyx/z6PEZK0yBNvGeiiPxm9
 bWT1fJGrSea+82qY2vTVE1NLKd46S8jATSxSawqwGFQRv7EeLW1IdDzpFlbqU+Rn
 1dUaOHzIUtK6MdCzNco4rcYZlvmFnMlqROQexnCp/sNp1J+eOfL1aqog3wRZc7sN
 IAtZJbwMcLC1YHunMhFOCs1+imXVrhBBN8tEJGoQHtBvIQpTumX6VFGeGBmXiyVX
 FLJLF0aeYg970OMcwqNuMzNIcSk13uhw3F/M/jIIN8IyK3L3bX6kiaUevUMgYK3T
 +3zPj41hJWGFelLa7FNryqt1S4Vbi12EubuK1gv1QcpvXro9OsUEXzeJw7PWEPj0
 r3JBiIIkfpFul88F6a3163dNQoo03k4Nd69dNww/RdD7+GGuZWU=
 =ckmd
 -----END PGP SIGNATURE-----

Merge tag 'iwlwifi-next-for-kalle-2019-02-04' of git://git.kernel.org/pub/scm/linux/kernel/git/iwlwifi/iwlwifi-next

Third batch of iwlwifi patches intended for v5.1

* Work on the new debugging infrastructure continues;
* HE radiotap;
* Support for new FW version 44;
* A couple of new FW API changes;
* A bunch of fixes for static analyzer reported issues;
* General bugfixes;
* Other cleanups and small fixes;
2019-02-07 11:34:26 +02:00
Florian Fainelli
bccb30254a net: Get rid of SWITCHDEV_ATTR_ID_PORT_PARENT_ID
Now that we have a dedicated NDO for getting a port's parent ID, get rid
of SWITCHDEV_ATTR_ID_PORT_PARENT_ID and convert all callers to use the
NDO exclusively. This is a preliminary change to getting rid of
switchdev_ops eventually.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 14:17:03 -08:00
Pablo Neira Ayuso
8bec2833fb flow_offload: add wake-up-on-lan and queue to flow_action
These actions need to be added to support the ethtool_rx_flow interface.
The queue action includes a field to specify the RSS context, that is
set via FLOW_RSS flow type flag and the rss_context field in struct
ethtool_rxnfc, plus the corresponding queue index. FLOW_RSS implies that
rss_context is non-zero, therefore, queue.ctx == 0 means that FLOW_RSS
was not set. Also add a field to store the vf index which is stored in
the ethtool_rxnfc ring_cookie field.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 10:38:26 -08:00
Pablo Neira Ayuso
2cd173e6d5 cls_flower: don't expose TC actions to drivers anymore
Now that drivers have been converted to use the flow action
infrastructure, remove this field from the tc_cls_flower_offload
structure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 10:38:26 -08:00
Pablo Neira Ayuso
3b1903ef97 flow_offload: add statistics retrieval infrastructure and use it
This patch provides the flow_stats structure that acts as container for
tc_cls_flower_offload, then we can use to restore the statistics on the
existing TC actions. Hence, tcf_exts_stats_update() is not used from
drivers anymore.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 10:38:25 -08:00
Pablo Neira Ayuso
3a7b68617d cls_api: add translator to flow_action representation
This patch implements a new function to translate from native TC action
to the new flow_action representation. Moreover, this patch also updates
cls_flower to use this new function.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 10:38:25 -08:00
Pablo Neira Ayuso
e3ab786b42 flow_offload: add flow action infrastructure
This new infrastructure defines the nic actions that you can perform
from existing network drivers. This infrastructure allows us to avoid a
direct dependency with the native software TC action representation.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 10:38:25 -08:00
Pablo Neira Ayuso
8f2566225a flow_offload: add flow_rule and flow_match structures and use them
This patch wraps the dissector key and mask - that flower uses to
represent the matching side - around the flow_match structure.

To avoid a follow up patch that would edit the same LoCs in the drivers,
this patch also wraps this new flow match structure around the flow rule
object. This new structure will also contain the flow actions in follow
up patches.

This introduces two new interfaces:

	bool flow_rule_match_key(rule, dissector_id)

that returns true if a given matching key is set on, and:

	flow_rule_match_XYZ(rule, &match);

To fetch the matching side XYZ into the match container structure, to
retrieve the key and the mask with one single call.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 10:38:25 -08:00
Cong Wang
f75a2804da xfrm: destroy xfrm_state synchronously on net exit path
xfrm_state_put() moves struct xfrm_state to the GC list
and schedules the GC work to clean it up. On net exit call
path, xfrm_state_flush() is called to clean up and
xfrm_flush_gc() is called to wait for the GC work to complete
before exit.

However, this doesn't work because one of the ->destructor(),
ipcomp_destroy(), schedules the same GC work again inside
the GC work. It is hard to wait for such a nested async
callback. This is also why syzbot still reports the following
warning:

 WARNING: CPU: 1 PID: 33 at net/ipv6/xfrm6_tunnel.c:351 xfrm6_tunnel_net_exit+0x2cb/0x500 net/ipv6/xfrm6_tunnel.c:351
 ...
  ops_exit_list.isra.0+0xb0/0x160 net/core/net_namespace.c:153
  cleanup_net+0x51d/0xb10 net/core/net_namespace.c:551
  process_one_work+0xd0c/0x1ce0 kernel/workqueue.c:2153
  worker_thread+0x143/0x14a0 kernel/workqueue.c:2296
  kthread+0x357/0x430 kernel/kthread.c:246
  ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352

In fact, it is perfectly fine to bypass GC and destroy xfrm_state
synchronously on net exit call path, because it is in process context
and doesn't need a work struct to do any blocking work.

This patch introduces xfrm_state_put_sync() which simply bypasses
GC, and lets its callers to decide whether to use this synchronous
version. On net exit path, xfrm_state_fini() and
xfrm6_tunnel_net_exit() use it. And, as ipcomp_destroy() itself is
blocking, it can use xfrm_state_put_sync() directly too.

Also rename xfrm_state_gc_destroy() to ___xfrm_state_destroy() to
reflect this change.

Fixes: b48c05ab5d ("xfrm: Fix warning in xfrm6_tunnel_net_exit.")
Reported-and-tested-by: syzbot+e9aebef558e3ed673934@syzkaller.appspotmail.com
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2019-02-05 06:29:20 +01:00
Pablo Neira Ayuso
f6ac858589 netfilter: nf_tables: unbind set in rule from commit path
Anonymous sets that are bound to rules from the same transaction trigger
a kernel splat from the abort path due to double set list removal and
double free.

This patch updates the logic to search for the transaction that is
responsible for creating the set and disable the set list removal and
release, given the rule is now responsible for this. Lookup is reverse
since the transaction that adds the set is likely to be at the tail of
the list.

Moreover, this patch adds the unbind step to deliver the event from the
commit path.  This should not be done from the worker thread, since we
have no guarantees of in-order delivery to the listener.

This patch removes the assumption that both activate and deactivate
callbacks need to be provided.

Fixes: cd5125d8f5 ("netfilter: nf_tables: split set destruction in deactivate and destroy phase")
Reported-by: Mikhail Morfikov <mmorfikov@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04 17:29:17 +01:00
Johannes Berg
74cf15cb69 iwlwifi: mvm: add HE TB PPDU SIG-A BW to radiotap
Expose the trigger-based PPDU SIG-A bandwidth to radiotap in
the newly defined bits thereof.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
2019-02-04 12:28:06 +02:00
Jakub Kicinski
bff5731d43 net: devlink: report cell size of shared buffers
Shared buffer allocation is usually done in cell increments.
Drivers will either round up the allocation or refuse the
configuration if it's not an exact multiple of cell size.
Drivers know exactly the cell size of shared buffer, so help
out users by providing this information in dumps.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 11:25:34 -08:00
Deepa Dinamani
887feae36a socket: Add SO_TIMESTAMP[NS]_NEW
Add SO_TIMESTAMP_NEW and SO_TIMESTAMPNS_NEW variants of
socket timestamp options.
These are the y2038 safe versions of the SO_TIMESTAMP_OLD
and SO_TIMESTAMPNS_OLD for all architectures.

Note that the format of scm_timestamping.ts[0] is not changed
in this patch.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Cc: jejb@parisc-linux.org
Cc: ralf@linux-mips.org
Cc: rth@twiddle.net
Cc: linux-alpha@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 11:17:31 -08:00
Jakub Kicinski
ddb6e99e2d ethtool: add compat for devlink info
If driver did not fill the fw_version field, try to call into
the new devlink get_info op and collect the versions that way.
We assume ethtool was always reporting running versions.

v4:
 - use IS_REACHABLE() to avoid problems with DEVLINK=m (kbuildbot).
v3 (Jiri):
 - do a dump and then parse it instead of special handling;
 - concatenate all versions (well, all that fit :)).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01 15:30:31 -08:00
Jakub Kicinski
785bd550c4 devlink: add generic info version names
Add defines and docs for generic info versions.

v3:
 - add docs;
 - separate patch (Jiri).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01 15:30:30 -08:00
Jakub Kicinski
fc6fae7dd9 devlink: add version reporting to devlink info API
ethtool -i has a few fixed-size fields which can be used to report
firmware version and expansion ROM version. Unfortunately, modern
hardware has more firmware components. There is usually some
datapath microcode, management controller, PXE drivers, and a
CPLD load. Running ethtool -i on modern controllers reveals the
fact that vendors cram multiple values into firmware version field.

Here are some examples from systems I could lay my hands on quickly:

tg3:  "FFV20.2.17 bc 5720-v1.39"
i40e: "6.01 0x800034a4 1.1747.0"
nfp:  "0.0.3.5 0.25 sriov-2.1.16 nic"

Add a new devlink API to allow retrieving multiple versions, and
provide user-readable name for those versions.

While at it break down the versions into three categories:
 - fixed - this is the board/fixed component version, usually vendors
           report information like the board version in the PCI VPD,
           but it will benefit from naming and common API as well;
 - running - this is the running firmware version;
 - stored - this is firmware in the flash, after firmware update
            this value will reflect the flashed version, while the
            running version may only be updated after reboot.

v3:
 - add per-type helpers instead of using the special argument (Jiri).
RFCv2:
 - remove the nesting in attr DEVLINK_ATTR_INFO_VERSIONS (now
   versions are mixed with other info attrs)l
 - have the driver report versions from the same callback as
   other info.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01 15:30:30 -08:00
Jakub Kicinski
f9cf22882c devlink: add device information API
ethtool -i has served us well for a long time, but its showing
its limitations more and more. The device information should
also be reported per device not per-netdev.

Lay foundation for a simple devlink-based way of reading device
info. Add driver name and device serial number as initial pieces
of information exposed via this new API.

v3:
 - rename helpers (Jiri);
 - rename driver name attr (Jiri);
 - remove double spacing in commit message (Jiri).
RFC v2:
 - wrap the skb into an opaque structure (Jiri);
 - allow the serial number of be any length (Jiri & Andrew);
 - add driver name (Jonathan).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01 15:30:30 -08:00
Dave Watson
5b053e121f net: tls: Set async_capable for tls zerocopy only if we see EINPROGRESS
Currently we don't zerocopy if the crypto framework async bit is set.
However some crypto algorithms (such as x86 AESNI) support async,
but in the context of sendmsg, will never run asynchronously.  Instead,
check for actual EINPROGRESS return code before assuming algorithm is
async.

Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01 15:05:07 -08:00
Dave Watson
130b392c6c net: tls: Add tls 1.3 support
TLS 1.3 has minor changes from TLS 1.2 at the record layer.

* Header now hardcodes the same version and application content type in
  the header.
* The real content type is appended after the data, before encryption (or
  after decryption).
* The IV is xored with the sequence number, instead of concatinating four
  bytes of IV with the explicit IV.
* Zero-padding:  No exlicit length is given, we search backwards from the
  end of the decrypted data for the first non-zero byte, which is the
  content type.  Currently recv supports reading zero-padding, but there
  is no way for send to add zero padding.

Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01 15:00:55 -08:00
Dave Watson
a2ef9b6a22 net: tls: Refactor tls aad space size calculation
TLS 1.3 has a different AAD size, use a variable in the code to
make TLS 1.3 support easy.

Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01 15:00:55 -08:00
Dave Watson
fb99bce712 net: tls: Support 256 bit keys
Wire up support for 256 bit keys from the setsockopt to the crypto
framework

Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01 15:00:55 -08:00
Johannes Berg
7d4194633b mac80211: fix missing/malformed documentation
Fix the missing and malformed documentation that kernel-doc and
sphinx warn about. While at it, also add some things to the docs
to fix missing links.

Sadly, the only way I could find to fix this was to add some
trailing whitespace.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-01 12:11:13 +01:00
Johannes Berg
9874b71fa1 cfg80211: add missing documentation that kernel-doc warns about
Add the missing documentation that kernel-doc continually warns
about, to get rid of all that noise.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-01 11:53:15 +01:00
Johannes Berg
23323289b1 netlink: reduce NLA_POLICY_NESTED{,_ARRAY} arguments
In typical cases, there's no need to pass both the maxattr
and the policy array pointer, as the maxattr should just be
ARRAY_SIZE(policy) - 1. Therefore, to be less error prone,
just remove the maxattr argument from the default macros
and deduce the size accordingly.

Leave the original macros with a leading underscore to use
here and in case somebody needs to pass a policy pointer
where the policy isn't declared in the same place and thus
ARRAY_SIZE() cannot be used.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-01 11:06:55 +01:00
Johannes Berg
752cfee90d Merge remote-tracking branch 'net-next/master' into mac80211-next
Merge net-next so that we get the changes from net, which would
otherwise conflict with the NLA_POLICY_NESTED/_ARRAY changes.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-01 11:05:35 +01:00
Matteo Croce
5ac4a12df5 cfg80211: fix typo
Fix spelling mistake in cfg80211.h: "lenght" -> "length".
The typo is also in the special comment block which
translates to documentation.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-01 11:05:07 +01:00
Toke Høiland-Jørgensen
cb86880ee4 mac80211: Fix documentation strings for airtime-related variables
There was a typo in the documentation for weight_multiplier in mac80211.h,
and the doc was missing entirely for airtime and airtime_weight in sta_info.h.

Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-01 11:04:53 +01:00
Daniel Borkmann
d5256083f6 ipvlan, l3mdev: fix broken l3s mode wrt local routes
While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
I ran into the issue that while l3 mode is working fine, l3s mode
does not have any connectivity to kube-apiserver and hence all pods
end up in Error state as well. The ipvlan master device sits on
top of a bond device and hostns traffic to kube-apiserver (also running
in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
where the latter is the address of the bond0. While in l3 mode, a
curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
works fine from hostns, neither of them do in case of l3s. In the
latter only a curl to https://127.0.0.1:37573 appeared to work where
for local addresses of bond0 I saw kernel suddenly starting to emit
ARP requests to query HW address of bond0 which remained unanswered
and neighbor entries in INCOMPLETE state. These ARP requests only
happen while in l3s.

Debugging this further, I found the issue is that l3s mode is piggy-
backing on l3 master device, and in this case local routes are using
l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per commit
f5a0aab84b ("net: ipv4: dst for local input routes should use l3mdev
if relevant") and 5f02ce24c2 ("net: l3mdev: Allow the l3mdev to be
a loopback"). I found that reverting them back into using the
net->loopback_dev fixed ipvlan l3s connectivity and got everything
working for the CNI.

Now judging from 4fbae7d83c ("ipvlan: Introduce l3s mode") and the
l3mdev paper in [0] the only sole reason why ipvlan l3s is relying
on l3 master device is to get the l3mdev_ip_rcv() receive hook for
setting the dst entry of the input route without adding its own
ipvlan specific hacks into the receive path, however, any l3 domain
semantics beyond just that are breaking l3s operation. Note that
ipvlan also has the ability to dynamically switch its internal
operation from l3 to l3s for all ports via ipvlan_set_port_mode()
at runtime. In any case, l3 vs l3s soley distinguishes itself by
'de-confusing' netfilter through switching skb->dev to ipvlan slave
device late in NF_INET_LOCAL_IN before handing the skb to L4.

Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
without any additional l3mdev semantics on top. This should also have
minimal impact since dev->priv_flags is already hot in cache. With
this set, l3s mode is working fine and I also get things like
masquerading pod traffic on the ipvlan master properly working.

  [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf

Fixes: f5a0aab84b ("net: ipv4: dst for local input routes should use l3mdev if relevant")
Fixes: 5f02ce24c2 ("net: l3mdev: Allow the l3mdev to be a loopback")
Fixes: 4fbae7d83c ("ipvlan: Introduce l3s mode")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: David Ahern <dsa@cumulusnetworks.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Martynas Pumputis <m@lambda.lt>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30 22:13:34 -08:00
Xin Long
7efba10d6b sctp: add SCTP_FUTURE_ASOC and SCTP_CURRENT_ASSOC for SCTP_STREAM_SCHEDULER sockopt
Check with SCTP_ALL_ASSOC instead in sctp_setsockopt_scheduler and
check with SCTP_FUTURE_ASSOC instead in sctp_getsockopt_scheduler,
it's compatible with 0.

SCTP_CURRENT_ASSOC is supported for SCTP_STREAM_SCHEDULER in this
patch. It also adds default_ss in sctp_sock to support
SCTP_FUTURE_ASSOC.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30 00:44:08 -08:00
Xin Long
8add543e36 sctp: add SCTP_FUTURE_ASSOC for SCTP_PEER_ADDR_THLDS sockopt
Check with SCTP_FUTURE_ASSOC instead in
sctp_set/getsockopt_paddr_thresholds, it's compatible with 0.

It also adds pf_retrans in sctp_sock to support SCTP_FUTURE_ASSOC.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-30 00:44:06 -08:00
Vasundhara Volam
b639583f9e devlink: Add a generic wake_on_lan port parameter
wake_on_lan - Enables Wake on Lan for this port. If enabled,
the controller asserts a wake pin based on the WOL type.

v2->v3:
- Define only WOL types used now and define them as bitfield, so that
  mutliple WOL types can be enabled upon power on.
- Modify "wake-on-lan" name to "wake_on_lan" to be symmetric with
  previous definitions.
- Rename DEVLINK_PARAM_WOL_XXX to DEVLINK_PARAM_WAKE_XXX to be
  symmetrical with ethtool WOL definitions.

Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29 22:13:09 -08:00
Vasundhara Volam
c1e5786d67 devlink: Add devlink notifications support for port params
Add notification call for devlink port param set, register and unregister
functions.
Add devlink_port_param_value_changed() function to enable the driver notify
devlink on value change. Driver should use this function after value was
changed on any configuration mode part to driverinit.

v7->v8:
Order devlink_port_param_value_changed() definitions followed by
devlink_param_value_changed()

Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29 22:13:09 -08:00
Vasundhara Volam
5473a7bdad devlink: Add support for driverinit set value for devlink_port
Add support for "driverinit" configuration mode value for devlink_port
configuration parameters. Add devlink_port_param_driverinit_value_set()
function to help the driver set the value to devlink_port.

Also, move the common code to __devlink_param_driverinit_value_set()
to be used by both device and port params.

v7->v8:
Re-order the definitions as follows:
__devlink_param_driverinit_value_get
__devlink_param_driverinit_value_set
devlink_param_driverinit_value_get
devlink_param_driverinit_value_set
devlink_port_param_driverinit_value_get
devlink_port_param_driverinit_value_set

Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29 22:13:09 -08:00
Vasundhara Volam
ffd19b9a49 devlink: Add support for driverinit get value for devlink_port
Add support for "driverinit" configuration mode value for devlink_port
configuration parameters. Add devlink_port_param_driverinit_value_get()
function to help the driver get the value from devlink_port.

Also, move the common code to __devlink_param_driverinit_value_get()
to be used by both device and port params.

v7->v8:
-Add the missing devlink_port_param_driverinit_value_get() declaration.
-Also, order devlink_port_param_driverinit_value_get() after
devlink_param_driverinit_value_get/set() calls

Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29 22:13:09 -08:00
Vasundhara Volam
39e6160e14 devlink: Add devlink_param for port register and unregister
Add functions to register and unregister for the driver supported
configuration parameters table per port.

v7->v8:
- Order the definitions following way as suggested by Jiri.
__devlink_params_register
__devlink_params_unregister
devlink_params_register
devlink_params_unregister
devlink_port_params_register
devlink_port_params_unregister
- Append with Acked-by: Jiri Pirko <jiri@mellanox.com>.

v2->v3:
- Add a helper __devlink_params_register() with common code used by
  both devlink_params_register() and devlink_port_params_register().

Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-29 22:13:08 -08:00
David S. Miller
eaf2a47f40 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-01-29 21:18:54 -08:00
Dave Watson
32eb67b93c net: tls: Save iv in tls_rec for async crypto requests
aead_request_set_crypt takes an iv pointer, and we change the iv
soon after setting it.  Some async crypto algorithms don't save the iv,
so we need to save it in the tls_rec for async requests.

Found by hardcoding x64 aesni to use async crypto manager (to test the async
codepath), however I don't think this combination can happen in the wild.
Presumably other hardware offloads will need this fix, but there have been
no user reports.

Fixes: a42055e8d2 ("Add support for async encryption of records...")
Signed-off-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28 23:05:55 -08:00
David S. Miller
ec7146db15 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-01-29

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Teach verifier dead code removal, this also allows for optimizing /
   removing conditional branches around dead code and to shrink the
   resulting image. Code store constrained architectures like nfp would
   have hard time doing this at JIT level, from Jakub.

2) Add JMP32 instructions to BPF ISA in order to allow for optimizing
   code generation for 32-bit sub-registers. Evaluation shows that this
   can result in code reduction of ~5-20% compared to 64 bit-only code
   generation. Also add implementation for most JITs, from Jiong.

3) Add support for __int128 types in BTF which is also needed for
   vmlinux's BTF conversion to work, from Yonghong.

4) Add a new command to bpftool in order to dump a list of BPF-related
   parameters from the system or for a specific network device e.g. in
   terms of available prog/map types or helper functions, from Quentin.

5) Add AF_XDP sock_diag interface for querying sockets from user
   space which provides information about the RX/TX/fill/completion
   rings, umem, memory usage etc, from Björn.

6) Add skb context access for skb_shared_info->gso_segs field, from Eric.

7) Add support for testing flow dissector BPF programs by extending
   existing BPF_PROG_TEST_RUN infrastructure, from Stanislav.

8) Split BPF kselftest's test_verifier into various subgroups of tests
   in order better deal with merge conflicts in this area, from Jakub.

9) Add support for queue/stack manipulations in bpftool, from Stanislav.

10) Document BTF, from Yonghong.

11) Dump supported ELF section names in libbpf on program load
    failure, from Taeung.

12) Silence a false positive compiler warning in verifier's BTF
    handling, from Peter.

13) Fix help string in bpftool's feature probing, from Prashant.

14) Remove duplicate includes in BPF kselftests, from Yue.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28 19:38:33 -08:00
David S. Miller
343917b410 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next tree:

1) Introduce a hashtable to speed up object lookups, from Florian Westphal.

2) Make direct calls to built-in extension, also from Florian.

3) Call helper before confirming the conntrack as it used to be originally,
   from Florian.

4) Call request_module() to autoload br_netfilter when physdev is used
   to relax the dependency, also from Florian.

5) Allow to insert rules at a given position ID that is internal to the
   batch, from Phil Sutter.

6) Several patches to replace conntrack indirections by direct calls,
   and to reduce modularization, from Florian. This also includes
   several follow up patches to deal with minor fallout from this
   rework.

7) Use RCU from conntrack gre helper, from Florian.

8) GRE conntrack module becomes built-in into nf_conntrack, from Florian.

9) Replace nf_ct_invert_tuplepr() by calls to nf_ct_invert_tuple(),
   from Florian.

10) Unify sysctl handling at the core of nf_conntrack, from Florian.

11) Provide modparam to register conntrack hooks.

12) Allow to match on the interface kind string, from wenxu.

13) Remove several exported symbols, not required anymore now after
    a bit of de-modulatization work has been done, from Florian.

14) Remove built-in map support in the hash extension, this can be
    done with the existing userspace infrastructure, from laura.

15) Remove indirection to calculate checksums in IPVS, from Matteo Croce.

16) Use call wrappers for indirection in IPVS, also from Matteo.

17) Remove superfluous __percpu parameter in nft_counter, patch from
    Luc Van Oostenryck.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-28 17:34:38 -08:00
Matteo Croce
fe19a8fea7 ipvs: avoid indirect calls when calculating checksums
The function pointer ip_vs_protocol->csum_check is only used in protocol
specific code, and never in the generic one.
Remove the function pointer from struct ip_vs_protocol and call the
checksum functions directly.
This reduces the performance impact of the Spectre mitigation, and
should give a small improvement even with RETPOLINES disabled.

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-28 11:15:58 +01:00
Wei Wang
4a41f453be tcp: change pingpong threshold to 3
In order to be more confident about an on-going interactive session, we
increment pingpong count by 1 for every interactive transaction and we
adjust TCP_PINGPONG_THRESH to 3.
This means, we only consider a session in pingpong mode after we see 3
interactive transactions, and start to activate delayed acks in quick
ack mode.
And in order to not over-count the credits, we only increase pingpong
count for the first packet sent in response for the previous received
packet.
This is mainly to prevent delaying the ack immediately after some
handshake protocol but no real interactive traffic pattern afterwards.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-27 13:29:43 -08:00
Wei Wang
31954cd8bb tcp: Refactor pingpong code
Instead of using pingpong as a single bit information, we refactor the
code to treat it as a counter. When interactive session is detected,
we set pingpong count to TCP_PINGPONG_THRESH. And when pingpong count
is >= TCP_PINGPONG_THRESH, we consider the session in pingpong mode.

This patch is a pure refactor and sets foundation for the next patch.
This patch itself does not change any pingpong logic.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-27 13:29:43 -08:00
David S. Miller
1d68101367 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2019-01-27 10:43:17 -08:00
wenxu
c8b34e680a ip_tunnel: Add tnl_update_pmtu in ip_md_tunnel_xmit
Add tnl_update_pmtu in ip_md_tunnel_xmit to dynamic modify
the pmtu which packet send through collect_metadata mode
ip tunnel

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-26 09:43:03 -08:00
Willem de Bruijn
f859a44847 tcp: allow zerocopy with fastopen
Accept MSG_ZEROCOPY in all the TCP states that allow sendmsg. Remove
the explicit check for ESTABLISHED and CLOSE_WAIT states.

This requires correctly handling zerocopy state (uarg, sk_zckey) in
all paths reachable from other TCP states. Such as the EPIPE case
in sk_stream_wait_connect, which a sendmsg() in incorrect state will
now hit. Most paths are already safe.

Only extension needed is for TCP Fastopen active open. This can build
an skb with data in tcp_send_syn_data. Pass the uarg along with other
fastopen state, so that this skb also generates a zerocopy
notification on release.

Tested with active and passive tcp fastopen packetdrill scripts at
1747eef03d

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-25 22:41:08 -08:00
Peter Oskolkov
d4289fcc9b net: IP6 defrag: use rbtrees for IPv6 defrag
Currently, IPv6 defragmentation code drops non-last fragments that
are smaller than 1280 bytes: see
commit 0ed4229b08 ("ipv6: defrag: drop non-last frags smaller than min mtu")

This behavior is not specified in IPv6 RFCs and appears to break
compatibility with some IPv6 implemenations, as reported here:
https://www.spinics.net/lists/netdev/msg543846.html

This patch re-uses common IP defragmentation queueing and reassembly
code in IPv6, removing the 1280 byte restriction.

Signed-off-by: Peter Oskolkov <posk@google.com>
Reported-by: Tom Herbert <tom@herbertland.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-25 21:37:11 -08:00
Peter Oskolkov
c23f35d19d net: IP defrag: encapsulate rbtree defrag code into callable functions
This is a refactoring patch: without changing runtime behavior,
it moves rbtree-related code from IPv4-specific files/functions
into .h/.c defrag files shared with IPv6 defragmentation code.

Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Cc: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-25 21:37:11 -08:00
Srinivas Dasari
fe4943702c cfg80211: Authentication offload to user space in AP mode
commit 40cbfa9021 ("cfg80211/nl80211: Optional authentication
offload to userspace")' introduced authentication offload to user
space by the host drivers in station mode. This commit extends
the same for the AP mode too.

Extend NL80211_ATTR_EXTERNAL_AUTH_SUPPORT to also claim the
support of external authentication from the user space in AP mode.
A new flag parameter is introduced in cfg80211_ap_settings to
intend the same while "start ap".

Host driver to use NL80211_CMD_FRAME interface to transmit and
receive the authentication frames to / from the user space.

Host driver to indicate the flag NL80211_RXMGMT_FLAG_EXTERNAL_AUTH
while sending the authentication frame to the user space. This
intends to the user space that the driver wishes it to process
the authentication frame for certain protocols, though it had
initially advertised the support for SME functionality.

User space shall accordingly do the authentication and indicate
its final status through the command NL80211_CMD_EXTERNAL_AUTH.
Allow the command even if userspace doesn't include the attribute
NL80211_ATTR_SSID for AP interface.

Host driver shall continue with the association sequence and
indicate the STA connection status through cfg80211_new_sta.

To facilitate the host drivers in AP mode for matching the pmkid
by the stations during the association, NL80211_CMD_EXTERNAL_AUTH
is also enhanced to include the pmkid to drivers after
the authentication.
This pmkid can also be used in the STA mode to include in the
association request.

Also modify nl80211_external_auth to not mandate SSID in AP mode.

Signed-off-by: Srinivas Dasari <dasaris@codeaurora.org>
[remove useless nla_get_flag() usage]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-01-25 21:08:05 +01:00
David S. Miller
30e5c2c6bf net: Revert devlink health changes.
This reverts the devlink health changes from 9/17/2019,
Jiri wants things to be designed differently and it was
agreed that the easiest way to do this is start from the
beginning again.

Commits reverted:

cb5ccfbe73
880ee82f03
c7af343b4e
ff253fedab
6f9d56132e
fcd852c69d
8a66704a13
12bd0dcefe
aba25279c1
ce019faa70
b8c45a033a

And the follow-on build fix:

o33a0efa4baecd689da9474ce0e8b673eb6931c60

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-25 10:53:23 -08:00
Veerendranath Jakkam
ab4dfa2053 cfg80211: Allow drivers to advertise supported AKM suites
There was no such capability advertisement from the driver and thus the
current user space has to assume the driver to support all the AKMs. While
that may be the case with some drivers (e.g., mac80211-based ones), there
are cfg80211-based drivers that implement SME and have constraints on
which AKMs can be supported (e.g., such drivers may need an update to
support SAE AKM using NL80211_CMD_EXTERNAL_AUTH). Allow such drivers to
advertise the exact set of supported AKMs so that user space tools can
determine what network profile options should be allowed to be configured.

Signed-off-by: Veerendranath Jakkam <vjakkam@codeaurora.org>
[pmsr data might be big, start a new netlink message section]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-01-25 14:05:31 +01:00
Toke Høiland-Jørgensen
390298e86f mac80211: Expose ieee80211_schedule_txq() function
Since we reworked ieee80211_return_txq() so it assumes that the caller
takes care of logging, we need another function that can be called without
holding any locks. Introduce ieee80211_schedule_txq() which serves this
purpose.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-01-25 14:05:27 +01:00
Rajat Jain
e2bef3847e Bluetooth: Allow driver specific cmd timeout handling
Add a hook to allow the BT driver to do device or command specific
handling in case of timeouts. This is to be used by Intel driver to
reset the device after certain number of timeouts.

Signed-off-by: Rajat Jain <rajatja@google.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2019-01-25 08:46:32 +01:00
Priyaranjan Jha
78dc70ebaa tcp_bbr: adapt cwnd based on ack aggregation estimation
Aggregation effects are extremely common with wifi, cellular, and cable
modem link technologies, ACK decimation in middleboxes, and LRO and GRO
in receiving hosts. The aggregation can happen in either direction,
data or ACKs, but in either case the aggregation effect is visible
to the sender in the ACK stream.

Previously BBR's sending was often limited by cwnd under severe ACK
aggregation/decimation because BBR sized the cwnd at 2*BDP. If packets
were acked in bursts after long delays (e.g. one ACK acking 5*BDP after
5*RTT), BBR's sending was halted after sending 2*BDP over 2*RTT, leaving
the bottleneck idle for potentially long periods. Note that loss-based
congestion control does not have this issue because when facing
aggregation it continues increasing cwnd after bursts of ACKs, growing
cwnd until the buffer is full.

To achieve good throughput in the presence of aggregation effects, this
algorithm allows the BBR sender to put extra data in flight to keep the
bottleneck utilized during silences in the ACK stream that it has evidence
to suggest were caused by aggregation.

A summary of the algorithm: when a burst of packets are acked by a
stretched ACK or a burst of ACKs or both, BBR first estimates the expected
amount of data that should have been acked, based on its estimated
bandwidth. Then the surplus ("extra_acked") is recorded in a windowed-max
filter to estimate the recent level of observed ACK aggregation. Then cwnd
is increased by the ACK aggregation estimate. The larger cwnd avoids BBR
being cwnd-limited in the face of ACK silences that recent history suggests
were caused by aggregation. As a sanity check, the ACK aggregation degree
is upper-bounded by the cwnd (at the time of measurement) and a global max
of BW * 100ms. The algorithm is further described by the following
presentation:
https://datatracker.ietf.org/meeting/101/materials/slides-101-iccrg-an-update-on-bbr-work-at-google-00

In our internal testing, we observed a significant increase in BBR
throughput (measured using netperf), in a basic wifi setup.
- Host1 (sender on ethernet) -> AP -> Host2 (receiver on wifi)
- 2.4 GHz -> BBR before: ~73 Mbps; BBR after: ~102 Mbps; CUBIC: ~100 Mbps
- 5.0 GHz -> BBR before: ~362 Mbps; BBR after: ~593 Mbps; CUBIC: ~601 Mbps

Also, this code is running globally on YouTube TCP connections and produced
significant bandwidth increases for YouTube traffic.

This is based on Ian Swett's max_ack_height_ algorithm from the
QUIC BBR implementation.

Signed-off-by: Priyaranjan Jha <priyarjha@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-24 22:27:27 -08:00
Nikolay Aleksandrov
949e7cea0c bonding: count master 3ad stats separately
I made a dumb mistake when I summed up the slave stats, obviously slaves
can come and go which would make the master stats unreliable.
Count and export the master stats separately.

Fixes: a258aeacd7 ("bonding: add support for xstats and export 3ad stats")
Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-24 22:18:48 -08:00
Björn Töpel
50e74c0131 xsk: add id to umem
This commit adds an id to the umem structure. The id uniquely
identifies a umem instance, and will be exposed to user-space via the
socket monitoring interface.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-01-25 01:50:03 +01:00
Björn Töpel
1d0dc06930 net: xsk: track AF_XDP sockets on a per-netns list
Track each AF_XDP socket in a per-netns list. This will be used later
by the sock_diag interface for querying sockets from userspace.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-01-25 01:50:03 +01:00
Eric Dumazet
63530aba78 ax25: fix possible use-after-free
syzbot found that ax25 routes where not properly protected
against concurrent use [1].

In this particular report the bug happened while
copying ax25->digipeat.

Fix this problem by making sure we call ax25_get_route()
while ax25_route_lock is held, so that no modification
could happen while using the route.

The current two ax25_get_route() callers do not sleep,
so this change should be fine.

Once we do that, ax25_get_route() no longer needs to
grab a reference on the found route.

[1]
ax25_connect(): syz-executor0 uses autobind, please contact jreuter@yaina.de
BUG: KASAN: use-after-free in memcpy include/linux/string.h:352 [inline]
BUG: KASAN: use-after-free in kmemdup+0x42/0x60 mm/util.c:113
Read of size 66 at addr ffff888066641a80 by task syz-executor2/531

ax25_connect(): syz-executor0 uses autobind, please contact jreuter@yaina.de
CPU: 1 PID: 531 Comm: syz-executor2 Not tainted 5.0.0-rc2+ #10
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1db/0x2d0 lib/dump_stack.c:113
 print_address_description.cold+0x7c/0x20d mm/kasan/report.c:187
 kasan_report.cold+0x1b/0x40 mm/kasan/report.c:317
 check_memory_region_inline mm/kasan/generic.c:185 [inline]
 check_memory_region+0x123/0x190 mm/kasan/generic.c:191
 memcpy+0x24/0x50 mm/kasan/common.c:130
 memcpy include/linux/string.h:352 [inline]
 kmemdup+0x42/0x60 mm/util.c:113
 kmemdup include/linux/string.h:425 [inline]
 ax25_rt_autobind+0x25d/0x750 net/ax25/ax25_route.c:424
 ax25_connect.cold+0x30/0xa4 net/ax25/af_ax25.c:1224
 __sys_connect+0x357/0x490 net/socket.c:1664
 __do_sys_connect net/socket.c:1675 [inline]
 __se_sys_connect net/socket.c:1672 [inline]
 __x64_sys_connect+0x73/0xb0 net/socket.c:1672
 do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x458099
Code: 6d b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 3b b7 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f870ee22c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000458099
RDX: 0000000000000048 RSI: 0000000020000080 RDI: 0000000000000005
RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
ax25_connect(): syz-executor4 uses autobind, please contact jreuter@yaina.de
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f870ee236d4
R13: 00000000004be48e R14: 00000000004ce9a8 R15: 00000000ffffffff

Allocated by task 526:
 save_stack+0x45/0xd0 mm/kasan/common.c:73
 set_track mm/kasan/common.c:85 [inline]
 __kasan_kmalloc mm/kasan/common.c:496 [inline]
 __kasan_kmalloc.constprop.0+0xcf/0xe0 mm/kasan/common.c:469
 kasan_kmalloc+0x9/0x10 mm/kasan/common.c:504
ax25_connect(): syz-executor5 uses autobind, please contact jreuter@yaina.de
 kmem_cache_alloc_trace+0x151/0x760 mm/slab.c:3609
 kmalloc include/linux/slab.h:545 [inline]
 ax25_rt_add net/ax25/ax25_route.c:95 [inline]
 ax25_rt_ioctl+0x3b9/0x1270 net/ax25/ax25_route.c:233
 ax25_ioctl+0x322/0x10b0 net/ax25/af_ax25.c:1763
 sock_do_ioctl+0xe2/0x400 net/socket.c:950
 sock_ioctl+0x32f/0x6c0 net/socket.c:1074
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:509 [inline]
 do_vfs_ioctl+0x107b/0x17d0 fs/ioctl.c:696
 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
 __do_sys_ioctl fs/ioctl.c:720 [inline]
 __se_sys_ioctl fs/ioctl.c:718 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
 do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

ax25_connect(): syz-executor5 uses autobind, please contact jreuter@yaina.de
Freed by task 550:
 save_stack+0x45/0xd0 mm/kasan/common.c:73
 set_track mm/kasan/common.c:85 [inline]
 __kasan_slab_free+0x102/0x150 mm/kasan/common.c:458
 kasan_slab_free+0xe/0x10 mm/kasan/common.c:466
 __cache_free mm/slab.c:3487 [inline]
 kfree+0xcf/0x230 mm/slab.c:3806
 ax25_rt_add net/ax25/ax25_route.c:92 [inline]
 ax25_rt_ioctl+0x304/0x1270 net/ax25/ax25_route.c:233
 ax25_ioctl+0x322/0x10b0 net/ax25/af_ax25.c:1763
 sock_do_ioctl+0xe2/0x400 net/socket.c:950
 sock_ioctl+0x32f/0x6c0 net/socket.c:1074
 vfs_ioctl fs/ioctl.c:46 [inline]
 file_ioctl fs/ioctl.c:509 [inline]
 do_vfs_ioctl+0x107b/0x17d0 fs/ioctl.c:696
 ksys_ioctl+0xab/0xd0 fs/ioctl.c:713
 __do_sys_ioctl fs/ioctl.c:720 [inline]
 __se_sys_ioctl fs/ioctl.c:718 [inline]
 __x64_sys_ioctl+0x73/0xb0 fs/ioctl.c:718
 do_syscall_64+0x1a3/0x800 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe

The buggy address belongs to the object at ffff888066641a80
 which belongs to the cache kmalloc-96 of size 96
The buggy address is located 0 bytes inside of
 96-byte region [ffff888066641a80, ffff888066641ae0)
The buggy address belongs to the page:
page:ffffea0001999040 count:1 mapcount:0 mapping:ffff88812c3f04c0 index:0x0
flags: 0x1fffc0000000200(slab)
ax25_connect(): syz-executor4 uses autobind, please contact jreuter@yaina.de
raw: 01fffc0000000200 ffffea0001817948 ffffea0002341dc8 ffff88812c3f04c0
raw: 0000000000000000 ffff888066641000 0000000100000020 0000000000000000
page dumped because: kasan: bad access detected

Memory state around the buggy address:
 ffff888066641980: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
 ffff888066641a00: 00 00 00 00 00 00 00 00 02 fc fc fc fc fc fc fc
>ffff888066641a80: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
                   ^
 ffff888066641b00: fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc fc
 ffff888066641b80: 00 00 00 00 00 00 00 00 00 00 00 00 fc fc fc fc

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ralf Baechle <ralf@linux-mips.org>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-23 11:18:00 -08:00
Linus Lüssing
4b3087c7e3 bridge: Snoop Multicast Router Advertisements
When multiple multicast routers are present in a broadcast domain then
only one of them will be detectable via IGMP/MLD query snooping. The
multicast router with the lowest IP address will become the selected and
active querier while all other multicast routers will then refrain from
sending queries.

To detect such rather silent multicast routers, too, RFC4286
("Multicast Router Discovery") provides a standardized protocol to
detect multicast routers for multicast snooping switches.

This patch implements the necessary MRD Advertisement message parsing
and after successful processing adds such routers to the internal
multicast router list.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-22 17:18:09 -08:00
Linus Lüssing
ba5ea61462 bridge: simplify ip_mc_check_igmp() and ipv6_mc_check_mld() calls
This patch refactors ip_mc_check_igmp(), ipv6_mc_check_mld() and
their callers (more precisely, the Linux bridge) to not rely on
the skb_trimmed parameter anymore.

An skb with its tail trimmed to the IP packet length was initially
introduced for the following three reasons:

1) To be able to verify the ICMPv6 checksum.
2) To be able to distinguish the version of an IGMP or MLD query.
   They are distinguishable only by their size.
3) To avoid parsing data for an IGMPv3 or MLDv2 report that is
   beyond the IP packet but still within the skb.

The first case still uses a cloned and potentially trimmed skb to
verfiy. However, there is no need to propagate it to the caller.
For the second and third case explicit IP packet length checks were
added.

This hopefully makes ip_mc_check_igmp() and ipv6_mc_check_mld() easier
to read and verfiy, as well as easier to use.

Signed-off-by: Linus Lüssing <linus.luessing@c0d3.blue>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-22 17:18:08 -08:00
Nikolay Aleksandrov
a258aeacd7 bonding: add support for xstats and export 3ad stats
This patch adds support for extended statistics (xstats) call to the
bonding. The first user would be the 3ad code which counts the following
events:
 - LACPDU Rx/Tx
 - LACPDU unknown type Rx
 - LACPDU illegal Rx
 - Marker Rx/Tx
 - Marker response Rx/Tx
 - Marker unknown type Rx

All of these are exported via netlink as separate attributes to be
easily extensible as we plan to add more in the future.
Similar to how the bridge and other xstats exports, the structure
inside is:
 [ IFLA_STATS_LINK_XSTATS ]
   -> [ LINK_XSTATS_TYPE_BOND ]
        -> [ BOND_XSTATS_3AD ]
             -> [ 3ad stats attributes ]

With this structure it's easy to add more stat types later.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-22 12:04:14 -08:00
Nikolay Aleksandrov
267c095aa2 bonding: add 3ad stats
Count the following types of 3ad packets per slave:
 - rx/tx lacpdu
 - rx/tx marker
 - rx/tx marker response
 - rx illegal lacpdus (right now counted on wrong length)
 - rx unknown lacpdu type
 - rx unknown marker type

The counters are using atomic64 since this is not fast path.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-22 12:04:14 -08:00
Matthias Kaehlcke
c4f5627f7e Bluetooth: Fix locking in bt_accept_enqueue() for BH context
With commit e163376220 ("Bluetooth: Handle bt_accept_enqueue() socket
atomically") lock_sock[_nested]() is used to acquire the socket lock
before manipulating the socket. lock_sock[_nested]() may block, which
is problematic since bt_accept_enqueue() can be called in bottom half
context (e.g. from rfcomm_connect_ind()):

[<ffffff80080d81ec>] __might_sleep+0x4c/0x80
[<ffffff800876c7b0>] lock_sock_nested+0x24/0x58
[<ffffff8000d7c27c>] bt_accept_enqueue+0x48/0xd4 [bluetooth]
[<ffffff8000e67d8c>] rfcomm_connect_ind+0x190/0x218 [rfcomm]

Add a parameter to bt_accept_enqueue() to indicate whether the
function is called from BH context, and acquire the socket lock
with bh_lock_sock_nested() if that's the case.

Also adapt all callers of bt_accept_enqueue() to pass the new
parameter:

- l2cap_sock_new_connection_cb()
  - uses lock_sock() to lock the parent socket => process context

- rfcomm_connect_ind()
  - acquires the parent socket lock with bh_lock_sock() => BH
    context

- __sco_chan_add()
  - called from sco_chan_add(), which is called from sco_connect().
    parent is NULL, hence bt_accept_enqueue() isn't called in this
    code path and we can ignore it
  - also called from sco_conn_ready(). uses bh_lock_sock() to acquire
    the parent lock => BH context

Fixes: e163376220 ("Bluetooth: Handle bt_accept_enqueue() socket atomically")
Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
Reviewed-by: Douglas Anderson <dianders@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
Cc: stable@vger.kernel.org
2019-01-22 09:51:20 +01:00
David S. Miller
fa7f3a8d56 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Completely minor snmp doc conflict.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-21 14:41:32 -08:00
Toke Høiland-Jørgensen
5f2939d933 sch_api: Change signature of qdisc_tree_reduce_backlog() to use ints
There are now several places where qdisc_tree_reduce_backlog() is called
with a negative number of packets (to signal an increase in number of
packets in the queue). Rather than rely on overflow behaviour, change the
function signature to use signed integers to communicate this usage to
people reading the code.

Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-19 09:53:18 -08:00
Felix Fietkau
77f7ffdc33 mac80211: minstrel_ht: add flag to indicate missing/inaccurate tx A-MPDU length
Some hardware (e.g. MediaTek MT7603) cannot report A-MPDU length in tx status
information. Add support for a flag to indicate that, to allow minstrel_ht
to use a fixed value in its internal calculation (which gives better results
than just defaulting to 1).

Signed-off-by: Felix Fietkau <nbd@nbd.name>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-01-19 09:56:56 +01:00
Julan Hsu
540bbcb930 nl80211/mac80211: mesh: add mesh path change count to mpath info
Expose path change count to destination in mpath info

Signed-off-by: Julan Hsu <julanhsu@google.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-01-19 09:55:20 +01:00
Julan Hsu
cc24163690 nl80211/mac80211: mesh: add hop count to mpath info
Expose hop count to destination information in mpath info

Signed-off-by: Julan Hsu <julanhsu@google.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-01-19 09:54:36 +01:00
Toke Høiland-Jørgensen
b4809e9484 mac80211: Add airtime accounting and scheduling to TXQs
This adds airtime accounting and scheduling to the mac80211 TXQ
scheduler. A new callback, ieee80211_sta_register_airtime(), is added
that drivers can call to report airtime usage for stations.

When airtime information is present, mac80211 will schedule TXQs
(through ieee80211_next_txq()) in a way that enforces airtime fairness
between active stations. This scheduling works the same way as the ath9k
in-driver airtime fairness scheduling. If no airtime usage is reported
by the driver, the scheduler will default to round-robin scheduling.

For drivers that don't control TXQ scheduling in software, a new API
function, ieee80211_txq_may_transmit(), is added which the driver can use
to check if the TXQ is eligible for transmission, or should be throttled to
enforce fairness. Calls to this function must also be enclosed in
ieee80211_txq_schedule_{start,end}() calls to ensure proper locking.

The API ieee80211_txq_may_transmit() also ensures that TXQ list will be
aligned aginst driver's own round-robin scheduler list. i.e it rotates
the TXQ list till it makes the requested node becomes the first entry
in TXQ list. Thus both the TXQ list and driver's list are in sync.

Co-developed-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Signed-off-by: Louie Lu <git@louie.lu>
[added debugfs write op to reset airtime counter]
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-01-19 09:31:43 +01:00
Toke Høiland-Jørgensen
36647055b3 cfg80211: Add airtime statistics and settings
This adds TX airtime statistics to the cfg80211 station dump (to go along
with the RX info already present), and adds a new parameter to set the
airtime weight of each station. The latter allows userspace to implement
policies for different stations by varying their weights.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
[rmanohar@codeaurora.org: fixed checkpatch warnings]
Signed-off-by: Rajkumar Manoharan <rmanohar@codeaurora.org>
[move airtime weight != 0 check into policy]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-01-19 09:31:35 +01:00
Toke Høiland-Jørgensen
1866760096 mac80211: Add TXQ scheduling API
This adds an API to mac80211 to handle scheduling of TXQs. The interface
between driver and mac80211 for TXQ handling is changed by adding two new
functions: ieee80211_next_txq(), which will return the next TXQ to schedule
in the current round-robin rotation, and ieee80211_return_txq(), which the
driver uses to indicate that it has finished scheduling a TXQ (which will
then be put back in the scheduling rotation if it isn't empty).

The driver must call ieee80211_txq_schedule_start() at the start of each
scheduling session, and ieee80211_txq_schedule_end() at the end. The API
then guarantees that the same TXQ is not returned twice in the same
session (so a driver can loop on ieee80211_next_txq() without worrying
about breaking the loop.

Usage of the new API is optional, so drivers can be ported one at a time.
In this patch, the actual scheduling performed by mac80211 is simple
round-robin, but a subsequent commit adds airtime fairness awareness to the
scheduler.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
[minor kernel-doc fix, propagate sparse locking checks out]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-01-19 09:24:12 +01:00
Eran Ben Elisha
c7af343b4e devlink: Add health report functionality
Upon error discover, every driver can report it to the devlink health
mechanism via devlink_health_report function, using the appropriate
reporter registered to it. Driver can pass error specific context which
will be delivered to it as part of the dump / recovery callbacks.

Once an error is reported, devlink health will do the following actions:
* A log is being send to the kernel trace events buffer
* Health status and statistics are being updated for the reporter instance
* Object dump is being taken and stored at the reporter instance (as long
  as there is no other dump which is already stored)
* Auto recovery attempt is being done. depends on:
  - Auto Recovery configuration
  - Grace period vs. time since last recover

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-18 14:51:22 -08:00
Eran Ben Elisha
880ee82f03 devlink: Add health reporter create/destroy functionality
Devlink health reporter is an instance for reporting, diagnosing and
recovering from run time errors discovered by the reporters.
Define it's data structure and supported operations.
In addition, expose devlink API to create and destroy a reporter.
Each devlink instance will hold it's own reporters list.

As part of the allocation, driver shall provide a set of callbacks which
will be used the devlink in order to handle health reports and user
commands related to this reporter. In addition, driver is entitled to
provide some priv pointer, which can be fetched from the reporter by
devlink_health_reporter_priv function.

For each reporter, devlink will hold a metadata of statistics,
buffers and status.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-18 14:51:22 -08:00
Eran Ben Elisha
cb5ccfbe73 devlink: Add health buffer support
Devlink health buffer is a mechanism to pass descriptors between drivers
and devlink. The API allows the driver to add objects, object pair,
value array (nested attributes), value and name.

Driver can use this API to fill the buffers in a format which can be
translated by the devlink to the netlink message.

In order to fulfill it, an internal buffer descriptor is defined. This
will hold the data and metadata per each attribute and by used to pass
actual commands to the netlink.

This mechanism will be later used in devlink health for dump and diagnose
data store by the drivers.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-18 14:51:22 -08:00
Yafang Shao
340a6f3d2d tcp: declare tcp_mmap() only when CONFIG_MMU is set
Since tcp_mmap() is defined when CONFIG_MMU is set.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-18 14:03:53 -08:00
Florian Westphal
472caa6918 netfilter: nat: un-export nf_nat_used_tuple
Not used since 203f2e7820 ("netfilter: nat: remove l4proto->unique_tuple")

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:59:45 +01:00
Florian Westphal
4a60dc748d netfilter: conntrack: remove nf_ct_l4proto_find_get
Its now same as __nf_ct_l4proto_find(), so rename that to
nf_ct_l4proto_find and use it everywhere.

It never returns NULL and doesn't need locks or reference counts.

Before this series:
302824  net/netfilter/nf_conntrack.ko
 21504  net/netfilter/nf_conntrack_proto_gre.ko

  text	   data	    bss	    dec	    hex	filename
  6281	   1732	      4	   8017	   1f51	nf_conntrack_proto_gre.ko
108356	  20613	    236	 129205	  1f8b5	nf_conntrack.ko

After:
294864  net/netfilter/nf_conntrack.ko
  text	   data	    bss	    dec	    hex	filename
106979	  19557	    240	 126776	  1ef38	nf_conntrack.ko

so, even with builtin gre, total size got reduced.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal
e56894356f netfilter: conntrack: remove l4proto destroy hook
Only one user (gre), add a direct call and remove this facility.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal
2a389de86e netfilter: conntrack: remove l4proto init and get_net callbacks
Those were needed we still had modular trackers.
As we don't have those anymore, prefer direct calls and remove all
the (un)register infrastructure associated with this.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal
70aed4647c netfilter: conntrack: remove sysctl registration helpers
After previous patch these are not used anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal
303e0c5589 netfilter: conntrack: avoid unneeded nf_conntrack_l4proto lookups
after removal of the packet and invert function pointers, several
places do not need to lookup the l4proto structure anymore.

Remove those lookups.
The function nf_ct_invert_tuplepr becomes redundant, replace
it with nf_ct_invert_tuple everywhere.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal
edf0338dab netfilter: conntrack: remove pernet l4 proto register interface
No used anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal
44fb87f635 netfilter: conntrack: remove remaining l4proto indirect packet calls
Now that all l4trackers are builtin, no need to use a mix of direct and
indirect calls.
This removes the last two users: gre and the generic l4 protocol
tracker.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal
b184356d0a netfilter: conntrack: remove module owner field
No need to get/put module owner reference, none of these can be removed
anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal
197c4300ae netfilter: conntrack: remove invert_tuple callback
Only used by icmp(v6).  Prefer a direct call and remove this
function from the l4proto struct.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal
df5e162908 netfilter: conntrack: remove pkt_to_tuple callback
GRE is now builtin, so we can handle it via direct call and
remove the callback.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:34 +01:00
Florian Westphal
751fc301ec netfilter: conntrack: remove net_id
No users anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal
22fc4c4c9f netfilter: conntrack: gre: switch module to be built-in
This makes the last of the modular l4 trackers 'bool'.

After this, all infrastructure to handle dynamic l4 protocol registration
becomes obsolete and can be removed in followup patches.

Old:
302824 net/netfilter/nf_conntrack.ko
 21504 net/netfilter/nf_conntrack_proto_gre.ko

New:
313728 net/netfilter/nf_conntrack.ko

Old:
   text	   data	    bss	    dec	    hex	filename
   6281	   1732	      4	   8017	   1f51	nf_conntrack_proto_gre.ko
 108356	  20613	    236	 129205	  1f8b5	nf_conntrack.ko
New:
 112095	  21381	    240	 133716	  20a54	nf_conntrack.ko

The size increase is only temporary.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal
e2e48b4716 netfilter: conntrack: handle icmp pkt_to_tuple helper via direct calls
rather than handling them via indirect call, use a direct one instead.
This leaves GRE as the last user of this indirect call facility.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal
a47c540481 netfilter: conntrack: handle builtin l4proto packet functions via direct calls
The l4 protocol trackers are invoked via indirect call: l4proto->packet().

With one exception (gre), all l4trackers are builtin, so we can make
.packet optional and use a direct call for most protocols.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal
8e2f311a68 netfilter: physdev: relax br_netfilter dependency
Following command:
  iptables -D FORWARD -m physdev ...
causes connectivity loss in some setups.

Reason is that iptables userspace will probe kernel for the module revision
of the physdev patch, and physdev has an artificial dependency on
br_netfilter (xt_physdev use makes no sense unless a br_netfilter module
is loaded).

This causes the "phydev" module to be loaded, which in turn enables the
"call-iptables" infrastructure.

bridged packets might then get dropped by the iptables ruleset.

The better fix would be to change the "call-iptables" defaults to 0 and
enforce explicit setting to 1, but that breaks backwards compatibility.

This does the next best thing: add a request_module call to checkentry.
This was a stray '-D ... -m physdev' won't activate br_netfilter
anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal
10870dd89e netfilter: nf_tables: add direct calls for all builtin expressions
With CONFIG_RETPOLINE its faster to add an if (ptr == &foo_func)
check and and use direct calls for all the built-in expressions.

~15% improvement in pathological cases.

checkpatch doesn't like the X macro due to the embedded return statement,
but the macro has a very limited scope so I don't think its a problem.

I would like to avoid bugs of the form
  If (e->ops->eval == (unsigned long)nft_foo_eval)
	 nft_bar_eval();

and open-coded if ()/else if()/else cascade, thus the macro.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal
4d44175aa5 netfilter: nf_tables: handle nft_object lookups via rhltable
Instead of linear search, use rhlist interface to look up the objects.
This fixes rulesets with thousands of named objects (quota, counters and
the like).

We only use a single table for this and consider the address of the
table we're doing the lookup in as a part of the key.

This reduces restore time of a sample ruleset with ~20k named counters
from 37 seconds to 0.8 seconds.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:33 +01:00
Florian Westphal
d152159b89 netfilter: nf_tables: prepare nft_object for lookups via hashtable
Add a 'key' structure for object, so we can look them up by name + table
combination (the name can be the same in each table).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-18 15:02:32 +01:00
Petr Machata
6685987c29 switchdev: Add extack argument to call_switchdev_notifiers()
A follow-up patch will enable vetoing of FDB entries. Make it possible
to communicate details of why an FDB entry is not acceptable back to the
user.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 15:18:47 -08:00
Petr Machata
4c59b7d160 vxlan: Add extack to switchdev operations
There are four sources of VXLAN switchdev notifier calls:

- the changelink() link operation, which already supports extack,
- ndo_fdb_add() which got extack support in a previous patch,
- FDB updates due to packet forwarding,
- and vxlan_fdb_replay().

Extend vxlan_fdb_switchdev_call_notifiers() to include extack in the
switchdev message that it sends, and propagate the argument upwards to
the callers. For the first two cases, pass in the extack gotten through
the operation. For case #3, pass in NULL.

To cover the last case, extend vxlan_fdb_replay() to take extack
argument, which might come from whatever operation necessitated the FDB
replay.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 15:18:47 -08:00
Vakul Garg
692d7b5d1f tls: Fix recvmsg() to be able to peek across multiple records
This fixes recvmsg() to be able to peek across multiple tls records.
Without this patch, the tls's selftests test case
'recv_peek_large_buf_mult_recs' fails. Each tls receive context now
maintains a 'rx_list' to retain incoming skb carrying tls records. If a
tls record needs to be retained e.g. for peek case or for the case when
the buffer passed to recvmsg() has a length smaller than decrypted
record length, then it is added to 'rx_list'. Additionally, records are
added in 'rx_list' if the crypto operation runs in async mode. The
records are dequeued from 'rx_list' after the decrypted data is consumed
by copying into the buffer passed to recvmsg(). In case, the MSG_PEEK
flag is used in recvmsg(), then records are not consumed or removed
from the 'rx_list'.

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 14:20:40 -08:00
Florian Fainelli
ecfc937210 net: dsa: Split platform data to header file
Instead of having net/dsa.h contain both the internal switch tree/driver
structures, split the relevant platform_data parts into
include/linux/platform_data/dsa.h and make that header be included by
net/dsa.h in order not to break any setup. A subsequent set of patches
will update code including net/dsa.h to include only the platform_data
header.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-17 11:31:24 -08:00
Florian Fainelli
da7b9e9b00 net: dsa: Add ndo_get_phys_port_name() for CPU port
There is not currently way to infer the port number through sysfs that
is being used as the CPU port number. Overlay a ndo_get_phys_port_name()
operation onto the DSA master network device in order to retrieve that
information.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-16 21:12:21 -08:00
David Howells
e122d845a0 Revert "rxrpc: Allow failed client calls to be retried"
The changes introduced to allow rxrpc calls to be retried creates an issue
when it comes to refcounting afs_call structs.  The problem is that when
rxrpc_send_data() queues the last packet for an asynchronous call, the
following sequence can occur:

 (1) The notify_end_tx callback is invoked which causes the state in the
     afs_call to be changed from AFS_CALL_CL_REQUESTING or
     AFS_CALL_SV_REPLYING.

 (2) afs_deliver_to_call() can then process event notifications from rxrpc
     on the async_work queue.

 (3) Delivery of events, such as an abort from the server, can cause the
     afs_call state to be changed to AFS_CALL_COMPLETE on async_work.

 (4) For an asynchronous call, afs_process_async_call() notes that the call
     is complete and tried to clean up all the refs on async_work.

 (5) rxrpc_send_data() might return the amount of data transferred
     (success) or an error - which could in turn reflect a local error or a
     received error.

Synchronising the clean up after rxrpc_kernel_send_data() returns an error
with the asynchronous cleanup is then tricky to get right.

Mostly revert commit c038a58ccf.  The two API
functions the original commit added aren't currently used.  This makes
rxrpc_kernel_send_data() always return successfully if it queued the data
it was given.

Note that this doesn't affect synchronous calls since their Rx notification
function merely pokes a wait queue and does not refcounting.  The
asynchronous call notification function *has* to do refcounting and pass a
ref over the work item to avoid the need to sync the workqueue in call
cleanup.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-15 21:33:36 -08:00
Ido Schimmel
f97f4dd8b3 net: ipv4: Fix memory leak in network namespace dismantle
IPv4 routing tables are flushed in two cases:

1. In response to events in the netdev and inetaddr notification chains
2. When a network namespace is being dismantled

In both cases only routes associated with a dead nexthop group are
flushed. However, a nexthop group will only be marked as dead in case it
is populated with actual nexthops using a nexthop device. This is not
the case when the route in question is an error route (e.g.,
'blackhole', 'unreachable').

Therefore, when a network namespace is being dismantled such routes are
not flushed and leaked [1].

To reproduce:
# ip netns add blue
# ip -n blue route add unreachable 192.0.2.0/24
# ip netns del blue

Fix this by not skipping error routes that are not marked with
RTNH_F_DEAD when flushing the routing tables.

To prevent the flushing of such routes in case #1, add a parameter to
fib_table_flush() that indicates if the table is flushed as part of
namespace dismantle or not.

Note that this problem does not exist in IPv6 since error routes are
associated with the loopback device.

[1]
unreferenced object 0xffff888066650338 (size 56):
  comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
  hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 b0 1c 62 61 80 88 ff ff  ..........ba....
    e8 8b a1 64 80 88 ff ff 00 07 00 08 fe 00 00 00  ...d............
  backtrace:
    [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
    [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
    [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
    [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
    [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
    [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
    [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
    [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
    [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
    [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000003a8b605b>] 0xffffffffffffffff
unreferenced object 0xffff888061621c88 (size 48):
  comm "ip", pid 1206, jiffies 4294786063 (age 26.235s)
  hex dump (first 32 bytes):
    6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b 6b  kkkkkkkkkkkkkkkk
    6b 6b 6b 6b 6b 6b 6b 6b d8 8e 26 5f 80 88 ff ff  kkkkkkkk..&_....
  backtrace:
    [<00000000733609e3>] fib_table_insert+0x978/0x1500
    [<00000000856ed27d>] inet_rtm_newroute+0x129/0x220
    [<00000000fcdfc00a>] rtnetlink_rcv_msg+0x397/0xa20
    [<00000000cb85801a>] netlink_rcv_skb+0x132/0x380
    [<00000000ebc991d2>] netlink_unicast+0x4c0/0x690
    [<0000000014f62875>] netlink_sendmsg+0x929/0xe10
    [<00000000bac9d967>] sock_sendmsg+0xc8/0x110
    [<00000000223e6485>] ___sys_sendmsg+0x77a/0x8f0
    [<000000002e94f880>] __sys_sendmsg+0xf7/0x250
    [<00000000ccb1fa72>] do_syscall_64+0x14d/0x610
    [<00000000ffbe3dae>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000003a8b605b>] 0xffffffffffffffff

Fixes: 8cced9eff1 ("[NETNS]: Enable routing configuration in non-initial namespace.")
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-15 13:33:44 -08:00
wenxu
10f4e76587 netfilter: nft_flow_offload: fix interaction with vrf slave device
In the forward chain, the iif is changed from slave device to master vrf
device. Thus, flow offload does not find a match on the lower slave
device.

This patch uses the cached route, ie. dst->dev, to update the iif and
oif fields in the flow entry.

After this patch, the following example works fine:

 # ip addr add dev eth0 1.1.1.1/24
 # ip addr add dev eth1 10.0.0.1/24
 # ip link add user1 type vrf table 1
 # ip l set user1 up
 # ip l set dev eth0 master user1
 # ip l set dev eth1 master user1

 # nft add table firewall
 # nft add flowtable f fb1 { hook ingress priority 0 \; devices = { eth0, eth1 } \; }
 # nft add chain f ftb-all {type filter hook forward priority 0 \; policy accept \; }
 # nft add rule f ftb-all ct zone 1 ip protocol tcp flow offload @fb1
 # nft add rule f ftb-all ct zone 1 ip protocol udp flow offload @fb1

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-01-11 00:55:37 +01:00
Linus Torvalds
96d4f267e4 Remove 'type' argument from access_ok() function
Nobody has actually used the type (VERIFY_READ vs VERIFY_WRITE) argument
of the user address range verification function since we got rid of the
old racy i386-only code to walk page tables by hand.

It existed because the original 80386 would not honor the write protect
bit when in kernel mode, so you had to do COW by hand before doing any
user access.  But we haven't supported that in a long time, and these
days the 'type' argument is a purely historical artifact.

A discussion about extending 'user_access_begin()' to do the range
checking resulted this patch, because there is no way we're going to
move the old VERIFY_xyz interface to that model.  And it's best done at
the end of the merge window when I've done most of my merges, so let's
just get this done once and for all.

This patch was mostly done with a sed-script, with manual fix-ups for
the cases that weren't of the trivial 'access_ok(VERIFY_xyz' form.

There were a couple of notable cases:

 - csky still had the old "verify_area()" name as an alias.

 - the iter_iov code had magical hardcoded knowledge of the actual
   values of VERIFY_{READ,WRITE} (not that they mattered, since nothing
   really used it)

 - microblaze used the type argument for a debug printout

but other than those oddities this should be a total no-op patch.

I tried to fix up all architectures, did fairly extensive grepping for
access_ok() uses, and the changes are trivial, but I may have missed
something.  Any missed conversion should be trivially fixable, though.

Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2019-01-03 18:57:57 -08:00
Willem de Bruijn
cb9f1b7838 ip: validate header length on virtual device xmit
KMSAN detected read beyond end of buffer in vti and sit devices when
passing truncated packets with PF_PACKET. The issue affects additional
ip tunnel devices.

Extend commit 76c0ddd8c3 ("ip6_tunnel: be careful when accessing the
inner header") and commit ccfec9e5cb ("ip_tunnel: be careful when
accessing the inner header").

Move the check to a separate helper and call at the start of each
ndo_start_xmit function in net/ipv4 and net/ipv6.

Minor changes:
- convert dev_kfree_skb to kfree_skb on error path,
  as dev_kfree_skb calls consume_skb which is not for error paths.
- use pskb_network_may_pull even though that is pedantic here,
  as the same as pskb_may_pull for devices without llheaders.
- do not cache ipv6 hdrs if used only once
  (unsafe across pskb_may_pull, was more relevant to earlier patch)

Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-01 12:05:02 -08:00
Deepa Dinamani
3a0ed3e961 sock: Make sock->sk_stamp thread-safe
Al Viro mentioned (Message-ID
<20170626041334.GZ10672@ZenIV.linux.org.uk>)
that there is probably a race condition
lurking in accesses of sk_stamp on 32-bit machines.

sock->sk_stamp is of type ktime_t which is always an s64.
On a 32 bit architecture, we might run into situations of
unsafe access as the access to the field becomes non atomic.

Use seqlocks for synchronization.
This allows us to avoid using spinlocks for readers as
readers do not need mutual exclusion.

Another approach to solve this is to require sk_lock for all
modifications of the timestamps. The current approach allows
for timestamps to have their own lock: sk_stamp_lock.
This allows for the patch to not compete with already
existing critical sections, and side effects are limited
to the paths in the patch.

The addition of the new field maintains the data locality
optimizations from
commit 9115e8cd2a ("net: reorganize struct sock for better data
locality")

Note that all the instances of the sk_stamp accesses
are either through the ioctl or the syscall recvmsg.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-01-01 09:47:59 -08:00
Pablo Neira Ayuso
c80f10bc97 netfilter: nf_conncount: speculative garbage collection on empty lists
Instead of removing a empty list node that might be reintroduced soon
thereafter, tentatively place the empty list node on the list passed to
tree_nodes_free(), then re-check if the list is empty again before erasing
it from the tree.

[ Florian: rebase on top of pending nf_conncount fixes ]

Fixes: 5c789e131c ("netfilter: nf_conncount: Add list lock and gc worker, and RCU for init tree search")
Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-29 02:45:22 +01:00
Florian Westphal
df4a902509 netfilter: nf_conncount: merge lookup and add functions
'lookup' is always followed by 'add'.
Merge both and make the list-walk part of nf_conncount_add().

This also avoids one unneeded unlock/re-lock pair.

Extra care needs to be taken in count_tree, as we only hold rcu
read lock, i.e. we can only insert to an existing tree node after
acquiring its lock and making sure it has a nonzero count.

As a zero count should be rare, just fall back to insert_tree()
(which acquires tree lock).

This issue and its solution were pointed out by Shawn Bohrer
during patch review.

Reviewed-by: Shawn Bohrer <sbohrer@cloudflare.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-29 02:45:20 +01:00
Peter Oskolkov
c92c81df93 net: dccp: fix kernel crash on module load
Patch eedbbb0d98 "net: dccp: initialize (addr,port) ..."
added calling to inet_hashinfo2_init() from dccp_init().

However, inet_hashinfo2_init() is marked as __init(), and
thus the kernel panics when dccp is loaded as module. Removing
__init() tag from inet_hashinfo2_init() is not feasible because
it calls into __init functions in mm.

This patch adds inet_hashinfo2_init_mod() function that can
be called after the init phase is done; changes dccp_init() to
call the new function; un-marks inet_hashinfo2_init() as
exported.

Fixes: eedbbb0d98 ("net: dccp: initialize (addr,port) ...")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-24 15:27:56 -08:00
David S. Miller
c3e5336925 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for net-next:

1) Support for destination MAC in ipset, from Stefano Brivio.

2) Disallow all-zeroes MAC address in ipset, also from Stefano.

3) Add IPSET_CMD_GET_BYNAME and IPSET_CMD_GET_BYINDEX commands,
   introduce protocol version number 7, from Jozsef Kadlecsik.
   A follow up patch to fix ip_set_byindex() is also included
   in this batch.

4) Honor CTA_MARK_MASK from ctnetlink, from Andreas Jaggi.

5) Statify nf_flow_table_iterate(), from Taehee Yoo.

6) Use nf_flow_table_iterate() to simplify garbage collection in
   nf_flow_table logic, also from Taehee Yoo.

7) Don't use _bh variants of call_rcu(), rcu_barrier() and
   synchronize_rcu_bh() in Netfilter, from Paul E. McKenney.

8) Remove NFC_* cache definition from the old caching
   infrastructure.

9) Remove layer 4 port rover in NAT helpers, use random port
   instead, from Florian Westphal.

10) Use strscpy() in ipset, from Qian Cai.

11) Remove NF_NAT_RANGE_PROTO_RANDOM_FULLY branch now that
    random port is allocated by default, from Xiaozhou Liu.

12) Ignore NF_NAT_RANGE_PROTO_RANDOM too, from Florian Westphal.

13) Limit port allocation selection routine in NAT to avoid
    softlockup splats when most ports are in use, from Florian.

14) Remove unused parameters in nf_ct_l4proto_unregister_sysctl()
    from Yafang Shao.

15) Direct call to nf_nat_l4proto_unique_tuple() instead of
    indirection, from Florian Westphal.

16) Several patches to remove all layer 4 NAT indirections,
    remove nf_nat_l4proto struct, from Florian Westphal.

17) Fix RTP/RTCP source port translation when SNAT is in place,
    from Alin Nastac.

18) Selective rule dump per chain, from Phil Sutter.

19) Revisit CLUSTERIP target, this includes a deadlock fix from
    netns path, sleep in atomic, remove bogus WARN_ON_ONCE()
    and disallow mismatching IP address and MAC address.
    Patchset from Taehee Yoo.

20) Update UDP timeout to stream after 2 seconds, from Florian.

21) Shrink UDP established timeout to 120 seconds like TCP timewait.

22) Sysctl knobs to set GRE timeouts, from Yafang Shao.

23) Move seq_print_acct() to conntrack core file, from Florian.

24) Add enum for conntrack sysctl knobs, also from Florian.

25) Place nf_conntrack_acct, nf_conntrack_helper, nf_conntrack_events
    and nf_conntrack_timestamp knobs in the core, from Florian Westphal.
    As a side effect, shrink netns_ct structure by removing obsolete
    sysctl anchors, also from Florian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-20 18:20:26 -08:00
David S. Miller
339bbff2d6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-12-21

The following pull-request contains BPF updates for your *net-next* tree.

There is a merge conflict in test_verifier.c. Result looks as follows:

        [...]
        },
        {
                "calls: cross frame pruning",
                .insns = {
                [...]
                .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
                .errstr_unpriv = "function calls to other bpf functions are allowed for root only",
                .result_unpriv = REJECT,
                .errstr = "!read_ok",
                .result = REJECT,
	},
        {
                "jset: functional",
                .insns = {
        [...]
        {
                "jset: unknown const compare not taken",
                .insns = {
                        BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0,
                                     BPF_FUNC_get_prandom_u32),
                        BPF_JMP_IMM(BPF_JSET, BPF_REG_0, 1, 1),
                        BPF_LDX_MEM(BPF_B, BPF_REG_8, BPF_REG_9, 0),
                        BPF_EXIT_INSN(),
                },
                .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
                .errstr_unpriv = "!read_ok",
                .result_unpriv = REJECT,
                .errstr = "!read_ok",
                .result = REJECT,
        },
        [...]
        {
                "jset: range",
                .insns = {
                [...]
                },
                .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
                .result_unpriv = ACCEPT,
                .result = ACCEPT,
        },

The main changes are:

1) Various BTF related improvements in order to get line info
   working. Meaning, verifier will now annotate the corresponding
   BPF C code to the error log, from Martin and Yonghong.

2) Implement support for raw BPF tracepoints in modules, from Matt.

3) Add several improvements to verifier state logic, namely speeding
   up stacksafe check, optimizations for stack state equivalence
   test and safety checks for liveness analysis, from Alexei.

4) Teach verifier to make use of BPF_JSET instruction, add several
   test cases to kselftests and remove nfp specific JSET optimization
   now that verifier has awareness, from Jakub.

5) Improve BPF verifier's slot_type marking logic in order to
   allow more stack slot sharing, from Jiong.

6) Add sk_msg->size member for context access and add set of fixes
   and improvements to make sock_map with kTLS usable with openssl
   based applications, from John.

7) Several cleanups and documentation updates in bpftool as well as
   auto-mount of tracefs for "bpftool prog tracelog" command,
   from Quentin.

8) Include sub-program tags from now on in bpf_prog_info in order to
   have a reliable way for user space to get all tags of the program
   e.g. needed for kallsyms correlation, from Song.

9) Add BTF annotations for cgroup_local_storage BPF maps and
   implement bpf fs pretty print support, from Roman.

10) Fix bpftool in order to allow for cross-compilation, from Ivan.

11) Update of bpftool license to GPLv2-only + BSD-2-Clause in order
    to be compatible with libbfd and allow for Debian packaging,
    from Jakub.

12) Remove an obsolete prog->aux sanitation in dump and get rid of
    version check for prog load, from Daniel.

13) Fix a memory leak in libbpf's line info handling, from Prashant.

14) Fix cpumap's frame alignment for build_skb() so that skb_shared_info
    does not get unaligned, from Jesper.

15) Fix test_progs kselftest to work with older compilers which are less
    smart in optimizing (and thus throwing build error), from Stanislav.

16) Cleanup and simplify AF_XDP socket teardown, from Björn.

17) Fix sk lookup in BPF kselftest's test_sock_addr with regards
    to netns_id argument, from Andrey.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-20 17:31:36 -08:00
Peter Oskolkov
a6ae520def net: seg6.h: remove an unused #include
A minor code cleanup.

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-20 16:56:04 -08:00
Florian Westphal
8527f9df04 netfilter: netns: shrink netns_ct struct
remove the obsolete sysctl anchors and move auto_assign_helper_warned
to avoid/cover a hole.  Reduces size by 40 bytes on 64 bit.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21 00:51:56 +01:00
Florian Westphal
fc3893fd5c netfilter: conntrack: remove empty pernet fini stubs
after moving sysctl handling into single place, the init functions
can't fail anymore and some of the fini functions are empty.

Remove them and change return type to void.
This also simplifies error unwinding in conntrack module init path.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21 00:51:54 +01:00
Florian Westphal
4b216e21cf netfilter: conntrack: un-export seq_print_acct
Only one caller, just place it where its needed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21 00:51:39 +01:00
Florian Westphal
d535c8a69c netfilter: conntrack: udp: only extend timeout to stream mode after 2s
Currently DNS resolvers that send both A and AAAA queries from same source port
can trigger stream mode prematurely, which results in non-early-evictable conntrack entry
for three minutes, even though DNS requests are done in a few milliseconds.

Add a two second grace period where we continue to use the ordinary
30-second default timeout.  Its enough for DNS request/response traffic,
even if two request/reply packets are involved.

ASSURED is still set, else conntrack (and thus a possible
NAT mapping ...) gets zapped too in case conntrack table runs full.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-21 00:48:38 +01:00
John Fastabend
0608c69c9a bpf: sk_msg, sock{map|hash} redirect through ULP
A sockmap program that redirects through a kTLS ULP enabled socket
will not work correctly because the ULP layer is skipped. This
fixes the behavior to call through the ULP layer on redirect to
ensure any operations required on the data stream at the ULP layer
continue to be applied.

To do this we add an internal flag MSG_SENDPAGE_NOPOLICY to avoid
calling the BPF layer on a redirected message. This is
required to avoid calling the BPF layer multiple times (possibly
recursively) which is not the current/expected behavior without
ULPs. In the future we may add a redirect flag if users _do_
want the policy applied again but this would need to work for both
ULP and non-ULP sockets and be opt-in to avoid breaking existing
programs.

Also to avoid polluting the flag space with an internal flag we
reuse the flag space overlapping MSG_SENDPAGE_NOPOLICY with
MSG_WAITFORONE. Here WAITFORONE is specific to recv path and
SENDPAGE_NOPOLICY is only used for sendpage hooks. The last thing
to verify is user space API is masked correctly to ensure the flag
can not be set by user. (Note this needs to be true regardless
because we have internal flags already in-use that user space
should not be able to set). But for completeness we have two UAPI
paths into sendpage, sendfile and splice.

In the sendfile case the function do_sendfile() zero's flags,

./fs/read_write.c:
 static ssize_t do_sendfile(int out_fd, int in_fd, loff_t *ppos,
		   	    size_t count, loff_t max)
 {
   ...
   fl = 0;
#if 0
   /*
    * We need to debate whether we can enable this or not. The
    * man page documents EAGAIN return for the output at least,
    * and the application is arguably buggy if it doesn't expect
    * EAGAIN on a non-blocking file descriptor.
    */
    if (in.file->f_flags & O_NONBLOCK)
	fl = SPLICE_F_NONBLOCK;
#endif
    file_start_write(out.file);
    retval = do_splice_direct(in.file, &pos, out.file, &out_pos, count, fl);
 }

In the splice case the pipe_to_sendpage "actor" is used which
masks flags with SPLICE_F_MORE.

./fs/splice.c:
 static int pipe_to_sendpage(struct pipe_inode_info *pipe,
			    struct pipe_buffer *buf, struct splice_desc *sd)
 {
   ...
   more = (sd->flags & SPLICE_F_MORE) ? MSG_MORE : 0;
   ...
 }

Confirming what we expect that internal flags  are in fact internal
to socket side.

Fixes: d3b18ad31f ("tls: add bpf support to sk_msg handling")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-12-20 23:47:09 +01:00
David S. Miller
2be09de7d6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Lots of conflicts, by happily all cases of overlapping
changes, parallel adds, things of that nature.

Thanks to Stephen Rothwell, Saeed Mahameed, and others
for their guidance in these resolutions.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-20 11:53:36 -08:00
wenxu
1875a9ab01 iptunnel: make TUNNEL_FLAGS available in uapi
ip l add dev tun type gretap external
ip r a 10.0.0.1 encap ip dst 192.168.152.171 id 1000 dev gretap

For gretap Key example when the command set the id but don't set the
TUNNEL_KEY flags. There is no key field in the send packet

In the lwtunnel situation, some TUNNEL_FLAGS should can be set by
userspace

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 15:58:01 -08:00
Roopa Prabhu
82cbb5c631 neighbour: register rtnl doit handler
this patch registers neigh doit handler. The doit handler
returns a neigh entry given dst and dev. This is similar
to route and fdb doit (get) handlers. Also moves nda_policy
declaration from rtnetlink.c to neighbour.c

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Reviewed-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 13:37:34 -08:00
Florian Westphal
4165079ba3 net: switch secpath to use skb extension infrastructure
Remove skb->sp and allocate secpath storage via extension
infrastructure.  This also reduces sk_buff by 8 bytes on x86_64.

Total size of allyesconfig kernel is reduced slightly, as there is
less inlined code (one conditional atomic op instead of two on
skb_clone).

No differences in throughput in following ipsec performance tests:
- transport mode with aes on 10GB link
- tunnel mode between two network namespaces with aes and null cipher

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 11:21:38 -08:00
Florian Westphal
26912e3756 xfrm: use secpath_exist where applicable
Will reduce noise when skb->sp is removed later in this series.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 11:21:37 -08:00
Florian Westphal
2294be0f11 net: use skb_sec_path helper in more places
skb_sec_path gains 'const' qualifier to avoid
xt_policy.c: 'skb_sec_path' discards 'const' qualifier from pointer target type

same reasoning as previous conversions: Won't need to touch these
spots anymore when skb->sp is removed.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 11:21:37 -08:00
Florian Westphal
7af8f4ca31 net: move secpath_exist helper to sk_buff.h
Future patch will remove skb->sp pointer.
To reduce noise in those patches, move existing helper to
sk_buff and use it in more places to ease skb->sp replacement later.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 11:21:37 -08:00
Florian Westphal
0ca64da128 xfrm: change secpath_set to return secpath struct, not error value
It can only return 0 (success) or -ENOMEM.
Change return value to a pointer to secpath struct.

This avoids direct access to skb->sp:

err = secpath_set(skb);
if (!err) ..
skb->sp-> ...

Becomes:
sp = secpath_set(skb)
if (!sp) ..
sp-> ..

This reduces noise in followup patch which is going to remove skb->sp.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 11:21:37 -08:00
Florian Westphal
de8bda1d22 net: convert bridge_nf to use skb extension infrastructure
This converts the bridge netfilter (calling iptables hooks from bridge)
facility to use the extension infrastructure.

The bridge_nf specific hooks in skb clone and free paths are removed, they
have been replaced by the skb_ext hooks that do the same as the bridge nf
allocations hooks did.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 11:21:37 -08:00
Florian Westphal
c4b0e771f9 netfilter: avoid using skb->nf_bridge directly
This pointer is going to be removed soon, so use the existing helpers in
more places to avoid noise when the removal happens.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 11:21:37 -08:00
David S. Miller
5a862f86b8 This time we have too many changes to list, highlights:
* virt_wifi - wireless control simulation on top of
    another network interface
  * hwsim configurability to test capabilities similar
    to real hardware
  * various mesh improvements
  * various radiotap vendor data fixes in mac80211
  * finally the nl_set_extack_cookie_u64() we talked
    about previously, used for
  * peer measurement APIs, right now only with FTM
    (flight time measurement) for location
  * made nl80211 radio/interface announcements more complete
  * various new HE (802.11ax) things:
    updates, TWT support, ...
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlwaCwkACgkQB8qZga/f
 l8S7mA/+I1CJmGC7Pvy+SBFkzoY5zEjjzgZYL6sGo16qMs89NPcURSe5j+uCsDP3
 nKEjsvhQMYDfGNLTJJfWbDpGwm9LnKp69AFITlvfzmP6Sm36QMZr7oIC4abi8cW4
 osaO3qfdaNoZ//x72jgjrFhUAnphvT2BsRVMNEjz7sXcDd7Jm9NnpRhV8zgXFvLF
 dS2Ng51LM/BLMz5jQpyJUDZeeL/iBYybCecyckmVqzXPh1icIZETSqZXiN4ngv2A
 6p9BSGNtP6wmjnbkvZz5RDq76VhTPZWsTgTpVb45Wf1k2fm1rB96UgpqvfQtjTgB
 +7Zx2WRpMXM5OjGkwaEs8nawFmt7MHCGnhLPLWPCbXc685fhp3OFShysMJdYS/GZ
 IIRJ7+IchAQX1yluftB+NkQM9sBDjyseMBwxHRYkj/rQVhoLY1sT+ke7lkuV10o6
 DQqfpUTZAsIz7zkuscn7hkNdI/Rjub6BZjbrs1Jt9zSt9WQUBao23XudOI0j5JDa
 ErnfC5PISXMQWik5B9M1Zhq3H9qCI2Swh19lMmtxtSDQ9yrLrJkEJ5SA+aHoxNHj
 wSxBc3XXSW47qPXGX/D5DNnbOcOrE7kVZuD8YqRsy8VedyjIgEw7oQ21flAD4FC4
 R4TgbNkqpfZQsU29gaMkDkYXnfQDB/G9FOk6ARGxjBPjT55Hz0E=
 =EpyK
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2018-12-19' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
This time we have too many changes to list, highlights:
 * virt_wifi - wireless control simulation on top of
   another network interface
 * hwsim configurability to test capabilities similar
   to real hardware
 * various mesh improvements
 * various radiotap vendor data fixes in mac80211
 * finally the nl_set_extack_cookie_u64() we talked
   about previously, used for
 * peer measurement APIs, right now only with FTM
   (flight time measurement) for location
 * made nl80211 radio/interface announcements more complete
 * various new HE (802.11ax) things:
   updates, TWT support, ...
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-19 08:36:18 -08:00
David S. Miller
fde9cd69a5 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec
Steffen Klassert says:

====================
pull request (net): ipsec 2018-12-18

1) Fix error return code in xfrm_output_one()
   when no dst_entry is attached to the skb.
   From Wei Yongjun.

2) The xfrm state hash bucket count reported to
   userspace is off by one. Fix from Benjamin Poirier.

3) Fix NULL pointer dereference in xfrm_input when
   skb_dst_force clears the dst_entry.

4) Fix freeing of xfrm states on acquire. We use a
   dedicated slab cache for the xfrm states now,
   so free it properly with kmem_cache_free.
   From Mathias Krause.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-18 11:43:26 -08:00
David S. Miller
77c7a7b3e7 Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next
Steffen Klassert says:

====================
pull request (net-next): ipsec-next 2018-12-18

1) Add xfrm policy selftest scripts.
   From Florian Westphal.

2) Split inexact policies into four different search list
   classes and use the rbtree infrastructure to store/lookup
   the policies. This is to improve the policy lookup
   performance after the flowcache removal.
   Patches from Florian Westphal.

3) Various coding style fixes, from Colin Ian King.

4) Fix policy lookup logic after adding the inexact policy
   search tree infrastructure. From Florian Westphal.

5) Remove a useless remove BUG_ON from xfrm6_dst_ifdown.
   From Li RongQing.

6) Use the correct policy direction for lookups on hash
   rebuilding. From Florian Westphal.

Please pull or let me know if there are problems.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-18 08:49:48 -08:00
Emmanuel Grumbach
55ebd6e6c7 mac80211: propagate the support for TWT to the driver
TWT is a feature that was added in 11ah and enhanced in
11ax. There are two bits that need to be set if we want
to use the feature in 11ax: one in the HE Capability IE
and one in the Extended Capability IE. This is because
of backward compatibility between 11ah and 11ax.

In order to simplify the flow for the low level driver
in managed mode, aggregate the two bits and add a boolean
that tells whether TWT is supported or not, but only if
11ax is supported.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-12-18 14:18:49 +01:00
Johannes Berg
fca1279f09 mac80211: document RCU requirements for ieee80211_tx_dequeue()
In the iwlwifi conversion, we sometimes call this from outside
of the wake_tx_queue() method, and in those cases must be in an
RCU critical section. Document this requirement.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-12-18 13:18:30 +01:00
Johannes Berg
30db641ef4 cfg80211: clarify LCI/civic location documentation
The older code and current userspace assumed that this data
is the content of the Measurement Report element, starting
with the Measurement Token. Clarify this in the documentation.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-12-18 13:15:04 +01:00
Randy Dunlap
3453de9814 wireless: FTM: fix kernel-doc "cannot understand" warnings
Fix kernel-doc warnings in FTM due to missing "struct" keyword.

Fixes 109 warnings from <net/cfg80211.h>:
../include/net/cfg80211.h:2838: warning: cannot understand function prototype: 'struct cfg80211_ftm_responder_stats '

and fixes 88 warnings from <net/mac80211.h>:
../include/net/mac80211.h:477: warning: cannot understand function prototype: 'struct ieee80211_ftm_responder_params '

Fixes: 81e54d08d9 ("cfg80211: support FTM responder configuration/statistics")
Fixes: bc847970f4 ("mac80211: support FTM responder configuration/statistics")

Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
Cc: Pradeep Kumar Chitrapu <pradeepc@codeaurora.org>
Cc: Johannes Berg <johannes.berg@intel.com>
Cc: David Spinadel <david.spinadel@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-12-18 13:14:23 +01:00
Willem de Bruijn
8f932f762e net: add missing SOF_TIMESTAMPING_OPT_ID support
SOF_TIMESTAMPING_OPT_ID is supported on TCP, UDP and RAW sockets.
But it was missing on RAW with IPPROTO_IP, PF_PACKET and CAN.

Add skb_setup_tx_timestamp that configures both tx_flags and tskey
for these paths that do not need corking or use bytestream keys.

Fixes: 09c2d251b7 ("net-timestamp: add key to disambiguate concurrent datagrams")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-17 23:27:00 -08:00
Florian Westphal
5cbabeec1e netfilter: nat: remove nf_nat_l4proto struct
This removes the (now empty) nf_nat_l4proto struct, all its instances
and all the no longer needed runtime (un)register functionality.

nf_nat_need_gre() can be axed as well: the module that calls it (to
load the no-longer-existing nat_gre module) also calls other nat core
functions. GRE nat is now always available if kernel is built with it.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:31 +01:00
Florian Westphal
faec18dbb0 netfilter: nat: remove l4proto->manip_pkt
This removes the last l4proto indirection, the two callers, the l3proto
packet mangling helpers for ipv4 and ipv6, now call the
nf_nat_l4proto_manip_pkt() helper.

nf_nat_proto_{dccp,tcp,sctp,gre,icmp,icmpv6} are left behind, even though
they contain no functionality anymore to not clutter this patch.

Next patch will remove the empty files and the nf_nat_l4proto
struct.

nf_nat_proto_udp.c is renamed to nf_nat_proto.c, as it now contains the
other nat manip functionality as well, not just udp and udplite.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:29 +01:00
Florian Westphal
76b90019e0 netfilter: nat: remove l4proto->nlattr_to_range
all protocols did set this to nf_nat_l4proto_nlattr_to_range, so
just call it directly.

The important difference is that we'll now also call it for
protocols that we don't support (i.e., nf_nat_proto_unknown did
not provide .nlattr_to_range).

However, there should be no harm, even icmp provided this callback.
If we don't implement a specific l4nat for this, nothing would make
use of this information, so adding a big switch/case construct listing
all supported l4protocols seems a bit pointless.

This change leaves a single function pointer in the l4proto struct.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:23 +01:00
Florian Westphal
fe2d002099 netfilter: nat: remove l4proto->in_range
With exception of icmp, all of the l4 nat protocols set this to
nf_nat_l4proto_in_range.

Get rid of this and just check the l4proto in the caller.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:14 +01:00
Florian Westphal
40e786bd29 netfilter: nat: fold in_range indirection into caller
No need for indirections here, we only support ipv4 and ipv6
and the called functions are very small.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:09 +01:00
Florian Westphal
203f2e7820 netfilter: nat: remove l4proto->unique_tuple
fold remaining users (icmp, icmpv6, gre) into nf_nat_l4proto_unique_tuple.
The static-save of old incarnation of resolved key in gre and icmp is
removed as well, just use the prandom based offset like the others.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:33:04 +01:00
Florian Westphal
716b23c19e netfilter: nat: un-export nf_nat_l4proto_unique_tuple
almost all l4proto->unique_tuple implementations just call this helper,
so make ->unique_tuple() optional and call its helper directly if the
l4proto doesn't override it.

This is an intermediate step to get rid of ->unique_tuple completely.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:32:57 +01:00
Florian Westphal
912da924a2 netfilter: remove NF_NAT_RANGE_PROTO_RANDOM support
Historically this was net_random() based, and was then converted to
a hash based algorithm (private boot seed + hash of endpoint addresses)
due to concerns of leaking net_random() bits.

RANDOM_FULLY mode was added later to avoid problems with hash
based mode (see commit 34ce324019,
"netfilter: nf_nat: add full port randomization support" for details).

Just make prandom_u32() the default search starting point and get rid of
->secure_port() altogether.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-17 23:32:36 +01:00
Tristram Ha
39d6b96f9f net: dsa: ksz: Rename NET_DSA_TAG_KSZ to _KSZ9477
Rename the tag Kconfig option and related macros in preparation for
addition of new KSZ family switches with different tag formats.

Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>
Signed-off-by: Marek Vasut <marex@denx.de>
Cc: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Cc: Woojung Huh <woojung.huh@microchip.com>
Cc: David S. Miller <davem@davemloft.net>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-16 14:23:33 -08:00
David Ahern
df9b0e30d4 neighbor: Add protocol attribute
Similar to routes and rules, add protocol attribute to neighbor entries
for easier tracking of how each was created.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-16 12:15:25 -08:00
Paolo Abeni
028e0a4766 net: use indirect call wrappers at GRO transport layer
This avoids an indirect call in the receive path for TCP and UDP
packets. TCP takes precedence on UDP, so that we have a single
additional conditional in the common case.

When IPV6 is build as module, all gro symbols except UDPv6 are
builtin, while the latter belong to the ipv6 module, so we
need some special care.

v1 -> v2:
 - adapted to INDIRECT_CALL_ changes
v2 -> v3:
 - fix build issue with CONFIG_IPV6=m

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-15 13:23:02 -08:00
Paolo Abeni
aaa5d90b39 net: use indirect call wrappers at GRO network layer
This avoids an indirect calls for L3 GRO receive path, both
for ipv4 and ipv6, if the latter is not compiled as a module.

Note that when IPv6 is compiled as builtin, it will be checked first,
so we have a single additional compare for the more common path.

v1 -> v2:
 - adapted to INDIRECT_CALL_ changes

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-15 13:23:02 -08:00
David Ahern
4b7cd11f22 neighbor: Improve neighbour struct layout
Move arp_queue_len_bytes ahead of arp_queue to remove two 4-byte holes.
Ensure ha element is always 8-byte aligned.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-15 11:49:29 -08:00
David Ahern
526f1b587c neighbor: Move neigh_update_ext_learned to core file
neigh_update_ext_learned has one caller in neighbour.c so does not need
to be defined in the header. Move it and in the process remove the
intialization of ndm_flags and just set it based on the flags check.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-14 15:44:47 -08:00
Cong Wang
aeb3fecde8 net_sched: fold tcf_block_cb_call() into tc_setup_cb_call()
After commit 69bd48404f ("net/sched: Remove egdev mechanism"),
tc_setup_cb_call() is nearly identical to tcf_block_cb_call(),
so we can just fold tcf_block_cb_call() into tc_setup_cb_call()
and remove its unused parameter 'exts'.

Fixes: 69bd48404f ("net/sched: Remove egdev mechanism")
Cc: Oz Shlomo <ozsh@mellanox.com>
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-14 15:32:19 -08:00
Atul Gupta
df9d4a1780 net/tls: sleeping function from invalid context
HW unhash within mutex for registered tls devices cause sleep
when called from tcp_set_state for TCP_CLOSE. Release lock and
re-acquire after function call with ref count incr/dec.
defined kref and fp release for tls_device to ensure device
is not released outside lock.

BUG: sleeping function called from invalid context at
kernel/locking/mutex.c:748
in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/7
INFO: lockdep is turned off.
CPU: 7 PID: 0 Comm: swapper/7 Tainted: G        W  O
Call Trace:
 <IRQ>
 dump_stack+0x5e/0x8b
 ___might_sleep+0x222/0x260
 __mutex_lock+0x5c/0xa50
 ? vprintk_emit+0x1f3/0x440
 ? kmem_cache_free+0x22d/0x2a0
 ? tls_hw_unhash+0x2f/0x80
 ? printk+0x52/0x6e
 ? tls_hw_unhash+0x2f/0x80
 tls_hw_unhash+0x2f/0x80
 tcp_set_state+0x5f/0x180
 tcp_done+0x2e/0xe0
 tcp_rcv_state_process+0x92c/0xdd3
 ? lock_acquire+0xf5/0x1f0
 ? tcp_v4_rcv+0xa7c/0xbe0
 ? tcp_v4_do_rcv+0x70/0x1e0

Signed-off-by: Atul Gupta <atul.gupta@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-14 13:39:39 -08:00
Petr Machata
6921351359 net: switchdev: Add extack to switchdev_handle_port_obj_add() callback
Drivers use switchdev_handle_port_obj_add() to handle recursive descent
through lower devices. Change this function prototype to take add_cb
that itself takes an extack argument. Decode extack from
switchdev_notifier_port_obj_info and pass it to add_cb.

Update mlxsw and ocelot drivers which use this helper.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-12 16:34:22 -08:00
Petr Machata
479c86dc55 net: switchdev: Add extack to struct switchdev_notifier_info
In order to pass extack to the drivers that need it, add an extack field
to struct switchdev_notifier_info, and an extack argument to the
function call_switchdev_blocking_notifiers(). Also add a helper function
switchdev_notifier_info_to_extack().

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-12 16:34:22 -08:00
Petr Machata
69b7320e14 net: switchdev: Add extack argument to switchdev_port_obj_add()
After the previous patch, bridge driver has extack argument available to
pass to switchdev. Therefore extend switchdev_port_obj_add() with this
argument, updating all callers, and passing the argument through to
switchdev_port_obj_notify().

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Ivan Vecera <ivecera@redhat.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-12 16:34:21 -08:00
Oz Shlomo
69bd48404f net/sched: Remove egdev mechanism
The egdev mechanism was replaced by the TC indirect block notifications
platform.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Cc: John Hurley <john.hurley@netronome.com>
Cc: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-12-10 15:54:34 -08:00
Oz Shlomo
0621e6fc5e net: Add netif_is_gretap()/netif_is_ip6gretap()
Changed the is_gretap_dev and is_ip6gretap_dev logic from structure
comparison to string comparison of the rtnl_link_ops kind field.

This approach aligns with the current identification methods and function
names of vxlan and geneve network devices.

Convert mlxsw to use these helpers and use them in downstream mlx5 patch.

Signed-off-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Eli Britstein <elibr@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
2018-12-10 15:53:04 -08:00
Colin Ian King
77990464bb xfrm: clean an indentation issue, remove a space
Trivial fix to clean up indentation issue, remove an extraneous
space.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-12-10 12:30:40 +01:00
David S. Miller
4cc1feeb6f Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Several conflicts, seemingly all over the place.

I used Stephen Rothwell's sample resolutions for many of these, if not
just to double check my own work, so definitely the credit largely
goes to him.

The NFP conflict consisted of a bug fix (moving operations
past the rhashtable operation) while chaning the initial
argument in the function call in the moved code.

The net/dsa/master.c conflict had to do with a bug fix intermixing of
making dsa_master_set_mtu() static with the fixing of the tagging
attribute location.

cls_flower had a conflict because the dup reject fix from Or
overlapped with the addition of port range classifiction.

__set_phy_supported()'s conflict was relatively easy to resolve
because Andrew fixed it in both trees, so it was just a matter
of taking the net-next copy.  Or at least I think it was :-)

Joe Stringer's fix to the handling of netns id 0 in bpf_sk_lookup()
intermixed with changes on how the sdif and caller_net are calculated
in these code paths in net-next.

The remaining BPF conflicts were largely about the addition of the
__bpf_md_ptr stuff in 'net' overlapping with adjustments and additions
to the relevant data structure where the MD pointer macros are used.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-09 21:43:31 -08:00
Stefano Brivio
e6ac64d4c4 neighbour: Avoid writing before skb->head in neigh_hh_output()
While skb_push() makes the kernel panic if the skb headroom is less than
the unaligned hardware header size, it will proceed normally in case we
copy more than that because of alignment, and we'll silently corrupt
adjacent slabs.

In the case fixed by the previous patch,
"ipv6: Check available headroom in ip6_xmit() even without options", we
end up in neigh_hh_output() with 14 bytes headroom, 14 bytes hardware
header and write 16 bytes, starting 2 bytes before the allocated buffer.

Always check we're not writing before skb->head and, if the headroom is
not enough, warn and drop the packet.

v2:
 - instead of panicking with BUG_ON(), WARN_ON_ONCE() and drop the packet
   (Eric Dumazet)
 - if we avoid the panic, though, we need to explicitly check the headroom
   before the memcpy(), otherwise we'll have corrupted slabs on a running
   kernel, after we warn
 - use __skb_push() instead of skb_push(), as the headroom check is
   already implemented here explicitly (Eric Dumazet)

Signed-off-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-07 16:24:40 -08:00
David Ahern
58956317c8 neighbor: Improve garbage collection
The existing garbage collection algorithm has a number of problems:

1. The gc algorithm will not evict PERMANENT entries as those entries
   are managed by userspace, yet the existing algorithm walks the entire
   hash table which means it always considers PERMANENT entries when
   looking for entries to evict. In some use cases (e.g., EVPN) there
   can be tens of thousands of PERMANENT entries leading to wasted
   CPU cycles when gc kicks in. As an example, with 32k permanent
   entries, neigh_alloc has been observed taking more than 4 msec per
   invocation.

2. Currently, when the number of neighbor entries hits gc_thresh2 and
   the last flush for the table was more than 5 seconds ago gc kicks in
   walks the entire hash table evicting *all* entries not in PERMANENT
   or REACHABLE state and not marked as externally learned. There is no
   discriminator on when the neigh entry was created or if it just moved
   from REACHABLE to another NUD_VALID state (e.g., NUD_STALE).

   It is possible for entries to be created or for established neighbor
   entries to be moved to STALE (e.g., an external node sends an ARP
   request) right before the 5 second window lapses:

        -----|---------x|----------|-----
            t-5         t         t+5

   If that happens those entries are evicted during gc causing unnecessary
   thrashing on neighbor entries and userspace caches trying to track them.

   Further, this contradicts the description of gc_thresh2 which says
   "Entries older than 5 seconds will be cleared".

   One workaround is to make gc_thresh2 == gc_thresh3 but that negates the
   whole point of having separate thresholds.

3. Clearing *all* neigh non-PERMANENT/REACHABLE/externally learned entries
   when gc_thresh2 is exceeded is over kill and contributes to trashing
   especially during startup.

This patch addresses these problems as follows:

1. Use of a separate list_head to track entries that can be garbage
   collected along with a separate counter. PERMANENT entries are not
   added to this list.

   The gc_thresh parameters are only compared to the new counter, not the
   total entries in the table. The forced_gc function is updated to only
   walk this new gc_list looking for entries to evict.

2. Entries are added to the list head at the tail and removed from the
   front.

3. Entries are only evicted if they were last updated more than 5 seconds
   ago, adhering to the original intent of gc_thresh2.

4. Forced gc is stopped once the number of gc_entries drops below
   gc_thresh2.

5. Since gc checks do not apply to PERMANENT entries, gc levels are skipped
   when allocating a new neighbor for a PERMANENT entry. By extension this
   means there are no explicit limits on the number of PERMANENT entries
   that can be created, but this is no different than FIB entries or FDB
   entries.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-07 16:03:10 -08:00
Petr Machata
e5ff4b1952 vxlan: Add vxlan_fdb_clear_offload()
When a driver unoffloads all FDB entries en bloc, it's inefficient to
send the switchdev notification one by one. Add a helper that walks the
FDB table, unsetting the offload flag on RDST with a given VNI.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-07 12:59:08 -08:00
Petr Machata
4f89f5b535 vxlan: Add vxlan_fdb_replay()
When a VXLAN device becomes relevant to a driver (such as when it is
attached to an offloaded bridge), the driver will generally need to walk
the existing FDB entries and offload them.

Add a function vxlan_fdb_replay() to call a given notifier block for
each FDB entry with a given VNI.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-07 12:59:08 -08:00
Andrew Lunn
a5dd308778 net: dsa: Add overhead to tag protocol ops.
Each DSA tag protocol needs to add additional headers to the Ethernet
frame in order to direct it towards a specific switch egress port. It
must also remove the head from a frame received from a
switch. Indicate the maximum size of these headers in the tag protocol
ops structure, so the core can take these overheads into account.

Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-06 12:18:16 -08:00
Jakub Audykowicz
afd0a8006e sctp: frag_point sanity check
If for some reason an association's fragmentation point is zero,
sctp_datamsg_from_user will try to endlessly try to divide a message
into zero-sized chunks. This eventually causes kernel panic due to
running out of memory.

Although this situation is quite unlikely, it has occurred before as
reported. I propose to add this simple last-ditch sanity check due to
the severity of the potential consequences.

Signed-off-by: Jakub Audykowicz <jakub.audykowicz@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-05 20:37:52 -08:00
Eric Dumazet
a74f0fa082 tcp: reduce POLLOUT events caused by TCP_NOTSENT_LOWAT
TCP_NOTSENT_LOWAT socket option or sysctl was added in linux-3.12
as a step to enable bigger tcp sndbuf limits.

It works reasonably well, but the following happens :

Once the limit is reached, TCP stack generates
an [E]POLLOUT event for every incoming ACK packet.

This causes a high number of context switches.

This patch implements the strategy David Miller added
in sock_def_write_space() :

 - If TCP socket has a notsent_lowat constraint of X bytes,
   allow sendmsg() to fill up to X bytes, but send [E]POLLOUT
   only if number of notsent bytes is below X/2

This considerably reduces TCP_NOTSENT_LOWAT overhead,
while allowing to keep the pipe full.

Tested:
 100 ms RTT netem testbed between A and B, 100 concurrent TCP_STREAM

A:/# cat /proc/sys/net/ipv4/tcp_wmem
4096	262144	64000000
A:/# super_netperf 100 -H B -l 1000 -- -K bbr &

A:/# grep TCP /proc/net/sockstat
TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 1364904 # This is about 54 MB of memory per flow :/

A:/# vmstat 5 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 256220672  13532 694976    0    0    10     0   28   14  0  1 99  0  0
 2  0      0 256320016  13532 698480    0    0   512     0 715901 5927  0 10 90  0  0
 0  0      0 256197232  13532 700992    0    0   735    13 771161 5849  0 11 89  0  0
 1  0      0 256233824  13532 703320    0    0   512    23 719650 6635  0 11 89  0  0
 2  0      0 256226880  13532 705780    0    0   642     4 775650 6009  0 12 88  0  0

A:/# echo 2097152 >/proc/sys/net/ipv4/tcp_notsent_lowat

A:/# grep TCP /proc/net/sockstat
TCP: inuse 203 orphan 0 tw 19 alloc 414 mem 86411 # 3.5 MB per flow

A:/# vmstat 5 5  # check that context switches have not inflated too much.
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 2  0      0 260386512  13592 662148    0    0    10     0   17   14  0  1 99  0  0
 0  0      0 260519680  13592 604184    0    0   512    13 726843 12424  0 10 90  0  0
 1  1      0 260435424  13592 598360    0    0   512    25 764645 12925  0 10 90  0  0
 1  0      0 260855392  13592 578380    0    0   512     7 722943 13624  0 11 88  0  0
 1  0      0 260445008  13592 601176    0    0   614    34 772288 14317  0 10 90  0  0

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-04 21:21:18 -08:00
Xin Long
fb6df5a623 sctp: kfree_rcu asoc
In sctp_hash_transport/sctp_epaddr_lookup_transport, it dereferences
a transport's asoc under rcu_read_lock while asoc is freed not after
a grace period, which leads to a use-after-free panic.

This patch fixes it by calling kfree_rcu to make asoc be freed after
a grace period.

Note that only the asoc's memory is delayed to free in the patch, it
won't cause sk to linger longer.

Thanks Neil and Marcelo to make this clear.

Fixes: 7fda702f93 ("sctp: use new rhlist interface on sctp transport rhashtable")
Fixes: cd2b708750 ("sctp: check duplicate node before inserting a new transport")
Reported-by: syzbot+0b05d8aa7cb185107483@syzkaller.appspotmail.com
Reported-by: syzbot+aad231d51b1923158444@syzkaller.appspotmail.com
Suggested-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03 15:54:41 -08:00
Alexis Bauvin
6a6d6681ac l3mdev: add function to retreive upper master
Existing functions to retreive the l3mdev of a device did not walk the
master chain to find the upper master. This patch adds a function to
find the l3mdev, even indirect through e.g. a bridge:

+----------+
|          |
| vrf-blue |
|          |
+----+-----+
     |
     |
+----+-----+
|          |
| br-blue  |
|          |
+----+-----+
     |
     |
+----+-----+
|          |
|   eth0   |
|          |
+----------+

This will properly resolve the l3mdev of eth0 to vrf-blue.

Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03 14:15:26 -08:00
Alexis Bauvin
da5095d052 udp_tunnel: add config option to bind to a device
UDP tunnel sockets are always opened unbound to a specific device. This
patch allow the socket to be bound on a custom device, which
incidentally makes UDP tunnels VRF-aware if binding to an l3mdev.

Signed-off-by: Alexis Bauvin <abauvin@scaleway.com>
Reviewed-by: Amine Kherbouche <akherbouche@scaleway.com>
Tested-by: Amine Kherbouche <akherbouche@scaleway.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03 14:15:26 -08:00
Shalom Toledo
846e980a87 devlink: Add 'fw_load_policy' generic parameter
Many drivers load the device's firmware image during the initialization
flow either from the flash or from the disk. Currently this option is not
controlled by the user and the driver decides from where to load the
firmware image.

'fw_load_policy' gives the ability to control this option which allows the
user to choose between different loading policies supported by the driver.

This parameter can be useful while testing and/or debugging the device. For
example, testing a firmware bug fix.

Signed-off-by: Shalom Toledo <shalomt@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-12-03 13:55:43 -08:00
Florian Westphal
6ed5943f87 netfilter: nat: remove l4 protocol port rovers
This is a leftover from days where single-cpu systems were common:
Store last port used to resolve a clash to use it as a starting point when
the next conflict needs to be resolved.

When we have parallel attempt to connect to same address:port pair,
its likely that both cores end up computing the same "available" port,
as both use same starting port, and newly used ports won't become
visible to other cores until the conntrack gets confirmed later.

One of the cores then has to drop the packet at insertion time because
the chosen new tuple turns out to be in use after all.

Lets simplify this: remove port rover and use a pseudo-random starting
point.

Note that this doesn't make netfilter default to 'fully random' mode;
the 'rover' was only used if NAT could not reuse source port as-is.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-12-01 12:38:42 +01:00
Paolo Abeni
bf1c3ab8d3 net: reorder flowi_common fields to avoid holes
the flowi* structures are used and memsetted by server functions
in critical path. Currently flowi_common has a couple of holes that
we can eliminate reordering the struct fields. As a side effect,
both flowi4 and flowi6 shrink by 8 bytes.

Before:
pahole -EC flowi_common
struct flowi_common {
// ...
	/* size: 40, cachelines: 1, members: 10 */
	/* sum members: 32, holes: 1, sum holes: 4 */
	/* padding: 4 */
	/* last cacheline: 40 bytes */
};
pahole -EC flowi6
struct flowi6 {
// ...
        /* size: 88, cachelines: 2, members: 6 */
        /* padding: 4 */
        /* last cacheline: 24 bytes */
};
pahole -EC flowi4
struct flowi4 {
// ...
        /* size: 56, cachelines: 1, members: 4 */
        /* padding: 4 */
        /* last cacheline: 56 bytes */
};

After:
struct flowi_common {
// ...
	/* size: 32, cachelines: 1, members: 10 */
	/* last cacheline: 32 bytes */
};
struct flowi6 {
// ...
        /* size: 80, cachelines: 2, members: 6 */
        /* padding: 4 */
        /* last cacheline: 16 bytes */
};
struct flowi4 {
// ...
        /* size: 48, cachelines: 1, members: 4 */
        /* padding: 4 */
        /* last cacheline: 48 bytes */
};

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30 17:12:39 -08:00
Eric Dumazet
6015c71e65 tcp: md5: add tcp_md5_needed jump label
Most linux hosts never setup TCP MD5 keys. We can avoid a
cache line miss (accessing tp->md5ig_info) on RX and TX
using a jump label.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30 13:28:03 -08:00
Eric Dumazet
85bdf7db5b tcp: make tcp_space() aware of socket backlog
Jean-Louis Dupond reported poor iscsi TCP receive performance
that we tracked to backlog drops.

Apparently we fail to send window updates reflecting the
fact that we are under stress.

Note that we might lack a proper window increase when
backlog is fully processed, since __release_sock() clears
sk->sk_backlog.len _after_ all skbs have been processed.

This should not matter in practice. If we had a significant
load through socket backlog, we are in a dangerous
situation.

Reported-by: Jean-Louis Dupond <jean-louis@dupond.be>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Tested-by: Jean-Louis Dupond<jean-louis@dupond.be>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30 13:26:54 -08:00
Eric Dumazet
ebeef4bccc tcp: hint compiler about sack flows
Tell the compiler that most TCP flows are using SACK these days.

There is no need to add the unlikely() clause in tcp_is_reno(),
the compiler is able to infer it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30 13:26:53 -08:00
Edward Cree
91c459561b net/flow_dissector: correct comments on enum flow_dissector_key_id
There are no such structs flow_dissector_key_flow_vlan or
 flow_dissector_key_flow_tags, the actual structs used are struct
 flow_dissector_key_vlan and struct flow_dissector_key_tags.  So correct the
 comments against FLOW_DISSECTOR_KEY_VLAN, FLOW_DISSECTOR_KEY_FLOW_LABEL and
 FLOW_DISSECTOR_KEY_CVLAN to refer to those.

Signed-off-by: Edward Cree <ecree@solarflare.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-30 13:21:52 -08:00
David S. Miller
e561bb29b6 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Trivial conflict in net/core/filter.c, a locally computed
'sdif' is now an argument to the function.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-28 22:10:54 -08:00
David S. Miller
e9d8faf93d Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Disable BH while holding list spinlock in nf_conncount, from
   Taehee Yoo.

2) List corruption in nf_conncount, also from Taehee.

3) Fix race that results in leaving around an empty list node in
   nf_conncount, from Taehee Yoo.

4) Proper chain handling for inactive chains from the commit path,
   from Florian Westphal. This includes a selftest for this.

5) Do duplicate rule handles when replacing rules, also from Florian.

6) Remove net_exit path in xt_RATEEST that results in splat, from Taehee.

7) Possible use-after-free in nft_compat when releasing extensions.
   From Florian.

8) Memory leak in xt_hashlimit, from Taehee.

9) Call ip_vs_dst_notifier after ipv6_dev_notf, from Xin Long.

10) Fix cttimeout with udplite and gre, from Florian.

11) Preserve oif for IPv6 link-local generated traffic from mangle
    table, from Alin Nastac.

12) Missing error handling in masquerade notifiers, from Taehee Yoo.

13) Use mutex to protect registration/unregistration of masquerade
    extensions in order to prevent a race, from Taehee.

14) Incorrect condition check in tree_nodes_free(), also from Taehee.

15) Fix chain counter leak in rule replacement path, from Taehee.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-28 11:02:45 -08:00
Taehee Yoo
584eab291c netfilter: add missing error handling code for register functions
register_{netdevice/inetaddr/inet6addr}_notifier may return an error
value, this patch adds the code to handle these error paths.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-11-27 00:35:19 +01:00
David S. Miller
b1bf78bfb2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-11-24 17:01:43 -08:00
Petr Machata
d17d9f5e51 switchdev: Replace port obj add/del SDO with a notification
Drop switchdev_ops.switchdev_port_obj_add and _del. Drop the uses of
this field from all clients, which were migrated to use switchdev
notification in the previous patches.

Add a new function switchdev_port_obj_notify() that sends the switchdev
notifications SWITCHDEV_PORT_OBJ_ADD and _DEL.

Update switchdev_port_obj_del_now() to dispatch to this new function.
Drop __switchdev_port_obj_add() and update switchdev_port_obj_add()
likewise.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23 18:02:24 -08:00
Petr Machata
f30f0601eb switchdev: Add helpers to aid traversal through lower devices
After the transition from switchdev operations to notifier chain (which
will take place in following patches), the onus is on the driver to find
its own devices below possible layer of LAG or other uppers.

The logic to do so is fairly repetitive: each driver is looking for its
own devices among the lowers of the notified device. For those that it
finds, it calls a handler. To indicate that the event was handled,
struct switchdev_notifier_port_obj_info.handled is set. The differences
lie only in what constitutes an "own" device and what handler to call.

Therefore abstract this logic into two helpers,
switchdev_handle_port_obj_add() and switchdev_handle_port_obj_del(). If
a driver only supports physical ports under a bridge device, it will
simply avoid this layer of indirection.

One area where this helper diverges from the current switchdev behavior
is the case of mixed lowers, some of which are switchdev ports and some
of which are not. Previously, such scenario would fail with -EOPNOTSUPP.
The helper could do that for lowers for which the passed-in predicate
doesn't hold. That would however break the case that switchdev ports
from several different drivers are stashed under one master, a scenario
that switchdev currently happily supports. Therefore tolerate any and
all unknown netdevices, whether they are backed by a switchdev driver
or not.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23 18:02:24 -08:00
Petr Machata
aa4efe2139 switchdev: Add SWITCHDEV_PORT_OBJ_ADD, SWITCHDEV_PORT_OBJ_DEL
An offloading driver may need to have access to switchdev events on
ports that aren't directly under its control. An example is a VXLAN port
attached to a bridge offloaded by a driver. The driver needs to know
about VLANs configured on the VXLAN device. However the VXLAN device
isn't stashed between the bridge and a front-panel-port device (such as
is the case e.g. for LAG devices), so the usual switchdev ops don't
reach the driver.

VXLAN is likely not the only device type like this: in theory any L2
tunnel device that needs offloading will prompt requirement of this
sort. This falsifies the assumption that only the lower devices of a
front panel port need to be notified to achieve flawless offloading.

A way to fix this is to give up the notion of port object addition /
deletion as a switchdev operation, which assumes somewhat tight coupling
between the message producer and consumer. And instead send the message
over a notifier chain.

To that end, introduce two new switchdev notifier types,
SWITCHDEV_PORT_OBJ_ADD and SWITCHDEV_PORT_OBJ_DEL. These notifier types
communicate the same event as the corresponding switchdev op, except in
a form of a notification. struct switchdev_notifier_port_obj_info was
added to carry the fields that the switchdev op carries. An additional
field, handled, will be used to communicate back to switchdev that the
event has reached an interested party, which will be important for the
two-phase commit.

The two switchdev operations themselves are kept in place. Following
patches first convert individual clients to the notifier protocol, and
only then are the operations removed.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23 18:02:23 -08:00
Petr Machata
a93e3b1722 switchdev: Add a blocking notifier chain
In general one can't assume that a switchdev notifier is called in a
non-atomic context, and correspondingly, the switchdev notifier chain is
an atomic one.

However, port object addition and deletion messages are delivered from a
process context. Even the MDB addition messages, whose delivery is
scheduled from atomic context, are queued and the delivery itself takes
place in blocking context. For VLAN messages in particular, keeping the
blocking nature is important for error reporting.

Therefore introduce a blocking notifier chain and related service
functions to distribute the notifications for which a blocking context
can be assumed.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23 18:02:23 -08:00
Petr Machata
ec394af5ea switchdev: SWITCHDEV_OBJ_PORT_{VLAN, MDB}(): Sanitize
The two macros SWITCHDEV_OBJ_PORT_VLAN() and SWITCHDEV_OBJ_PORT_MDB()
expand to a container_of() call, yielding an appropriate container of
their sole argument. However, due to a name collision, the first
argument, i.e. the contained object pointer, is not the only one to get
expanded. The third argument, which is a structure member name, and
should be kept literal, gets expanded as well. The only safe way to use
these two macros is therefore to name the local variable passed to them
"obj".

To fix this, rename the sole argument of the two macros from
"obj" (which collides with the member name) to "OBJ". Additionally,
instead of passing "OBJ" to container_of() verbatim, parenthesize it, so
that a comma in the passed-in expression doesn't pollute the
container_of() invocation.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-23 18:02:23 -08:00
Mathias Krause
4a135e5389 xfrm_user: fix freeing of xfrm states on acquire
Commit 565f0fa902 ("xfrm: use a dedicated slab cache for struct
xfrm_state") moved xfrm state objects to use their own slab cache.
However, it missed to adapt xfrm_user to use this new cache when
freeing xfrm states.

Fix this by introducing and make use of a new helper for freeing
xfrm_state objects.

Fixes: 565f0fa902 ("xfrm: use a dedicated slab cache for struct xfrm_state")
Reported-by: Pan Bian <bianpan2016@163.com>
Cc: <stable@vger.kernel.org> # v4.18+
Signed-off-by: Mathias Krause <minipli@googlemail.com>
Acked-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-11-23 07:51:32 +01:00
Petr Machata
5728ae0d17 vxlan: Add hardware FDB learning
In order to allow devices to signal learning events to VXLAN, introduce
two new switchdev messages: SWITCHDEV_VXLAN_FDB_ADD_TO_BRIDGE and
SWITCHDEV_VXLAN_FDB_DEL_TO_BRIDGE.

Listen to these notifications in the vxlan driver. The FDB entries
learned this way have an NTF_EXT_LEARNED flag, and only entries marked
as such can be unlearned by the _DEL_ event. They are also immediately
marked as offloaded. This is the same behavior that the bridge driver
observes.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-21 17:10:31 -08:00
Petr Machata
45598c1cee vxlan: Mark user-added FDB entries
The VXLAN driver needs to differentiate between FDB entries learned by
the VXLAN driver, and those added by the user. The latter ones shouldn't
be taken over by external learning events. This is in accordance with
bridge behavior.

Therefore, extend the flags bitfield to 16 bits and add a new private
NTF flag to mark the user-added entries.

This seems preferable to adding a dedicated boolean, because passing the
flag, unlike passing e.g. a true, makes it clear what the meaning of the
bit is.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-21 17:10:30 -08:00
Jakub Kicinski
068ceb3555 net: sched: cls_u32: add res to offload information
In case of egress offloads the class/flowid assigned by the filter
may be very important for offloaded Qdisc selection.  Provide this
info to drivers.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19 18:53:46 -08:00
Jakub Kicinski
e49efd5288 net: sched: gred: support reporting stats from offloads
Allow drivers which offload GRED to report back statistics.  Since
A lot of GRED stats is fairly ad hoc in nature pass to drivers the
standard struct gnet_stats_basic/gnet_stats_queue pairs, and
untangle the values in the core.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19 18:53:46 -08:00
Jakub Kicinski
890d8d23ec net: sched: gred: add basic Qdisc offload
Add basic offload for the GRED Qdisc.  Inform the drivers any
time Qdisc or virtual queue configuration changes.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19 18:53:45 -08:00
Xin Long
69fec325a6 Revert "sctp: remove sctp_transport_pmtu_check"
This reverts commit 22d7be267e.

The dst's mtu in transport can be updated by a non sctp place like
in xfrm where the MTU information didn't get synced between asoc,
transport and dst, so it is still needed to do the pmtu check
in sctp_packet_config.

Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19 12:42:47 -08:00
Xin Long
88ee48c1f3 sctp: rename enum sctp_event to sctp_event_type
sctp_event is a structure name defined in RFC for sockopt
SCTP_EVENT. To avoid the conflict, rename it.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19 12:25:43 -08:00
Xin Long
a1e3a0590f sctp: add subscribe per asoc
The member subscribe should be per asoc, so that sockopt SCTP_EVENT
in the next patch can subscribe a event from one asoc only.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19 12:25:43 -08:00
Xin Long
2cc0eeb676 sctp: define subscribe in sctp_sock as __u16
The member subscribe in sctp_sock is used to indicate to which of
the events it is subscribed, more like a group of flags. So it's
better to be defined as __u16 (2 bytpes), instead of struct
sctp_event_subscribe (13 bytes).

Note that sctp_event_subscribe is an UAPI struct, used on sockopt
calls, and thus it will not be removed. This patch only changes
the internal storage of the flags.

Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-19 12:25:43 -08:00
David S. Miller
f2be6d710d Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-11-19 10:55:00 -08:00
Eric Dumazet
001c96db01 net: align gnet_stats_basic_cpu struct
This structure is small (12 or 16 bytes depending on 64bit
or 32bit kernels), but we do not want it spanning two cache lines.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-17 21:37:29 -08:00
Yafang Shao
213d7767af tcp: clean up STATE_TRACE
Currently we can use bpf or tcp tracepoint to conveniently trace the tcp
state transition at the run time.
So we don't need to do this stuff at the compile time anymore.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-16 20:28:00 -08:00
Cong Wang
6ab6dfa6bb net: get rid of __tcp_checksum_complete()
__tcp_checksum_complete() is 100% same with __skb_checksum_complete()
and there is no other caller except tcp_checksum_complete().
So, just use __skb_checksum_complete() there.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-15 11:41:22 -08:00
David Howells
7150ceaacb rxrpc: Fix life check
The life-checking function, which is used by kAFS to make sure that a call
is still live in the event of a pending signal, only samples the received
packet serial number counter; it doesn't actually provoke a change in the
counter, rather relying on the server to happen to give us a packet in the
time window.

Fix this by adding a function to force a ping to be transmitted.

kAFS then keeps track of whether there's been a stall, and if so, uses the
new function to ping the server, resetting the timeout to allow the reply
to come back.

If there's a stall, a ping and the call is *still* stalled in the same
place after another period, then the call will be aborted.

Fixes: bc5e3a546d ("rxrpc: Use MSG_WAITALL to tell sendmsg() to temporarily ignore signals")
Fixes: f4d15fb6f9 ("rxrpc: Provide functions for allowing cleaner handling of signals")
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-15 11:35:40 -08:00
Jakub Kicinski
c0b7490b19 net: sched: red: notify drivers about RED's limit parameter
RED qdisc's limit parameter changes the behaviour of the qdisc,
for instance if it's set to 0 qdisc will drop all the packets.

When replace operation happens and parameter is set to non-0
a new fifo qdisc will be instantiated and replace the old child
qdisc which will be destroyed.

Drivers need to know the parameter, even if they don't impose
the actual limit to be able to reliably reconstruct the Qdisc
hierarchy.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-14 08:51:28 -08:00
Jakub Kicinski
d577a3d279 net: sched: mq: offload a graft notification
Drivers offloading Qdiscs should have reasonable certainty
the offloaded behaviour matches the SW path.  This is impossible
if the driver does not know about all Qdiscs or when Qdiscs move
and are reused.  Send a graft notification from MQ.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-14 08:51:28 -08:00
Jakub Kicinski
bf2a752bea net: sched: red: offload a graft notification
Drivers offloading Qdiscs should have reasonable certainty
the offloaded behaviour matches the SW path.  This is impossible
if the driver does not know about all Qdiscs or when Qdiscs move
and are reused.  Send a graft notification from RED.  The drivers
are expected to simply stop offloading the Qdisc, if a non-standard
child is ever grafted onto it.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-14 08:51:28 -08:00
Jakub Kicinski
98b0e5f684 net: sched: provide notification for graft on root
Drivers are currently not notified when a Qdisc is grafted as root.
This requires special casing Qdiscs added with parent = TC_H_ROOT in
the driver.  Also there is no notification sent to the driver when
an existing Qdisc is grafted as root.

Add this very simple notifications, drivers should now be able to
track their Qdisc tree fully.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: John Hurley <john.hurley@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-14 08:51:27 -08:00
Xin Long
6ba8457402 sctp: process sk_reuseport in sctp_get_port_local
When socks' sk_reuseport is set, the same port and address are allowed
to be bound into these socks who have the same uid.

Note that the difference from sk_reuse is that it allows multiple socks
to listen on the same port and address.

Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-11-12 09:09:51 -08:00