Commit Graph

54964 Commits

Author SHA1 Message Date
Vlad Buslov
c266f64dbf net: sched: protect block state with mutex
Currently, tcf_block doesn't use any synchronization mechanisms to protect
critical sections that manage lifetime of its chains. block->chain_list and
multiple variables in tcf_chain that control its lifetime assume external
synchronization provided by global rtnl lock. Converting chain reference
counting to atomic reference counters is not possible because cls API uses
multiple counters and flags to control chain lifetime, so all of them must
be synchronized in chain get/put code.

Use single per-block lock to protect block data and manage lifetime of all
chains on the block. Always take block->lock when accessing chain_list.
Chain get and put modify chain lifetime-management data and parent block's
chain_list, so take the lock in these functions. Verify block->lock state
with assertions in functions that expect to be called with the lock taken
and are called from multiple places. Take block->lock when accessing
filter_chain_list.

In order to allow parallel update of rules on single block, move all calls
to classifiers outside of critical sections protected by new block->lock.
Rearrange chain get and put functions code to only access protected chain
data while holding block lock:
- Rearrange code to only access chain reference counter and chain action
  reference counter while holding block lock.
- Extract code that requires block->lock from tcf_chain_destroy() into
  standalone tcf_chain_destroy() function that is called by
  __tcf_chain_put() in same critical section that changes chain reference
  counters.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 13:41:32 -05:00
Kal Conley
fc62814d69 net/packet: fix 4gb buffer limit due to overflow check
When calculating rb->frames_per_block * req->tp_block_nr the result
can overflow. Check it for overflow without limiting the total buffer
size to UINT_MAX.

This change fixes support for packet ring buffers >= UINT_MAX.

Fixes: 8f8d28e4d6 ("net/packet: fix overflow in check for tp_frame_nr")
Signed-off-by: Kal Conley <kal.conley@dectris.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 13:37:23 -05:00
Konstantin Khlebnikov
1ec17dbd90 inet_diag: fix reporting cgroup classid and fallback to priority
Field idiag_ext in struct inet_diag_req_v2 used as bitmap of requested
extensions has only 8 bits. Thus extensions starting from DCTCPINFO
cannot be requested directly. Some of them included into response
unconditionally or hook into some of lower 8 bits.

Extension INET_DIAG_CLASS_ID has not way to request from the beginning.

This patch bundle it with INET_DIAG_TCLASS (ipv6 tos), fixes space
reservation, and documents behavior for other extensions.

Also this patch adds fallback to reporting socket priority. This filed
is more widely used for traffic classification because ipv4 sockets
automatically maps TOS to priority and default qdisc pfifo_fast knows
about that. But priority could be changed via setsockopt SO_PRIORITY so
INET_DIAG_TOS isn't enough for predicting class.

Also cgroup2 obsoletes net_cls classid (it always zero), but we cannot
reuse this field for reporting cgroup2 id because it is 64-bit (ino+gen).

So, after this patch INET_DIAG_CLASS_ID will report socket priority
for most common setup when net_cls isn't set and/or cgroup2 in use.

Fixes: 0888e372c3 ("net: inet: diag: expose sockets cgroup classid")
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 13:35:57 -05:00
Eric Dumazet
4ffcbfac60 batman-adv: fix uninit-value in batadv_interface_tx()
KMSAN reported batadv_interface_tx() was possibly using a
garbage value [1]

batadv_get_vid() does have a pskb_may_pull() call
but batadv_interface_tx() does not actually make sure
this did not fail.

[1]
BUG: KMSAN: uninit-value in batadv_interface_tx+0x908/0x1e40 net/batman-adv/soft-interface.c:231
CPU: 0 PID: 10006 Comm: syz-executor469 Not tainted 4.20.0-rc7+ #5
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x173/0x1d0 lib/dump_stack.c:113
 kmsan_report+0x12e/0x2a0 mm/kmsan/kmsan.c:613
 __msan_warning+0x82/0xf0 mm/kmsan/kmsan_instr.c:313
 batadv_interface_tx+0x908/0x1e40 net/batman-adv/soft-interface.c:231
 __netdev_start_xmit include/linux/netdevice.h:4356 [inline]
 netdev_start_xmit include/linux/netdevice.h:4365 [inline]
 xmit_one net/core/dev.c:3257 [inline]
 dev_hard_start_xmit+0x607/0xc40 net/core/dev.c:3273
 __dev_queue_xmit+0x2e42/0x3bc0 net/core/dev.c:3843
 dev_queue_xmit+0x4b/0x60 net/core/dev.c:3876
 packet_snd net/packet/af_packet.c:2928 [inline]
 packet_sendmsg+0x8306/0x8f30 net/packet/af_packet.c:2953
 sock_sendmsg_nosec net/socket.c:621 [inline]
 sock_sendmsg net/socket.c:631 [inline]
 __sys_sendto+0x8c4/0xac0 net/socket.c:1788
 __do_sys_sendto net/socket.c:1800 [inline]
 __se_sys_sendto+0x107/0x130 net/socket.c:1796
 __x64_sys_sendto+0x6e/0x90 net/socket.c:1796
 do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
 entry_SYSCALL_64_after_hwframe+0x63/0xe7
RIP: 0033:0x441889
Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 bb 10 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007ffdda6fd468 EFLAGS: 00000216 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 0000000000441889
RDX: 000000000000000e RSI: 00000000200000c0 RDI: 0000000000000003
RBP: 0000000000000003 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000216 R12: 00007ffdda6fd4c0
R13: 00007ffdda6fd4b0 R14: 0000000000000000 R15: 0000000000000000

Uninit was created at:
 kmsan_save_stack_with_flags mm/kmsan/kmsan.c:204 [inline]
 kmsan_internal_poison_shadow+0x92/0x150 mm/kmsan/kmsan.c:158
 kmsan_kmalloc+0xa6/0x130 mm/kmsan/kmsan_hooks.c:176
 kmsan_slab_alloc+0xe/0x10 mm/kmsan/kmsan_hooks.c:185
 slab_post_alloc_hook mm/slab.h:446 [inline]
 slab_alloc_node mm/slub.c:2759 [inline]
 __kmalloc_node_track_caller+0xe18/0x1030 mm/slub.c:4383
 __kmalloc_reserve net/core/skbuff.c:137 [inline]
 __alloc_skb+0x309/0xa20 net/core/skbuff.c:205
 alloc_skb include/linux/skbuff.h:998 [inline]
 alloc_skb_with_frags+0x1c7/0xac0 net/core/skbuff.c:5220
 sock_alloc_send_pskb+0xafd/0x10e0 net/core/sock.c:2083
 packet_alloc_skb net/packet/af_packet.c:2781 [inline]
 packet_snd net/packet/af_packet.c:2872 [inline]
 packet_sendmsg+0x661a/0x8f30 net/packet/af_packet.c:2953
 sock_sendmsg_nosec net/socket.c:621 [inline]
 sock_sendmsg net/socket.c:631 [inline]
 __sys_sendto+0x8c4/0xac0 net/socket.c:1788
 __do_sys_sendto net/socket.c:1800 [inline]
 __se_sys_sendto+0x107/0x130 net/socket.c:1796
 __x64_sys_sendto+0x6e/0x90 net/socket.c:1796
 do_syscall_64+0xbc/0xf0 arch/x86/entry/common.c:291
 entry_SYSCALL_64_after_hwframe+0x63/0xe7

Fixes: c6c8fea297 ("net: Add batman-adv meshing protocol")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc:	Marek Lindner <mareklindner@neomailbox.ch>
Cc:	Simon Wunderlich <sw@simonwunderlich.de>
Cc:	Antonio Quartulli <a@unstable.cc>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 13:30:43 -05:00
Li RongQing
d1f20798a1 ipv6: propagate genlmsg_reply return code
genlmsg_reply can fail, so propagate its return code

Fixes: 915d7e5e59 ("ipv6: sr: add code base for control plane support of SR-IPv6")
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 12:36:02 -05:00
Vakul Garg
c0ab4732d4 net/tls: Do not use async crypto for non-data records
Addition of tls1.3 support broke tls1.2 handshake when async crypto
accelerator is used. This is because the record type for non-data
records is not propagated to user application. Also when async
decryption happens, the decryption does not stop when two different
types of records get dequeued and submitted for decryption. To address
it, we decrypt tls1.2 non-data records in synchronous way. We check
whether the record we just processed has same type as the previous one
before checking for async condition and jumping to dequeue next record.

Fixes: 130b392c6c ("net: tls: Add tls 1.3 support")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 12:35:14 -05:00
Li RongQing
fde55ea74c devlink: use direct return of genlmsg_reply
This can remove redundant check

Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 12:34:19 -05:00
Vasundhara Volam
da203dfa89 Revert "devlink: Add a generic wake_on_lan port parameter"
This reverts commit b639583f9e.

As per discussion with Jakub Kicinski and Michal Kubecek,
this will be better addressed by soon-too-come ethtool netlink
API with additional indication that given configuration request
is supposed to be persisted.

Also, remove the parameter support from bnxt_en driver.

Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Michal Kubecek <mkubecek@suse.cz>
Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Vasundhara Volam <vasundhara-v.volam@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 12:13:22 -05:00
Karsten Graul
81cf643052 net/smc: check port_idx of ib event
For robustness protect of higher port numbers than expected to avoid
setting bits behind our port_event_mask. In case of an DEVICE_FATAL
event all ports must be checked. The IB_EVENT_GID_CHANGE event is
provided in the global event handler, so handle it there. And handle a
QP_FATAL event instead of an DEVICE_FATAL event in the qp handler.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 11:59:45 -05:00
Karsten Graul
e78b262226 net/smc: check connections in smc_lgr_free_work
Remove the shortcut that smc_lgr_free() would skip the check for
existing connections when the link group is not in the link group list.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 11:59:45 -05:00
Karsten Graul
cf0cfe5358 net/smc: reduce amount of status updates to peer
In smc_cdc_msg_recv_action() the received cdc message is evaluated.
To reduce the number of messaged triggered by this evaluation the logic
is streamlined. For the write_blocked condition we do not need to send
a response immediately. The remaining conditions can be put together
into one if clause.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 11:59:45 -05:00
Karsten Graul
16297d1439 net/smc: no delay for free tx buffer wait
When no free transfer buffers are available then a work to call
smc_tx_work() is scheduled. Set the schedule delay to zero, because for
the out-of-buffers condition the work can start immediately and will
block in the called function smc_wr_tx_get_free_slot(), waiting for free
buffers.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 11:59:45 -05:00
Karsten Graul
5bc056d8d0 net/smc: move wake up of close waiter
Move the call to smc_close_wake_tx_prepared() (which wakes up a possibly
waiting close processing that might wait for 'all data sent') to
smc_tx_sndbuf_nonempty() (which is the main function to send data).

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 11:59:44 -05:00
Karsten Graul
4dff63c25e net/smc: reset cursor update required flag
When an updated rx_cursor_confirmed field was sent to the peer then
reset the cons_curs_upd_req flag. And remove the duplicate reset and
cursor update in smc_tx_consumer_update().

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 11:59:44 -05:00
David S. Miller
1ea06107b0 Just a few fixes:
* aggregation session teardown with internal TXQs was
    continuing to send some frames marked as aggregation,
    fix from Ilan
  * IBSS join was missed during firmware restart, should
    such a thing happen
  * speculative execution based on the return value of
    cfg80211_classify8021d() - which is controlled by the
    sender of the packet - could be problematic in some
    code using it, prevent it
  * a few peer measurement fixes
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAlxisjgACgkQB8qZga/f
 l8T5aw//WUuc0SIHWvEAMVjndz6uz3y0nLiurVEwi/kMggRx9HYQZpp0P4wyQEDK
 KZ8tOpzvzfeE7hJe8Tz/6mrvkgxHOGJyxTCikB03z3lI+oRbqrfLJOONCqKafWua
 T/9E2LZmwz++8eodShWcVgcm3S0VAZCjFXybNQhRT3BT4rM3v+px8cVw8CtbW2G/
 IOLrgbkr9c8DuaJltO287/y8vjUJZTZHbFcYDFKdNmo+gX/zpKXFSjWrB3zkMdZr
 66bsiTJ7OP4efpPl1WY4Wbmo4yRt+gkN4aMmlI0ssa04IoQuET0SoyAPXR/T0KL5
 rEt1jO2PcNOLBND4QHHfjzT17zmqZaI30/1xnOuo6RE7b7plQEglZn/ssZcmchSD
 UPLShFx/bule+4+hyRhn3FdhMKkrUL+6SdYVPs8p6AneGV1auH22SPJB8GDJfkjM
 qWm8SKgvph+A2JzJzWzcUFXZoOlsnxP+hYdwseWiPimK/k63Afe1KCE2Is8PgLwZ
 v2o3qNmerxM2jwxGYa8dk5ezfvLAqYM1X/qhgodguUQKnFCAnuIxjoxR8o3uuiwe
 bbynlpN5NG8tG32SY1y5fU7TDBeyBZO9BtpHhyPu3p/tIDX63XZApoe1fy/V/q6M
 y+ETFzDL18yrfivplrSSa3i2hej+sjKmpPxHsgFd6I++7CdMZcI=
 =NssM
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2019-02-12' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just a few fixes:
 * aggregation session teardown with internal TXQs was
   continuing to send some frames marked as aggregation,
   fix from Ilan
 * IBSS join was missed during firmware restart, should
   such a thing happen
 * speculative execution based on the return value of
   cfg80211_classify8021d() - which is controlled by the
   sender of the packet - could be problematic in some
   code using it, prevent it
 * a few peer measurement fixes
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-12 11:43:19 -05:00
Björn Töpel
1e405c1a3f xsk: do not remove umem from netdevice on fall-back to copy-mode
Commit c9b47cc1fa ("xsk: fix bug when trying to use both copy and
zero-copy on one queue id") stores the umem into the netdev._rx
struct. However, the patch incorrectly removed the umem from the
netdev._rx struct when user-space passed "best-effort" mode
(i.e. select the fastest possible option available), and zero-copy
mode was not available. This commit fixes that.

Fixes: c9b47cc1fa ("xsk: fix bug when trying to use both copy and zero-copy on one queue id")
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-12 17:04:45 +01:00
Andrea Claudi
098e13f5b2 ipvs: fix dependency on nf_defrag_ipv6
ipvs relies on nf_defrag_ipv6 module to manage IPv6 fragmentation,
but lacks proper Kconfig dependencies and does not explicitly
request defrag features.

As a result, if netfilter hooks are not loaded, when IPv6 fragmented
packet are handled by ipvs only the first fragment makes through.

Fix it properly declaring the dependency on Kconfig and registering
netfilter hooks on ip_vs_add_service() and ip_vs_new_dest().

Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Andrea Claudi <aclaudi@redhat.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 11:24:01 +01:00
Chieh-Min Wang
13f5251fd1 netfilter: conntrack: fix cloned unconfirmed skb->_nfct race in __nf_conntrack_confirm
For bridge(br_flood) or broadcast/multicast packets, they could clone
skb with unconfirmed conntrack which break the rule that unconfirmed
skb->_nfct is never shared.  With nfqueue running on my system, the race
can be easily reproduced with following warning calltrace:

[13257.707525] CPU: 0 PID: 12132 Comm: main Tainted: P        W       4.4.60 #7744
[13257.707568] Hardware name: Qualcomm (Flattened Device Tree)
[13257.714700] [<c021f6dc>] (unwind_backtrace) from [<c021bce8>] (show_stack+0x10/0x14)
[13257.720253] [<c021bce8>] (show_stack) from [<c0449e10>] (dump_stack+0x94/0xa8)
[13257.728240] [<c0449e10>] (dump_stack) from [<c022a7e0>] (warn_slowpath_common+0x94/0xb0)
[13257.735268] [<c022a7e0>] (warn_slowpath_common) from [<c022a898>] (warn_slowpath_null+0x1c/0x24)
[13257.743519] [<c022a898>] (warn_slowpath_null) from [<c06ee450>] (__nf_conntrack_confirm+0xa8/0x618)
[13257.752284] [<c06ee450>] (__nf_conntrack_confirm) from [<c0772670>] (ipv4_confirm+0xb8/0xfc)
[13257.761049] [<c0772670>] (ipv4_confirm) from [<c06e7a60>] (nf_iterate+0x48/0xa8)
[13257.769725] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0)
[13257.777108] [<c06e7af0>] (nf_hook_slow) from [<c07f20b4>] (br_nf_post_routing+0x274/0x31c)
[13257.784486] [<c07f20b4>] (br_nf_post_routing) from [<c06e7a60>] (nf_iterate+0x48/0xa8)
[13257.792556] [<c06e7a60>] (nf_iterate) from [<c06e7af0>] (nf_hook_slow+0x30/0xb0)
[13257.800458] [<c06e7af0>] (nf_hook_slow) from [<c07e5580>] (br_forward_finish+0x94/0xa4)
[13257.808010] [<c07e5580>] (br_forward_finish) from [<c07f22ac>] (br_nf_forward_finish+0x150/0x1ac)
[13257.815736] [<c07f22ac>] (br_nf_forward_finish) from [<c06e8df0>] (nf_reinject+0x108/0x170)
[13257.824762] [<c06e8df0>] (nf_reinject) from [<c06ea854>] (nfqnl_recv_verdict+0x3d8/0x420)
[13257.832924] [<c06ea854>] (nfqnl_recv_verdict) from [<c06e940c>] (nfnetlink_rcv_msg+0x158/0x248)
[13257.841256] [<c06e940c>] (nfnetlink_rcv_msg) from [<c06e5564>] (netlink_rcv_skb+0x54/0xb0)
[13257.849762] [<c06e5564>] (netlink_rcv_skb) from [<c06e4ec8>] (netlink_unicast+0x148/0x23c)
[13257.858093] [<c06e4ec8>] (netlink_unicast) from [<c06e5364>] (netlink_sendmsg+0x2ec/0x368)
[13257.866348] [<c06e5364>] (netlink_sendmsg) from [<c069fb8c>] (sock_sendmsg+0x34/0x44)
[13257.874590] [<c069fb8c>] (sock_sendmsg) from [<c06a03dc>] (___sys_sendmsg+0x1ec/0x200)
[13257.882489] [<c06a03dc>] (___sys_sendmsg) from [<c06a11c8>] (__sys_sendmsg+0x3c/0x64)
[13257.890300] [<c06a11c8>] (__sys_sendmsg) from [<c0209b40>] (ret_fast_syscall+0x0/0x34)

The original code just triggered the warning but do nothing. It will
caused the shared conntrack moves to the dying list and the packet be
droppped (nf_ct_resolve_clash returns NF_DROP for dying conntrack).

- Reproduce steps:

+----------------------------+
|          br0(bridge)       |
|                            |
+-+---------+---------+------+
  | eth0|   | eth1|   | eth2|
  |     |   |     |   |     |
  +--+--+   +--+--+   +---+-+
     |         |          |
     |         |          |
  +--+-+     +-+--+    +--+-+
  | PC1|     | PC2|    | PC3|
  +----+     +----+    +----+

iptables -A FORWARD -m mark --mark 0x1000000/0x1000000 -j NFQUEUE --queue-num 100 --queue-bypass

ps: Our nfq userspace program will set mark on packets whose connection
has already been processed.

PC1 sends broadcast packets simulated by hping3:

hping3 --rand-source --udp 192.168.1.255 -i u100

- Broadcast racing flow chart is as follow:

br_handle_frame
  BR_HOOK(NFPROTO_BRIDGE, NF_BR_PRE_ROUTING, br_handle_frame_finish)
  // skb->_nfct (unconfirmed conntrack) is constructed at PRE_ROUTING stage
  br_handle_frame_finish
    // check if this packet is broadcast
    br_flood_forward
      br_flood
        list_for_each_entry_rcu(p, &br->port_list, list) // iterate through each port
          maybe_deliver
            deliver_clone
              skb = skb_clone(skb)
              __br_forward
                BR_HOOK(NFPROTO_BRIDGE, NF_BR_FORWARD,...)
                // queue in our nfq and received by our userspace program
                // goto __nf_conntrack_confirm with process context on CPU 1
    br_pass_frame_up
      BR_HOOK(NFPROTO_BRIDGE, NF_BR_LOCAL_IN,...)
      // goto __nf_conntrack_confirm with softirq context on CPU 0

Because conntrack confirm can happen at both INPUT and POSTROUTING
stage.  So with NFQUEUE running, skb->_nfct with the same unconfirmed
conntrack could race on different core.

This patch fixes a repeating kernel splat, now it is only displayed
once.

Signed-off-by: Chieh-Min Wang <chiehminw@synology.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 11:14:51 +01:00
Tuong Lien
91986ee166 tipc: fix link session and re-establish issues
When a link endpoint is re-created (e.g. after a node reboot or
interface reset), the link session number is varied by random, the peer
endpoint will be synced with this new session number before the link is
re-established.

However, there is a shortcoming in this mechanism that can lead to the
link never re-established or faced with a failure then. It happens when
the peer endpoint is ready in ESTABLISHING state, the 'peer_session' as
well as the 'in_session' flag have been set, but suddenly this link
endpoint leaves. When it comes back with a random session number, there
are two situations possible:

1/ If the random session number is larger than (or equal to) the
previous one, the peer endpoint will be updated with this new session
upon receipt of a RESET_MSG from this endpoint, and the link can be re-
established as normal. Otherwise, all the RESET_MSGs from this endpoint
will be rejected by the peer. In turn, when this link endpoint receives
one ACTIVATE_MSG from the peer, it will move to ESTABLISHED and start
to send STATE_MSGs, but again these messages will be dropped by the
peer due to wrong session.
The peer link endpoint can still become ESTABLISHED after receiving a
traffic message from this endpoint (e.g. a BCAST_PROTOCOL or
NAME_DISTRIBUTOR), but since all the STATE_MSGs are invalid, the link
will be forced down sooner or later!

Even in case the random session number is larger than the previous one,
it can be that the ACTIVATE_MSG from the peer arrives first, and this
link endpoint moves quickly to ESTABLISHED without sending out any
RESET_MSG yet. Consequently, the peer link will not be updated with the
new session number, and the same link failure scenario as above will
happen.

2/ Another situation can be that, the peer link endpoint was reset due
to any reasons in the meantime, its link state was set to RESET from
ESTABLISHING but still in session, i.e. the 'in_session' flag is not
reset...
Now, if the random session number from this endpoint is less than the
previous one, all the RESET_MSGs from this endpoint will be rejected by
the peer. In the other direction, when this link endpoint receives a
RESET_MSG from the peer, it moves to ESTABLISHING and starts to send
ACTIVATE_MSGs, but all these messages will be rejected by the peer too.
As a result, the link cannot be re-established but gets stuck with this
link endpoint in state ESTABLISHING and the peer in RESET!

Solution:

===========

This link endpoint should not go directly to ESTABLISHED when getting
ACTIVATE_MSG from the peer which may belong to the old session if the
link was re-created. To ensure the session to be correct before the
link is re-established, the peer endpoint in ESTABLISHING state will
send back the last session number in ACTIVATE_MSG for a verification at
this endpoint. Then, if needed, a new and more appropriate session
number will be regenerated to force a re-synch first.

In addition, when a link in ESTABLISHING state is reset, its state will
move to RESET according to the link FSM, along with resetting the
'in_session' flag (and the other data) as a normal link reset, it will
also be deleted if requested.

The solution is backward compatible.

Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11 21:26:20 -08:00
Jakub Kicinski
68750561dd devlink: don't allocate attrs on the stack
Number of devlink attributes has grown over 128, causing the
following warning:

../net/core/devlink.c: In function ‘devlink_nl_cmd_region_read_dumpit’:
../net/core/devlink.c:3740:1: warning: the frame size of 1064 bytes is larger than 1024 bytes [-Wframe-larger-than=]
 }
  ^

Since the number of attributes is only going to grow allocate
the array dynamically.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11 20:39:56 -08:00
Jakub Kicinski
3ceb745baa devlink: fix condition for compat device info
We need the port to be both ethernet and have the rigth netdev,
not one or the other.

Fixes: ddb6e99e2d ("ethtool: add compat for devlink info")
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11 20:39:55 -08:00
Zhiqiang Liu
e75913c93f net: fix IPv6 prefix route residue
Follow those steps:
 # ip addr add 2001:123::1/32 dev eth0
 # ip addr add 2001:123:456::2/64 dev eth0
 # ip addr del 2001:123::1/32 dev eth0
 # ip addr del 2001:123:456::2/64 dev eth0
and then prefix route of 2001:123::1/32 will still exist.

This is because ipv6_prefix_equal in check_cleanup_prefix_route
func does not check whether two IPv6 addresses have the same
prefix length. If the prefix of one address starts with another
shorter address prefix, even though their prefix lengths are
different, the return value of ipv6_prefix_equal is true.

Here I add a check of whether two addresses have the same prefix
to decide whether their prefixes are equal.

Fixes: 5b84efecb7 ("ipv6 addrconf: don't cleanup prefix route for IFA_F_NOPREFIXROUTE")
Signed-off-by: Zhiqiang Liu <liuzhiqiang26@huawei.com>
Reported-by: Wenhao Zhang <zhangwenhao8@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11 20:36:50 -08:00
Hoang Le
7384b538d3 tipc: fix skb may be leaky in tipc_link_input
When we free skb at tipc_data_input, we return a 'false' boolean.
Then, skb passed to subcalling tipc_link_input in tipc_link_rcv,

<snip>
1303 int tipc_link_rcv:
...
1354    if (!tipc_data_input(l, skb, l->inputq))
1355        rc |= tipc_link_input(l, skb, l->inputq);
</snip>

Fix it by simple changing to a 'true' boolean when skb is being free-ed.
Then, tipc_link_rcv will bypassed to subcalling tipc_link_input as above
condition.

Acked-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <maloy@donjonn.com>
Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11 18:35:52 -08:00
Gustavo A. R. Silva
6ca64ef37d netfilter: xt_recent: Use struct_size() in kvzalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    void *entry[];
};

size = sizeof(struct foo) + count * sizeof(void *);
instance = alloc(size, GFP_KERNEL)

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

size = struct_size(instance, entry, count);
instance = alloc(size, GFP_KERNEL)

Notice that, in this case, variable sz is not necessary, hence it is
removed.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 00:39:39 +01:00
Gustavo A. R. Silva
6fde9df6b7 ipvs: Use struct_size() helper
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    struct boo entry[];
};

size = sizeof(struct foo) + count * sizeof(struct boo);
instance = alloc(size, GFP_KERNEL)

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

size = struct_size(instance, entry, count);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Acked-by: Simon Horman <horms+renesas@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 00:39:39 +01:00
Colin Ian King
48ab807c79 netfilter: conntrack: fix indentation issue
A statement in an if block is not indented correctly. Fix this.

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 00:39:39 +01:00
Francesco Ruggeri
8d29d16d21 netfilter: compat: initialize all fields in xt_init
If a non zero value happens to be in xt[NFPROTO_BRIDGE].cur at init
time, the following panic can be caused by running

% ebtables -t broute -F BROUTING

from a 32-bit user level on a 64-bit kernel. This patch replaces
kmalloc_array with kcalloc when allocating xt.

[  474.680846] BUG: unable to handle kernel paging request at 0000000009600920
[  474.687869] PGD 2037006067 P4D 2037006067 PUD 2038938067 PMD 0
[  474.693838] Oops: 0000 [#1] SMP
[  474.697055] CPU: 9 PID: 4662 Comm: ebtables Kdump: loaded Not tainted 4.19.17-11302235.AroraKernelnext.fc18.x86_64 #1
[  474.707721] Hardware name: Supermicro X9DRT/X9DRT, BIOS 3.0 06/28/2013
[  474.714313] RIP: 0010:xt_compat_calc_jump+0x2f/0x63 [x_tables]
[  474.720201] Code: 40 0f b6 ff 55 31 c0 48 6b ff 70 48 03 3d dc 45 00 00 48 89 e5 8b 4f 6c 4c 8b 47 60 ff c9 39 c8 7f 2f 8d 14 08 d1 fa 48 63 fa <41> 39 34 f8 4c 8d 0c fd 00 00 00 00 73 05 8d 42 01 eb e1 76 05 8d
[  474.739023] RSP: 0018:ffffc9000943fc58 EFLAGS: 00010207
[  474.744296] RAX: 0000000000000000 RBX: ffffc90006465000 RCX: 0000000002580249
[  474.751485] RDX: 00000000012c0124 RSI: fffffffff7be17e9 RDI: 00000000012c0124
[  474.758670] RBP: ffffc9000943fc58 R08: 0000000000000000 R09: ffffffff8117cf8f
[  474.765855] R10: ffffc90006477000 R11: 0000000000000000 R12: 0000000000000001
[  474.773048] R13: 0000000000000000 R14: ffffc9000943fcb8 R15: ffffc9000943fcb8
[  474.780234] FS:  0000000000000000(0000) GS:ffff88a03f840000(0063) knlGS:00000000f7ac7700
[  474.788612] CS:  0010 DS: 002b ES: 002b CR0: 0000000080050033
[  474.794632] CR2: 0000000009600920 CR3: 0000002037422006 CR4: 00000000000606e0
[  474.802052] Call Trace:
[  474.804789]  compat_do_replace+0x1fb/0x2a3 [ebtables]
[  474.810105]  compat_do_ebt_set_ctl+0x69/0xe6 [ebtables]
[  474.815605]  ? try_module_get+0x37/0x42
[  474.819716]  compat_nf_setsockopt+0x4f/0x6d
[  474.824172]  compat_ip_setsockopt+0x7e/0x8c
[  474.828641]  compat_raw_setsockopt+0x16/0x3a
[  474.833220]  compat_sock_common_setsockopt+0x1d/0x24
[  474.838458]  __compat_sys_setsockopt+0x17e/0x1b1
[  474.843343]  ? __check_object_size+0x76/0x19a
[  474.847960]  __ia32_compat_sys_socketcall+0x1cb/0x25b
[  474.853276]  do_fast_syscall_32+0xaf/0xf6
[  474.857548]  entry_SYSENTER_compat+0x6b/0x7a

Signed-off-by: Francesco Ruggeri <fruggeri@arista.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-12 00:35:18 +01:00
Parav Pandit
b904aada80 devlink: Add WARN_ON to catch errors of not cleaning devlink objects
Add WARN_ON to make sure that all sub objects of a devlink device are
cleanedup before freeing the devlink device.
This helps to catch any driver bugs.

Signed-off-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11 13:21:25 -08:00
Eric Dumazet
cf657d22ee net/x25: do not hold the cpu too long in x25_new_lci()
Due to quadratic behavior of x25_new_lci(), syzbot was able
to trigger an rcu stall.

Fix this by not blocking BH for the whole duration of
the function, and inserting a reschedule point when possible.

If we care enough, using a bitmap could get rid of the quadratic
behavior.

syzbot report :

rcu: INFO: rcu_preempt self-detected stall on CPU
rcu:    0-...!: (10500 ticks this GP) idle=4fa/1/0x4000000000000002 softirq=283376/283376 fqs=0
rcu:     (t=10501 jiffies g=383105 q=136)
rcu: rcu_preempt kthread starved for 10502 jiffies! g383105 f0x0 RCU_GP_WAIT_FQS(5) ->state=0x402 ->cpu=0
rcu: RCU grace-period kthread stack dump:
rcu_preempt     I28928    10      2 0x80000000
Call Trace:
 context_switch kernel/sched/core.c:2844 [inline]
 __schedule+0x817/0x1cc0 kernel/sched/core.c:3485
 schedule+0x92/0x180 kernel/sched/core.c:3529
 schedule_timeout+0x4db/0xfd0 kernel/time/timer.c:1803
 rcu_gp_fqs_loop kernel/rcu/tree.c:1948 [inline]
 rcu_gp_kthread+0x956/0x17a0 kernel/rcu/tree.c:2105
 kthread+0x357/0x430 kernel/kthread.c:246
 ret_from_fork+0x3a/0x50 arch/x86/entry/entry_64.S:352
NMI backtrace for cpu 0
CPU: 0 PID: 8759 Comm: syz-executor2 Not tainted 5.0.0-rc4+ #51
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 <IRQ>
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 nmi_cpu_backtrace.cold+0x63/0xa4 lib/nmi_backtrace.c:101
 nmi_trigger_cpumask_backtrace+0x1be/0x236 lib/nmi_backtrace.c:62
 arch_trigger_cpumask_backtrace+0x14/0x20 arch/x86/kernel/apic/hw_nmi.c:38
 trigger_single_cpu_backtrace include/linux/nmi.h:164 [inline]
 rcu_dump_cpu_stacks+0x183/0x1cf kernel/rcu/tree.c:1211
 print_cpu_stall kernel/rcu/tree.c:1348 [inline]
 check_cpu_stall kernel/rcu/tree.c:1422 [inline]
 rcu_pending kernel/rcu/tree.c:3018 [inline]
 rcu_check_callbacks.cold+0x500/0xa4a kernel/rcu/tree.c:2521
 update_process_times+0x32/0x80 kernel/time/timer.c:1635
 tick_sched_handle+0xa2/0x190 kernel/time/tick-sched.c:161
 tick_sched_timer+0x47/0x130 kernel/time/tick-sched.c:1271
 __run_hrtimer kernel/time/hrtimer.c:1389 [inline]
 __hrtimer_run_queues+0x33e/0xde0 kernel/time/hrtimer.c:1451
 hrtimer_interrupt+0x314/0x770 kernel/time/hrtimer.c:1509
 local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1035 [inline]
 smp_apic_timer_interrupt+0x120/0x570 arch/x86/kernel/apic/apic.c:1060
 apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:807
 </IRQ>
RIP: 0010:__read_once_size include/linux/compiler.h:193 [inline]
RIP: 0010:queued_write_lock_slowpath+0x13e/0x290 kernel/locking/qrwlock.c:86
Code: 00 00 fc ff df 4c 8d 2c 01 41 83 c7 03 41 0f b6 45 00 41 38 c7 7c 08 84 c0 0f 85 0c 01 00 00 8b 03 3d 00 01 00 00 74 1a f3 90 <41> 0f b6 55 00 41 38 d7 7c eb 84 d2 74 e7 48 89 df e8 6c 0f 4f 00
RSP: 0018:ffff88805f117bd8 EFLAGS: 00000206 ORIG_RAX: ffffffffffffff13
RAX: 0000000000000300 RBX: ffffffff89413ba0 RCX: 1ffffffff1282774
RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffffff89413ba0
RBP: ffff88805f117c70 R08: 1ffffffff1282774 R09: fffffbfff1282775
R10: fffffbfff1282774 R11: ffffffff89413ba3 R12: 00000000000000ff
R13: fffffbfff1282774 R14: 1ffff1100be22f7d R15: 0000000000000003
 queued_write_lock include/asm-generic/qrwlock.h:104 [inline]
 do_raw_write_lock+0x1d6/0x290 kernel/locking/spinlock_debug.c:203
 __raw_write_lock_bh include/linux/rwlock_api_smp.h:204 [inline]
 _raw_write_lock_bh+0x3b/0x50 kernel/locking/spinlock.c:312
 x25_insert_socket+0x21/0xe0 net/x25/af_x25.c:267
 x25_bind+0x273/0x340 net/x25/af_x25.c:705
 __sys_bind+0x23f/0x290 net/socket.c:1505
 __do_sys_bind net/socket.c:1516 [inline]
 __se_sys_bind net/socket.c:1514 [inline]
 __x64_sys_bind+0x73/0xb0 net/socket.c:1514
 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457e39
Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fafccd0dc78 EFLAGS: 00000246 ORIG_RAX: 0000000000000031
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457e39
RDX: 0000000000000012 RSI: 0000000020000240 RDI: 0000000000000004
RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fafccd0e6d4
R13: 00000000004bdf8b R14: 00000000004ce4b8 R15: 00000000ffffffff
Sending NMI from CPU 0 to CPUs 1:
NMI backtrace for cpu 1
CPU: 1 PID: 8752 Comm: syz-executor4 Not tainted 5.0.0-rc4+ #51
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
RIP: 0010:__x25_find_socket+0x78/0x120 net/x25/af_x25.c:328
Code: 89 f8 48 c1 e8 03 80 3c 18 00 0f 85 a6 00 00 00 4d 8b 64 24 68 4d 85 e4 74 7f e8 03 97 3d fb 49 83 ec 68 74 74 e8 f8 96 3d fb <49> 8d bc 24 88 04 00 00 48 89 f8 48 c1 e8 03 0f b6 04 18 84 c0 74
RSP: 0018:ffff8880639efc58 EFLAGS: 00000246
RAX: 0000000000040000 RBX: dffffc0000000000 RCX: ffffc9000e677000
RDX: 0000000000040000 RSI: ffffffff863244b8 RDI: ffff88806a764628
RBP: ffff8880639efc80 R08: ffff8880a80d05c0 R09: fffffbfff1282775
R10: fffffbfff1282774 R11: ffffffff89413ba3 R12: ffff88806a7645c0
R13: 0000000000000001 R14: ffff88809f29ac00 R15: 0000000000000000
FS:  00007fe8d0c58700(0000) GS:ffff8880ae900000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000001b32823000 CR3: 00000000672eb000 CR4: 00000000001406e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 x25_new_lci net/x25/af_x25.c:357 [inline]
 x25_connect+0x374/0xdf0 net/x25/af_x25.c:786
 __sys_connect+0x266/0x330 net/socket.c:1686
 __do_sys_connect net/socket.c:1697 [inline]
 __se_sys_connect net/socket.c:1694 [inline]
 __x64_sys_connect+0x73/0xb0 net/socket.c:1694
 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x457e39
Code: ad b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 7b b8 fb ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fe8d0c57c78 EFLAGS: 00000246 ORIG_RAX: 000000000000002a
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000457e39
RDX: 0000000000000012 RSI: 0000000020000200 RDI: 0000000000000004
RBP: 000000000073bf00 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe8d0c586d4
R13: 00000000004be378 R14: 00000000004ceb00 R15: 00000000ffffffff

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: Andrew Hendry <andrew.hendry@gmail.com>
Cc: linux-x25@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11 13:20:14 -08:00
Tristram Ha
cbd72b4852 net: dsa: microchip: add switch offload forwarding support
The flag offload_fwd_mark is set as the switch can forward frames by
itself.

This can be considered a fix to a problem introduced in commit
c2e866911e where the port membership are not set in sync.  The flag
offload_fwd_mark just needs to be set in tag_ksz.c to prevent the software
bridge from forwarding duplicate multicast frames.

Fixes: c2e866911e ("microchip: break KSZ9477 DSA driver into two files")
Signed-off-by: Tristram Ha <Tristram.Ha@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11 12:46:51 -08:00
Jouke Witteveen
989723b00b Documentation: bring operstate documentation up-to-date
Netlink has moved from bitmasks to group numbers long ago.

Signed-off-by: Jouke Witteveen <j.witteveen@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-11 12:38:51 -08:00
Davidlohr Bueso
e451eb5106 xsk: share the mmap_sem for page pinning
Holding mmap_sem exclusively for a gup() is an overkill. Lets
share the lock and replace the gup call for gup_longterm(), as
it is better suited for the lifetime of the pinning.

Fixes: c0c77d8fb7 ("xsk: add user memory registration support sockopt")
Signed-off-by: Davidlohr Bueso <dbueso@suse.de>
Cc: David S. Miller <davem@davemloft.net>
Cc: Bjorn Topel <bjorn.topel@intel.com>
Cc: Magnus Karlsson <magnus.karlsson@intel.com>
CC: netdev@vger.kernel.org
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-11 20:51:09 +01:00
Florian Westphal
8303b7e8f0 netfilter: nat: fix spurious connection timeouts
Sander Eikelenboom bisected a NAT related regression down
to the l4proto->manip_pkt indirection removal.

I forgot that ICMP(v6) errors (e.g. PKTTOOBIG) can be set as related
to the existing conntrack entry.

Therefore, when passing the skb to nf_nat_ipv4/6_manip_pkt(), that
ended up calling the wrong l4 manip function, as tuple->dst.protonum
is the original flows l4 protocol (TCP, UDP, etc).

Set the dst protocol field to ICMP(v6), we already have a private copy
of the tuple due to the inversion of src/dst.

Reported-by: Sander Eikelenboom <linux@eikelenboom.it>
Tested-by: Sander Eikelenboom <linux@eikelenboom.it>
Fixes: faec18dbb0 ("netfilter: nat: remove l4proto->manip_pkt")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-11 17:43:17 +01:00
Jann Horn
c4c07b4d6f netfilter: nf_nat_snmp_basic: add missing length checks in ASN.1 cbs
The generic ASN.1 decoder infrastructure doesn't guarantee that callbacks
will get as much data as they expect; callbacks have to check the `datalen`
parameter before looking at `data`. Make sure that snmp_version() and
snmp_helper() don't read/write beyond the end of the packet data.

(Also move the assignment to `pdata` down below the check to make it clear
that it isn't necessarily a pointer we can use before the `datalen` check.)

Fixes: cc2d58634e ("netfilter: nf_nat_snmp_basic: use asn1 decoder library")
Signed-off-by: Jann Horn <jannh@google.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-11 17:43:17 +01:00
Ilan Peer
6157ca0d6b mac80211: Fix Tx aggregation session tear down with ITXQs
When mac80211 requests the low level driver to stop an ongoing
Tx aggregation, the low level driver is expected to call
ieee80211_stop_tx_ba_cb_irqsafe() to indicate that it is ready
to stop the session. The callback in turn schedules a worker
to complete the session tear down, which in turn also handles
the relevant state for the intermediate Tx queue.

However, as this flow in asynchronous, the intermediate queue
should be stopped and not continue servicing frames, as in
such a case frames that are dequeued would be marked as part
of an aggregation, although the aggregation is already been
stopped.

Fix this by stopping the intermediate Tx queue, before
calling the low level driver to stop the Tx aggregation.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-11 15:50:56 +01:00
Johannes Berg
1fc9b72533 cfg80211: prevent speculation on cfg80211_classify8021d() return
It's possible that the caller of cfg80211_classify8021d() uses the
value to index an array, like mac80211 in ieee80211_downgrade_queue().
Prevent speculation on the return value.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-11 15:50:56 +01:00
Johannes Berg
ff1bab1ba1 cfg80211: pmsr: record netlink port ID
Without recording the netlink port ID, we cannot return the
results or complete messages to userspace, nor will we be
able to abort if the socket is closed, so clearly we need
to fill the value.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-11 15:50:56 +01:00
Aviya Erenfeld
ea18709a6f nl80211: Fix FTM per burst maximum value
Fix FTM per burst maximum value from 15 to 31
(The maximal bits that represents that number in the frame
is 5 hence a maximal value of 31)

Signed-off-by: Aviya Erenfeld <aviya.erenfeld@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-11 15:50:56 +01:00
Johannes Berg
4926b51bfa mac80211: call drv_ibss_join() on restart
If a driver does any significant activity in its ibss_join method,
then it will very well expect that to be called during restart,
before any stations are added. Do that.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-11 15:50:56 +01:00
Magnus Karlsson
e6762c8bcf xsk: add missing smp_rmb() in xsk_mmap
All the setup code in AF_XDP is protected by a mutex with the
exception of the mmap code that cannot use it. To make sure that a
process banging on the mmap call at the same time as another process
is setting up the socket, smp_wmb() calls were added in the umem
registration code and the queue creation code, so that the published
structures that xsk_mmap needs would be consistent. However, the
corresponding smp_rmb() calls were not added to the xsk_mmap
code. This patch adds these calls.

Fixes: 37b076933a ("xsk: add missing write- and data-dependency barrier")
Fixes: c0c77d8fb7 ("xsk: add user memory registration support sockopt")
Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-10 20:07:17 -08:00
Willem de Bruijn
b90efd2258 bpf: only adjust gso_size on bytestream protocols
bpf_skb_change_proto and bpf_skb_adjust_room change skb header length.
For GSO packets they adjust gso_size to maintain the same MTU.

The gso size can only be safely adjusted on bytestream protocols.
Commit d02f51cbcf ("bpf: fix bpf_skb_adjust_net/bpf_skb_proto_xlat
to deal with gso sctp skbs") excluded SKB_GSO_SCTP.

Since then type SKB_GSO_UDP_L4 has been added, whose contents are one
gso_size unit per datagram. Also exclude these.

Move from a blacklist to a whitelist check to future proof against
additional such new GSO types, e.g., for fraglist based GRO.

Fixes: bec1f6f697 ("udp: generate gso with UDP_SEGMENT")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-10 19:57:34 -08:00
Martin KaFai Lau
655a51e536 bpf: Add struct bpf_tcp_sock and BPF_FUNC_tcp_sock
This patch adds a helper function BPF_FUNC_tcp_sock and it
is currently available for cg_skb and sched_(cls|act):

struct bpf_tcp_sock *bpf_tcp_sock(struct bpf_sock *sk);

int cg_skb_foo(struct __sk_buff *skb) {
	struct bpf_tcp_sock *tp;
	struct bpf_sock *sk;
	__u32 snd_cwnd;

	sk = skb->sk;
	if (!sk)
		return 1;

	tp = bpf_tcp_sock(sk);
	if (!tp)
		return 1;

	snd_cwnd = tp->snd_cwnd;
	/* ... */

	return 1;
}

A 'struct bpf_tcp_sock' is also added to the uapi bpf.h to provide
read-only access.  bpf_tcp_sock has all the existing tcp_sock's fields
that has already been exposed by the bpf_sock_ops.
i.e. no new tcp_sock's fields are exposed in bpf.h.

This helper returns a pointer to the tcp_sock.  If it is not a tcp_sock
or it cannot be traced back to a tcp_sock by sk_to_full_sk(), it
returns NULL.  Hence, the caller needs to check for NULL before
accessing it.

The current use case is to expose members from tcp_sock
to allow a cg_skb_bpf_prog to provide per cgroup traffic
policing/shaping.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-10 19:46:17 -08:00
Martin KaFai Lau
9b1f3d6e5a bpf: Refactor sock_ops_convert_ctx_access
The next patch will introduce a new "struct bpf_tcp_sock" which
exposes the same tcp_sock's fields already exposed in
"struct bpf_sock_ops".

This patch refactor the existing convert_ctx_access() codes for
"struct bpf_sock_ops" to get them ready to be reused for
"struct bpf_tcp_sock".  The "rtt_min" is not refactored
in this patch because its handling is different from other
fields.

The SOCK_OPS_GET_TCP_SOCK_FIELD is new. All other SOCK_OPS_XXX_FIELD
changes are code move only.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-10 19:46:17 -08:00
Martin KaFai Lau
aa65d6960a bpf: Add state, dst_ip4, dst_ip6 and dst_port to bpf_sock
This patch adds "state", "dst_ip4", "dst_ip6" and "dst_port" to the
bpf_sock.  The userspace has already been using "state",
e.g. inet_diag (ss -t) and getsockopt(TCP_INFO).

This patch also allows narrow load on the following existing fields:
"family", "type", "protocol" and "src_port".  Unlike IP address,
the load offset is resticted to the first byte for them but it
can be relaxed later if there is a use case.

This patch also folds __sock_filter_check_size() into
bpf_sock_is_valid_access() since it is not called
by any where else.  All bpf_sock checking is in
one place.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-10 19:46:17 -08:00
Martin KaFai Lau
46f8bc9275 bpf: Add a bpf_sock pointer to __sk_buff and a bpf_sk_fullsock helper
In kernel, it is common to check "skb->sk && sk_fullsock(skb->sk)"
before accessing the fields in sock.  For example, in __netdev_pick_tx:

static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
			    struct net_device *sb_dev)
{
	/* ... */

	struct sock *sk = skb->sk;

		if (queue_index != new_index && sk &&
		    sk_fullsock(sk) &&
		    rcu_access_pointer(sk->sk_dst_cache))
			sk_tx_queue_set(sk, new_index);

	/* ... */

	return queue_index;
}

This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff"
where a few of the convert_ctx_access() in filter.c has already been
accessing the skb->sk sock_common's fields,
e.g. sock_ops_convert_ctx_access().

"__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
Some of the fileds in "bpf_sock" will not be directly
accessible through the "__sk_buff->sk" pointer.  It is limited
by the new "bpf_sock_common_is_valid_access()".
e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
     are not allowed.

The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
can be used to get a sk with all accessible fields in "bpf_sock".
This helper is added to both cg_skb and sched_(cls|act).

int cg_skb_foo(struct __sk_buff *skb) {
	struct bpf_sock *sk;

	sk = skb->sk;
	if (!sk)
		return 1;

	sk = bpf_sk_fullsock(sk);
	if (!sk)
		return 1;

	if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
		return 1;

	/* some_traffic_shaping(); */

	return 1;
}

(1) The sk is read only

(2) There is no new "struct bpf_sock_common" introduced.

(3) Future kernel sock's members could be added to bpf_sock only
    instead of repeatedly adding at multiple places like currently
    in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.

(4) After "sk = skb->sk", the reg holding sk is in type
    PTR_TO_SOCK_COMMON_OR_NULL.

(5) After bpf_sk_fullsock(), the return type will be in type
    PTR_TO_SOCKET_OR_NULL which is the same as the return type of
    bpf_sk_lookup_xxx().

    However, bpf_sk_fullsock() does not take refcnt.  The
    acquire_reference_state() is only depending on the return type now.
    To avoid it, a new is_acquire_function() is checked before calling
    acquire_reference_state().

(6) The WARN_ON in "release_reference_state()" is no longer an
    internal verifier bug.

    When reg->id is not found in state->refs[], it means the
    bpf_prog does something wrong like
    "bpf_sk_release(bpf_sk_fullsock(skb->sk))" where reference has
    never been acquired by calling "bpf_sk_fullsock(skb->sk)".

    A -EINVAL and a verbose are done instead of WARN_ON.  A test is
    added to the test_verifier in a later patch.

    Since the WARN_ON in "release_reference_state()" is no longer
    needed, "__release_reference_state()" is folded into
    "release_reference_state()" also.

Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2019-02-10 19:46:17 -08:00
Eli Cohen
eddd2cf195 net: Change TCA_ACT_* to TCA_ID_* to match that of TCA_ID_POLICE
Modify the kernel users of the TCA_ACT_* macros to use TCA_ID_*. For
example, use TCA_ID_GACT instead of TCA_ACT_GACT. This will align with
TCA_ID_POLICE and also differentiates these identifier, used in struct
tc_action_ops type field, from other macros starting with TCA_ACT_.

To make things clearer, we name the enum defining the TCA_ID_*
identifiers and also change the "type" field of struct tc_action to
id.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-10 09:28:43 -08:00
Eli Cohen
257eeded20 net: Move all TC actions identifiers to one place
Move all the TC identifiers to one place, to the same enum that defines
the identifier of police action. This makes it easier choose numbers for
new actions since they are now defined in one place. We preserve the
original values for binary compatibility. New IDs should be added inside
the enum.

Signed-off-by: Eli Cohen <eli@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-10 09:28:43 -08:00
Vakul Garg
8497ded2d1 net/tls: Disable async decrytion for tls1.3
Function tls_sw_recvmsg() dequeues multiple records from stream parser
and decrypts them. In case the decryption is done by async accelerator,
the records may get submitted for decryption while the previous ones may
not have been decryted yet. For tls1.3, the record type is known only
after decryption. Therefore, for tls1.3, tls_sw_recvmsg() may submit
records for decryption even if it gets 'handshake' records after 'data'
records. These intermediate 'handshake' records may do a key updation.
By the time new keys are given to ktls by userspace, it is possible that
ktls has already submitted some records i(which are encrypted with new
keys) for decryption using old keys. This would lead to decrypt failure.
Therefore, async decryption of records should be disabled for tls1.3.

Fixes: 130b392c6c ("net: tls: Add tls 1.3 support")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-09 09:27:49 -08:00
Sven Eckelmann
7e6f461efe batman-adv: Trigger genl notification on sysfs config change
The generic netlink code is expected to trigger notification messages when
configuration might have been changed. But the configuration of batman-adv
is most of the time still done using sysfs. So the sysfs interface should
also trigger the corresponding netlink messages via the "config" multicast
group.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:15 +01:00
Sven Eckelmann
9a182242f1 batman-adv: Add throughput_override hardif genl configuration
The B.A.T.M.A.N. V implementation tries to estimate the link throughput of
an interface to an originator using different automatic methods. It is
still possible to overwrite it the link throughput for all reachable
originators via this interface.

The BATADV_CMD_SET_HARDIF/BATADV_CMD_GET_HARDIF commands allow to set/get
the configuration of this feature using the u32
BATADV_ATTR_THROUGHPUT_OVERRIDE attribute. The used unit is in 100 Kbit/s.
If the value is set to 0 then batman-adv will try to estimate the
throughput by itself.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:15 +01:00
Sven Eckelmann
a108008290 batman-adv: Add elp_interval hardif genl configuration
The ELP packets are transmitted every elp_interval milliseconds on an
slave/hard-interface. This value can be changed using the configuration
interface.

The BATADV_CMD_SET_HARDIF/BATADV_CMD_GET_HARDIF commands allow to set/get
the configuration of this feature using the u32 BATADV_ATTR_ELP_INTERVAL
attribute.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:15 +01:00
Sven Eckelmann
7b751b39f0 batman-adv: Add orig_interval mesh genl configuration
The OGM packets are transmitted every orig_interval milliseconds. This
value can be changed using the configuration interface.

The BATADV_CMD_SET_MESH/BATADV_CMD_GET_MESH commands allow to set/get the
configuration of this feature using the u32 BATADV_ATTR_ORIG_INTERVAL
attribute.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:15 +01:00
Sven Eckelmann
6c57cde680 batman-adv: Add network_coding mesh genl configuration
The mesh interface can use (in an homogeneous mesh) network coding, a
mechanism that aims to increase the overall network throughput by fusing
multiple packets in one transmission.

The BATADV_CMD_SET_MESH/BATADV_CMD_GET_MESH commands allow to set/get the
configuration of this feature using the BATADV_ATTR_NETWORK_CODING_ENABLED
attribute. Setting the u8 to zero will disable this feature and setting it
to something else is enabling this feature.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:15 +01:00
Sven Eckelmann
f75b56bc91 batman-adv: Add multicast forceflood mesh genl configuration
The mesh interface can optimize the flooding of multicast packets based on
the content of the global translation tables. To disable this behavior and
use the broadcast-like flooding of the packets, forceflood has to be
enabled.

The BATADV_CMD_SET_MESH/BATADV_CMD_GET_MESH commands allow to set/get the
configuration of this feature using the
BATADV_ATTR_MULTICAST_FORCEFLOOD_ENABLED attribute. Setting the u8 to zero
will disable this feature (allowing multicast optimizations) and setting it
to something else is enabling this feature (forcing simple flooding).

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:14 +01:00
Sven Eckelmann
b85bd09109 batman-adv: Add log_level mesh genl configuration
In contrast to other modules, batman-adv allows to set the debug message
verbosity per mesh/soft-interface and not per module (via modparam).

The BATADV_CMD_SET_MESH/BATADV_CMD_GET_MESH commands allow to set/get the
configuration of this feature using the u32 (bitmask) BATADV_ATTR_LOG_LEVEL
attribute.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:14 +01:00
Sven Eckelmann
bfc7f1be57 batman-adv: Add hop_penalty mesh genl configuration
The TQ (B.A.T.M.A.N. IV) and throughput values (B.A.T.M.A.N. V) are reduced
when they are forwarded. One of the reductions is the penalty for
traversing an additional hop. This hop_penalty (0-255) defines the
percentage of reduction (0-100%).

The BATADV_CMD_SET_MESH/BATADV_CMD_GET_MESH commands allow to set/get the
configuration of this feature using the u8 BATADV_ATTR_HOP_PENALTY
attribute.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:14 +01:00
Sven Eckelmann
e2d0d35b5b batman-adv: Add gateway mesh genl configuration
The mesh/soft-interface can optimize the handling of DHCP packets. Instead
of flooding them through the whole mesh, it can be forwarded as unicast to
a specific gateway server. The originator which injects the packets in the
mesh has to select (based on sel_class thresholds) a responsible gateway
server. This is done by switching this originator to the gw_mode client.
The servers announce their forwarding bandwidth (download/upload) when the
gw_mode server was selected.

The BATADV_CMD_SET_MESH/BATADV_CMD_GET_MESH commands allow to set/get the
configuration of this feature using the attributes:

* u8 BATADV_ATTR_GW_MODE (0 == off, 1 == client, 2 == server)
* u32 BATADV_ATTR_GW_BANDWIDTH_DOWN (in 100 kbit/s steps)
* u32 BATADV_ATTR_GW_BANDWIDTH_UP (in 100 kbit/s steps)
* u32 BATADV_ATTR_GW_SEL_CLASS

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:14 +01:00
Sven Eckelmann
3e15b06eb7 batman-adv: Add fragmentation mesh genl configuration
The mesh interface can fragment unicast packets when the packet size
exceeds the outgoing slave/hard-interface MTU.

The BATADV_CMD_SET_MESH/BATADV_CMD_GET_MESH commands allow to set/get the
configuration of this feature using the BATADV_ATTR_FRAGMENTATION_ENABLED
attribute. Setting the u8 to zero will disable this feature and setting it
to something else is enabling this feature.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:14 +01:00
Sven Eckelmann
a1c8de8032 batman-adv: Add distributed_arp_table mesh genl configuration
The mesh interface can use a distributed hash table to answer ARP requests
without flooding the request through the whole mesh.

The BATADV_CMD_SET_MESH/BATADV_CMD_GET_MESH commands allow to set/get the
configuration of this feature using the
BATADV_ATTR_DISTRIBUTED_ARP_TABLE_ENABLED attribute. Setting the u8 to zero
will disable this feature and setting it to something else is enabling this
feature.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:14 +01:00
Sven Eckelmann
43ff6105a5 batman-adv: Add bridge_loop_avoidance mesh genl configuration
The mesh interface can try to detect loops in the same mesh caused by
(indirectly) bridged mesh/soft-interfaces of different nodes. Some of the
loops can also be resolved without breaking the mesh.

The BATADV_CMD_SET_MESH/BATADV_CMD_GET_MESH commands allow to set/get the
configuration of this feature using the
BATADV_ATTR_BRIDGE_LOOP_AVOIDANCE_ENABLED attribute. Setting the u8 to zero
will disable this feature and setting it to something else is enabling this
feature.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:14 +01:00
Sven Eckelmann
d7e52506b6 batman-adv: Add bonding mesh genl configuration
The mesh interface can use multiple slave/hard-interface ports at the same
time to transport the traffic to other nodes.

The BATADV_CMD_SET_MESH/BATADV_CMD_GET_MESH commands allow to set/get the
configuration of this feature using the BATADV_ATTR_BONDING_ENABLED
attribute. Setting the u8 to zero will disable this feature and setting it
to something else is enabling this feature.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:13 +01:00
Sven Eckelmann
e43d16b87d batman-adv: Add ap_isolation mesh/vlan genl configuration
The mesh interface can drop messages between clients to implement a
mesh-wide AP isolation.

The BATADV_CMD_SET_MESH/BATADV_CMD_GET_MESH and
BATADV_CMD_SET_VLAN/BATADV_CMD_GET_VLAN commands allow to set/get the
configuration of this feature using the BATADV_ATTR_AP_ISOLATION_ENABLED
attribute. Setting the u8 to zero will disable this feature and setting it
to something else is enabling this feature.

This feature also requires that skbuff which should be handled as isolated
are marked. The BATADV_CMD_SET_MESH/BATADV_CMD_GET_MESH commands allow to
set/get the mark/mask using the u32 attributes BATADV_ATTR_ISOLATION_MARK
and BATADV_ATTR_ISOLATION_MASK.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:13 +01:00
Sven Eckelmann
9ab4cee5ce batman-adv: Add aggregated_ogms mesh genl configuration
The mesh interface can delay OGM messages to aggregate different ogms
together in a single OGM packet.

The BATADV_CMD_SET_MESH/BATADV_CMD_GET_MESH commands allow to set/get the
configuration of this feature using the BATADV_ATTR_AGGREGATED_OGMS_ENABLED
attribute. Setting the u8 to zero will disable this feature and setting it
to something else is enabling this feature.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:13 +01:00
Sven Eckelmann
49e7e37cd9 batman-adv: Prepare framework for vlan genl config
The batman-adv configuration interface was implemented solely using sysfs.
This approach was condemned by non-batadv developers as "huge mistake".
Instead a netlink/genl based implementation was suggested.

Beside the mesh/soft-interface specific configuration, the VLANs on top of
the mesh/soft-interface have configuration settings. The genl interface
reflects this by allowing to get/set it using the vlan specific commands
BATADV_CMD_GET_VLAN/BATADV_CMD_SET_VLAN.

The set command BATADV_CMD_SET_MESH will also notify interested userspace
listeners of the "config" mcast group using the BATADV_CMD_SET_VLAN command
message type that settings might have been changed and what the current
values are.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:13 +01:00
Sven Eckelmann
5c55a40fa8 batman-adv: Prepare framework for hardif genl config
The batman-adv configuration interface was implemented solely using sysfs.
This approach was condemned by non-batadv developers as "huge mistake".
Instead a netlink/genl based implementation was suggested.

Beside the mesh/soft-interface specific configuration, the
slave/hard-interface have B.A.T.M.A.N. V specific configuration settings.
The genl interface reflects this by allowing to get/set it using the
hard-interface specific commands.

The BATADV_CMD_GET_HARDIFS (or short version BATADV_CMD_GET_HARDIF) is
reused as get command because it already allow sto dump the content of
other information from the slave/hard-interface which are not yet
configuration specific.

The set command BATADV_CMD_SET_HARDIF will also notify interested userspace
listeners of the "config" mcast group using the BATADV_CMD_SET_HARDIF
command message type that settings might have been changed and what the
current values are.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:13 +01:00
Sven Eckelmann
6004051353 batman-adv: Prepare framework for mesh genl config
The batman-adv configuration interface was implemented solely using sysfs.
This approach was condemned by non-batadv developers as "huge mistake".
Instead a netlink/genl based implementation was suggested.

The main objects for this configuration is the mesh/soft-interface object.
Its actual object in memory already contains most of the available
configuration settings. The genl interface reflects this by allowing to
get/set it using the mesh specific commands.

The BATADV_CMD_GET_MESH_INFO (or short version BATADV_CMD_GET_MESH) is
reused as get command because it already provides the content of other
information from the mesh/soft-interface which are not yet configuration
specific.

The set command BATADV_CMD_SET_MESH will also notify interested userspace
listeners of the "config" mcast group using the BATADV_CMD_SET_MESH command
message type that settings might have been changed and what the current
values are.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:13 +01:00
Sven Eckelmann
c4a7a8d9bb batman-adv: Move common genl doit code pre/post hooks
The commit ff4c92d85c ("genetlink: introduce pre_doit/post_doit hooks")
intoduced a mechanism to run specific code for doit hooks before/after the
hooks are run. Since all doit hooks are requiring the batadv softif, it
should be retrieved/freed in these helpers to simplify the code.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-09 14:28:13 +01:00
Nathan Chancellor
8b34ec65b3 ethtool: Remove unnecessary null check in ethtool_rx_flow_rule_create
net/core/ethtool.c:3023:19: warning: address of array
'ext_m_spec->h_dest' will always evaluate to 'true'
[-Wpointer-bool-conversion]
                if (ext_m_spec->h_dest) {
                ~~  ~~~~~~~~~~~~^~~~~~

h_dest is an array, it can't be null so remove this check.

Fixes: eca4205f9e ("ethtool: add ethtool_rx_flow_spec to flow_rule structure translator")
Link: https://github.com/ClangBuiltLinux/linux/issues/353
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 23:05:18 -08:00
Gustavo A. R. Silva
33b363e004 net: dsa: use struct_size() in devm_kzalloc()
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    struct boo entry[];
};

size = sizeof(struct foo) + count * sizeof(struct boo);
instance = alloc(size, GFP_KERNEL)

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = alloc(struct_size(instance, entry, count), GFP_KERNEL)

Notice that, in this case, variable size is not necessary, hence it is
removed.

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 22:57:27 -08:00
Gustavo A. R. Silva
b4ba9354cc mpls_iptunnel: use struct_size() helper
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    struct boo entry[];
};

instance = alloc(sizeof(struct foo) + count * sizeof(struct boo));

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

instance = alloc(struct_size(instance, entry, count));

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 22:57:27 -08:00
Gustavo A. R. Silva
8fe5756c73 net/sched: use struct_size() helper
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    struct boo entry[];
};

size = sizeof(struct foo) + count * sizeof(struct boo);
instance = alloc(size, GFP_KERNEL)

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

size = struct_size(instance, entry, count);
instance = alloc(size, GFP_KERNEL)

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 22:57:27 -08:00
Gustavo A. R. Silva
4154b567b6 bridge: use struct_size() helper
One of the more common cases of allocation size calculations is finding
the size of a structure that has a zero-sized array at the end, along
with memory for some number of elements for that array. For example:

struct foo {
    int stuff;
    struct boo entry[];
};

size = sizeof(struct foo) + count * sizeof(struct boo);
instance = alloc(size, GFP_KERNEL)

Instead of leaving these open-coded and prone to type mistakes, we can
now use the new struct_size() helper:

size = struct_size(instance, entry, count);

This code was detected with the help of Coccinelle.

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 22:57:27 -08:00
Ursula Braun
ccc8ca9b90 net/smc: fix byte_order for rx_curs_confirmed
The recent change in the rx_curs_confirmed assignment disregards
byte order, which causes problems on little endian architectures.
This patch fixes it.

Fixes: b8649efad8 ("net/smc: fix sender_free computation") (net-tree)
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 22:33:25 -08:00
Paolo Abeni
225d946426 vsock: cope with memory allocation failure at socket creation time
In the unlikely event that the kmalloc call in vmci_transport_socket_init()
fails, we end-up calling vmci_transport_destruct() with a NULL vmci_trans()
and oopsing.

This change addresses the above explicitly checking for zero vmci_trans()
at destruction time.

Reported-by: Xiumei Mu <xmu@redhat.com>
Fixes: d021c34405 ("VSOCK: Introduce VM Sockets")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 22:32:05 -08:00
Lorenzo Bianconi
c09551c6ff net: ipv4: use a dedicated counter for icmp_v4 redirect packets
According to the algorithm described in the comment block at the
beginning of ip_rt_send_redirect, the host should try to send
'ip_rt_redirect_number' ICMP redirect packets with an exponential
backoff and then stop sending them at all assuming that the destination
ignores redirects.
If the device has previously sent some ICMP error packets that are
rate-limited (e.g TTL expired) and continues to receive traffic,
the redirect packets will never be transmitted. This happens since
peer->rate_tokens will be typically greater than 'ip_rt_redirect_number'
and so it will never be reset even if the redirect silence timeout
(ip_rt_redirect_silence) has elapsed without receiving any packet
requiring redirects.

Fix it by using a dedicated counter for the number of ICMP redirect
packets that has been sent by the host

I have not been able to identify a given commit that introduced the
issue since ip_rt_send_redirect implements the same rate-limiting
algorithm from commit 1da177e4c3 ("Linux-2.6.12-rc2")

Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 21:50:15 -08:00
Jiri Pirko
7c62cfb8c5 devlink: publish params only after driver init is done
Currently, user can do dump or get of param values right after the
devlink params are registered. However the driver may not be initialized
which is an issue. The same problem happens during notification
upon param registration. Allow driver to publish devlink params
whenever it is ready to handle get() ops. Note that this cannot
be resolved by init reordering, as the "driverinit" params have
to be available before the driver is initialized (it needs the param
values there).

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Cc: Michael Chan <michael.chan@broadcom.com>
Cc: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 15:02:49 -08:00
David S. Miller
a655fe9f19 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
An ipvlan bug fix in 'net' conflicted with the abstraction away
of the IPV6 specific support in 'net-next'.

Similarly, a bug fix for mlx5 in 'net' conflicted with the flow
action conversion in 'net-next'.

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-08 15:00:17 -08:00
Linus Torvalds
27b4ad621e Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:
 "This pull request is dedicated to the upcoming snowpocalypse parts 2
  and 3 in the Pacific Northwest:

   1) Drop profiles are broken because some drivers use dev_kfree_skb*
      instead of dev_consume_skb*, from Yang Wei.

   2) Fix IWLWIFI kconfig deps, from Luca Coelho.

   3) Fix percpu maps updating in bpftool, from Paolo Abeni.

   4) Missing station release in batman-adv, from Felix Fietkau.

   5) Fix some networking compat ioctl bugs, from Johannes Berg.

   6) ucc_geth must reset the BQL queue state when stopping the device,
      from Mathias Thore.

   7) Several XDP bug fixes in virtio_net from Toshiaki Makita.

   8) TSO packets must be sent always on queue 0 in stmmac, from Jose
      Abreu.

   9) Fix socket refcounting bug in RDS, from Eric Dumazet.

  10) Handle sparse cpu allocations in bpf selftests, from Martynas
      Pumputis.

  11) Make sure mgmt frames have enough tailroom in mac80211, from Felix
      Feitkau.

  12) Use safe list walking in sctp_sendmsg() asoc list traversal, from
      Greg Kroah-Hartman.

  13) Make DCCP's ccid_hc_[rt]x_parse_options always check for NULL
      ccid, from Eric Dumazet.

  14) Need to reload WoL password into bcmsysport device after deep
      sleeps, from Florian Fainelli.

  15) Remove filter from mask before freeing in cls_flower, from Petr
      Machata.

  16) Missing release and use after free in error paths of s390 qeth
      code, from Julian Wiedmann.

  17) Fix lockdep false positive in dsa code, from Marc Zyngier.

  18) Fix counting of ATU violations in mv88e6xxx, from Andrew Lunn.

  19) Fix EQ firmware assert in qed driver, from Manish Chopra.

  20) Don't default Caivum PTP to Y in kconfig, from Bjorn Helgaas"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
  net: dsa: b53: Fix for failure when irq is not defined in dt
  sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
  geneve: should not call rt6_lookup() when ipv6 was disabled
  net: Don't default Cavium PTP driver to 'y'
  net: broadcom: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: via-velocity: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: tehuti: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: sun: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: fsl_ucc_hdlc: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: fec_mpc52xx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: smsc: epic100: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: dscc4: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: tulip: de2104x: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net: defxx: replace dev_kfree_skb_irq by dev_consume_skb_irq for drop profiles
  net/mlx5e: Don't overwrite pedit action when multiple pedit used
  net/mlx5e: Update hw flows when encap source mac changed
  qed*: Advance drivers version to 8.37.0.20
  qed: Change verbosity for coalescing message.
  qede: Fix system crash on configuring channels.
  qed: Consider TX tcs while deriving the max num_queues for PF.
  ...
2019-02-08 11:21:54 -08:00
Karsten Graul
232dc8ef64 net/smc: original socket family in inet_sock_diag
Commit ed75986f4a ("net/smc: ipv6 support for smc_diag.c") changed the
value of the diag_family field. The idea was to indicate the family of
the IP address in the inet_diag_sockid field. But the change makes it
impossible to distinguish an inet_sock_diag response message from SMC
sock_diag response. This patch restores the original behaviour and sends
AF_SMC as value of the diag_family field.

Fixes: ed75986f4a ("net/smc: ipv6 support for smc_diag.c")
Reported-by: Eugene Syromiatnikov <esyr@redhat.com>
Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 18:06:19 -08:00
Karsten Graul
8fc002b01a net/smc: move code to clear the conn->lgr field
The lgr field of an smc_connection is set in smc_conn_create() and
should be cleared in smc_conn_free() for consistency reasons, so move
the responsible code.

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 18:06:19 -08:00
Hans Wippel
72a36a8aec net/smc: use client and server LGR pending locks for SMC-R
If SMC client and server connections are both established at the same
time, smc_connect_rdma() cannot send a CLC confirm message while
smc_listen_work() is waiting for one due to lock contention. This can
result in timeouts in smc_clc_wait_msg() and failed SMC connections.

In case of SMC-R, there are two types of LGRs (client and server LGRs)
which can be protected by separate locks. So, this patch splits the LGR
pending lock into two separate locks for client and server to avoid the
locking issue for SMC-R.

Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 18:06:18 -08:00
Hans Wippel
62c7139f3e net/smc: unlock LGR pending lock earlier for SMC-D
If SMC client and server connections are both established at the same
time, smc_connect_ism() cannot send a CLC confirm message while
smc_listen_work() is waiting for one due to lock contention. This can
result in timeouts in smc_clc_wait_msg() and failed SMC connections.

In case of SMC-D, the LGR pending lock is not needed while
smc_listen_work() is waiting for the CLC confirm message. So, this patch
releases the lock earlier for SMC-D to avoid the locking issue.

Signed-off-by: Hans Wippel <hwippel@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 18:06:18 -08:00
Ursula Braun
a225d2cd88 net/smc: use smc_curs_copy() for SMC-D
SMC already provides a wrapper for atomic64 calls to be
architecture independent. Use this wrapper for SMC-D as well.

Reported-by: Jens Remus <jremus@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 18:06:18 -08:00
Ursula Braun
b03faa1faf net/smc: postpone release of clcsock
According to RFC7609 (http://www.rfc-editor.org/info/rfc7609)
first the SMC-R connection is shut down and then the normal TCP
connection FIN processing drives cleanup of the internal TCP connection.
The unconditional release of the clcsock during active socket closing
has to be postponed if the peer has not yet signalled socket closing.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 18:06:18 -08:00
Hangbin Liu
173656acca sit: check if IPv6 enabled before calling ip6_err_gen_icmpv6_unreach()
If we disabled IPv6 from the kernel command line (ipv6.disable=1), we should
not call ip6_err_gen_icmpv6_unreach(). This:

  ip link add sit1 type sit local 192.0.2.1 remote 192.0.2.2 ttl 1
  ip link set sit1 up
  ip addr add 198.51.100.1/24 dev sit1
  ping 198.51.100.2

if IPv6 is disabled at boot time, will crash the kernel.

v2: there's no need to use in6_dev_get(), use __in6_dev_get() instead,
    as we only need to check that idev exists and we are under
    rcu_read_lock() (from netif_receive_skb_internal()).

Reported-by: Jianlin Shi <jishi@redhat.com>
Fixes: ca15a078bd ("sit: generate icmpv6 error when receiving icmpv4 error")
Cc: Oussama Ghorbel <ghorbel@pivasoftware.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: Stefano Brivio <sbrivio@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 10:48:42 -08:00
Eran Ben Elisha
35455e23e6 devlink: Add health dump {get,clear} commands
Add devlink health dump commands, in order to run an dump operation
over a specific reporter.

The supported operations are dump_get in order to get last saved
dump (if not exist, dump now) and dump_clear to clear last saved
dump.

It is expected from driver's callback for diagnose command to fill it
via the devlink fmsg API. Devlink will parse it and convert it to
netlink nla API in order to pass it to the user.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 10:34:29 -08:00
Eran Ben Elisha
fca42a2794 devlink: Add health diagnose command
Add devlink health diagnose command, in order to run a diagnose
operation over a specific reporter.

It is expected from driver's callback for diagnose command to fill it
via the devlink fmsg API. Devlink will parse it and convert it to
netlink nla API in order to pass it to the user.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 10:34:28 -08:00
Eran Ben Elisha
20a0943a5b devlink: Add health recover command
Add devlink health recover command to the uapi, in order to allow the user
to execute a recover operation over a specific reporter.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 10:34:28 -08:00
Eran Ben Elisha
a1e55ec0a0 devlink: Add health set command
Add devlink health set command, in order to set configuration parameters
for a specific reporter.
Supported parameters are:
- graceful_period: Time interval between auto recoveries (in msec)
- auto_recover: Determines if the devlink shall execute recover upon
		receiving error for the reporter

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 10:34:28 -08:00
Eran Ben Elisha
7afe335a8b devlink: Add health get command
Add devlink health get command to provide reporter/s data for user space.
Add the ability to get data per reporter or dump data from all available
reporters.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 10:34:28 -08:00
Eran Ben Elisha
c8e1da0bf9 devlink: Add health report functionality
Upon error discover, every driver can report it to the devlink health
mechanism via devlink_health_report function, using the appropriate
reporter registered to it. Driver can pass error specific context which
will be delivered to it as part of the dump / recovery callbacks.

Once an error is reported, devlink health will do the following actions:
* A log is being send to the kernel trace events buffer
* Health status and statistics are being updated for the reporter instance
* Object dump is being taken and stored at the reporter instance (as long
  as there is no other dump which is already stored)
* Auto recovery attempt is being done. Depends on:
  - Auto Recovery configuration
  - Grace period vs. Time since last recover

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 10:34:28 -08:00
Eran Ben Elisha
a0bdcc59d1 devlink: Add health reporter create/destroy functionality
Devlink health reporter is an instance for reporting, diagnosing and
recovering from run time errors discovered by the reporters.
Define it's data structure and supported operations.
In addition, expose devlink API to create and destroy a reporter.
Each devlink instance will hold it's own reporters list.

As part of the allocation, driver shall provide a set of callbacks which
will be used by devlink in order to handle health reports and user
commands related to this reporter. In addition, driver is entitled to
provide some priv pointer, which can be fetched from the reporter by
devlink_health_reporter_priv function.

For each reporter, devlink will hold a metadata of statistics,
dump msg and status.

For passing dumps and diagnose data to the user-space, it will use devlink
fmsg API.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 10:34:28 -08:00
Eran Ben Elisha
1db64e8733 devlink: Add devlink formatted message (fmsg) API
Devlink fmsg is a mechanism to pass descriptors between drivers and
devlink, in json-like format. The API allows the driver to add nested
attributes such as object, object pair and value array, in addition to
attributes such as name and value.

Driver can use this API to fill the fmsg context in a format which will be
translated by the devlink to the netlink message later.
There is no memory allocation in advance (other than the initial list
head), and it dynamically allocates messages descriptors and add them to
the list on the fly.

When it needs to send the data using SKBs to the netlink layer, it
fragments the data between different SKBs. In order to do this
fragmentation, it uses virtual nests attributes, to avoid actual
nesting use which cannot be divided between different SKBs.

Signed-off-by: Eran Ben Elisha <eranbe@mellanox.com>
Reviewed-by: Moshe Shemesh <moshe@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-07 10:34:28 -08:00
David S. Miller
042a41977b Merge branch 'for_net-next-5.1/rds-tos-v4' of git://git.kernel.org/pub/scm/linux/kernel/git/ssantosh/linux
Santosh Shilimkar says:

====================
rds: add tos support

RDS applications make use of tos to classify database traffic.
This feature has been used in shipping products from 2.6.32 based
kernels. Its tied with RDS v4.1 protocol version and the compatibility
gets negotiated as part of connections setup.

Patchset keeps full backward compatibility using existing connection
negotiation scheme. Currently the feature is exploited by RDMA
transport and for TCP transport the user tos values are mapped to
same default class (0).

For RDMA transports, RDMA CM service type API is used to
set up different SL(service lanes) and the IB fabric is configured
for tos mapping using Subnet Manager(SL to VL mappings).
Similarly for ROCE fabric, user priority is mapped with different
DSCP code points which are associated with different switch queues
in the fabric.

The original code was developed by Bang Nguyen in downstream kernel back in
2.6.32 kernel days and it has evolved significantly over period of time.

Thanks to Yanjun for doing testing with various combinations of host like
v3.1<->v4.1, v4.1.<->v3.1, v4.1 upstream to shipping v4.1 etc etc
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 17:00:15 -08:00
David S. Miller
e90b1fd83c Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2019-02-07

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add a riscv64 JIT for BPF, from Björn.

2) Implement BTF deduplication algorithm for libbpf which takes BTF type
   information containing duplicate per-compilation unit information and
   reduces it to an equivalent set of BTF types with no duplication and
   without loss of information, from Andrii.

3) Offloaded and native BPF XDP programs can coexist today, enable also
   offloaded and generic ones as well, from Jakub.

4) Expose various BTF related helper functions in libbpf as API which
   are in particular helpful for JITed programs, from Yonghong.

5) Fix the recently added JMP32 code emission in s390x JIT, from Heiko.

6) Fix BPF kselftests' tcp_{server,client}.py to be able to run inside
   a network namespace, also add a fix for libbpf to get libbpf_print()
   working, from Stanislav.

7) Fixes for bpftool documentation, from Prashant.

8) Type cleanup in BPF kselftests' test_maps.c to silence a gcc8 warning,
   from Breno.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 16:56:20 -08:00
Florian Fainelli
bccb30254a net: Get rid of SWITCHDEV_ATTR_ID_PORT_PARENT_ID
Now that we have a dedicated NDO for getting a port's parent ID, get rid
of SWITCHDEV_ATTR_ID_PORT_PARENT_ID and convert all callers to use the
NDO exclusively. This is a preliminary change to getting rid of
switchdev_ops eventually.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 14:17:03 -08:00
Florian Fainelli
929d6c145e net: dsa: Implement ndo_get_port_parent_id()
DSA implements SWITCHDEV_ATTR_ID_PORT_PARENT_ID and we want to get rid
of switchdev_ops eventually, ease that migration by implementing a
ndo_get_port_parent_id() function which returns what
switchdev_port_attr_get() would do.

Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 14:17:03 -08:00
Florian Fainelli
d6abc59694 net: Introduce ndo_get_port_parent_id()
In preparation for getting rid of switchdev_ops, create a dedicated NDO
operation for getting the port's parent identifier. There are
essentially two classes of drivers that need to implement getting the
port's parent ID which are VF/PF drivers with a built-in switch, and
pure switchdev drivers such as mlxsw, ocelot, dsa etc.

We introduce a helper function: dev_get_port_parent_id() which supports
recursion into the lower devices to obtain the first port's parent ID.

Convert the bridge, core and ipv4 multicast routing code to check for
such ndo_get_port_parent_id() and call the helper function when valid
before falling back to switchdev_port_attr_get(). This will allow us to
convert all relevant drivers in one go instead of having to implement
both switchdev_port_attr_get() and ndo_get_port_parent_id() operations,
then get rid of switchdev_port_attr_get().

Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 14:16:11 -08:00
Dan Carpenter
00670cb8a7 net: dsa: Fix NULL checking in dsa_slave_set_eee()
This function can't succeed if dp->pl is NULL.  It will Oops inside the
call to return phylink_ethtool_get_eee(dp->pl, e);

Fixes: 1be52e97ed ("dsa: slave: eee: Allow ports to use phylink")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 13:42:54 -08:00
Chuck Lever
e248aa7be8 svcrdma: Remove max_sge check at connect time
Two and a half years ago, the client was changed to use gathered
Send for larger inline messages, in commit 655fec6987 ("xprtrdma:
Use gathered Send for large inline messages"). Several fixes were
required because there are a few in-kernel device drivers whose
max_sge is 3, and these were broken by the change.

Apparently my memory is going, because some time later, I submitted
commit 25fd86eca1 ("svcrdma: Don't overrun the SGE array in
svc_rdma_send_ctxt"), and after that, commit f3c1fd0ee2 ("svcrdma:
Reduce max_send_sges"). These too incorrectly assumed in-kernel
device drivers would have more than a few Send SGEs available.

The fix for the server side is not the same. This is because the
fundamental problem on the server is that, whether or not the client
has provisioned a chunk for the RPC reply, the server must squeeze
even the most complex RPC replies into a single RDMA Send. Failing
in the send path because of Send SGE exhaustion should never be an
option.

Therefore, instead of failing when the send path runs out of SGEs,
switch to using a bounce buffer mechanism to handle RPC replies that
are too complex for the device to send directly. That allows us to
remove the max_sge check to enable drivers with small max_sge to
work again.

Reported-by: Don Dutile <ddutile@redhat.com>
Fixes: 25fd86eca1 ("svcrdma: Don't overrun the SGE array in ...")
Cc: stable@vger.kernel.org
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2019-02-06 15:32:34 -05:00
Nir Dotan
57186a5f43 devlink: add hardware errors tracing facility
Define a tracepoint and allow user to trace messages in case of an hardware
error code for hardware associated with devlink instance.

Signed-off-by: Nir Dotan <nird@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 11:05:57 -08:00
Eric Dumazet
6dce3c20ac rxrpc: bad unlock balance in rxrpc_recvmsg
When either "goto wait_interrupted;" or "goto wait_error;"
paths are taken, socket lock has already been released.

This patch fixes following syzbot splat :

WARNING: bad unlock balance detected!
5.0.0-rc4+ #59 Not tainted
-------------------------------------
syz-executor223/8256 is trying to release lock (sk_lock-AF_RXRPC) at:
[<ffffffff86651353>] rxrpc_recvmsg+0x6d3/0x3099 net/rxrpc/recvmsg.c:598
but there are no more locks to release!

other info that might help us debug this:
1 lock held by syz-executor223/8256:
 #0: 00000000fa9ed0f4 (slock-AF_RXRPC){+...}, at: spin_lock_bh include/linux/spinlock.h:334 [inline]
 #0: 00000000fa9ed0f4 (slock-AF_RXRPC){+...}, at: release_sock+0x20/0x1c0 net/core/sock.c:2798

stack backtrace:
CPU: 1 PID: 8256 Comm: syz-executor223 Not tainted 5.0.0-rc4+ #59
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x172/0x1f0 lib/dump_stack.c:113
 print_unlock_imbalance_bug kernel/locking/lockdep.c:3391 [inline]
 print_unlock_imbalance_bug.cold+0x114/0x123 kernel/locking/lockdep.c:3368
 __lock_release kernel/locking/lockdep.c:3601 [inline]
 lock_release+0x67e/0xa00 kernel/locking/lockdep.c:3860
 sock_release_ownership include/net/sock.h:1471 [inline]
 release_sock+0x183/0x1c0 net/core/sock.c:2808
 rxrpc_recvmsg+0x6d3/0x3099 net/rxrpc/recvmsg.c:598
 sock_recvmsg_nosec net/socket.c:794 [inline]
 sock_recvmsg net/socket.c:801 [inline]
 sock_recvmsg+0xd0/0x110 net/socket.c:797
 __sys_recvfrom+0x1ff/0x350 net/socket.c:1845
 __do_sys_recvfrom net/socket.c:1863 [inline]
 __se_sys_recvfrom net/socket.c:1859 [inline]
 __x64_sys_recvfrom+0xe1/0x1a0 net/socket.c:1859
 do_syscall_64+0x103/0x610 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x446379
Code: e8 2c b3 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 2b 09 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007fe5da89fd98 EFLAGS: 00000246 ORIG_RAX: 000000000000002d
RAX: ffffffffffffffda RBX: 00000000006dbc28 RCX: 0000000000446379
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000003
RBP: 00000000006dbc20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dbc2c
R13: 0000000000000000 R14: 0000000000000000 R15: 20c49ba5e353f7cf

Fixes: 248f219cb8 ("rxrpc: Rewrite the data and ack handling code")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Howells <dhowells@redhat.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 10:54:07 -08:00
Martin Weinelt
5f320f09b3 batman-adv: fix memory leak in in batadv_dat_put_dhcp
batadv_dat_put_dhcp is creating a new ARP packet via
batadv_dat_arp_create_reply and tries to forward it via
batadv_dat_send_data to different peers in the DHT. The original skb is not
consumed by batadv_dat_send_data and thus has to be consumed by the caller.

Fixes: b61ec31c85 ("batman-adv: Snoop DHCPACKs for DAT")
Signed-off-by: Martin Weinelt <martin@linuxlounge.net>
[sven@narfation.org: add commit message]
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2019-02-06 19:42:52 +01:00
Pablo Neira Ayuso
eca4205f9e ethtool: add ethtool_rx_flow_spec to flow_rule structure translator
This patch adds a function to translate the ethtool_rx_flow_spec
structure to the flow_rule representation.

This allows us to reuse code from the driver side given that both flower
and ethtool_rx_flow interfaces use the same representation.

This patch also includes support for the flow type flags FLOW_EXT,
FLOW_MAC_EXT and FLOW_RSS.

The ethtool_rx_flow_spec_input wrapper structure is used to convey the
rss_context field, that is away from the ethtool_rx_flow_spec structure,
and the ethtool_rx_flow_spec structure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 10:38:26 -08:00
Pablo Neira Ayuso
2cd173e6d5 cls_flower: don't expose TC actions to drivers anymore
Now that drivers have been converted to use the flow action
infrastructure, remove this field from the tc_cls_flower_offload
structure.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 10:38:26 -08:00
Pablo Neira Ayuso
3b1903ef97 flow_offload: add statistics retrieval infrastructure and use it
This patch provides the flow_stats structure that acts as container for
tc_cls_flower_offload, then we can use to restore the statistics on the
existing TC actions. Hence, tcf_exts_stats_update() is not used from
drivers anymore.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 10:38:25 -08:00
Pablo Neira Ayuso
3a7b68617d cls_api: add translator to flow_action representation
This patch implements a new function to translate from native TC action
to the new flow_action representation. Moreover, this patch also updates
cls_flower to use this new function.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 10:38:25 -08:00
Pablo Neira Ayuso
e3ab786b42 flow_offload: add flow action infrastructure
This new infrastructure defines the nic actions that you can perform
from existing network drivers. This infrastructure allows us to avoid a
direct dependency with the native software TC action representation.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 10:38:25 -08:00
Pablo Neira Ayuso
8f2566225a flow_offload: add flow_rule and flow_match structures and use them
This patch wraps the dissector key and mask - that flower uses to
represent the matching side - around the flow_match structure.

To avoid a follow up patch that would edit the same LoCs in the drivers,
this patch also wraps this new flow match structure around the flow rule
object. This new structure will also contain the flow actions in follow
up patches.

This introduces two new interfaces:

	bool flow_rule_match_key(rule, dissector_id)

that returns true if a given matching key is set on, and:

	flow_rule_match_XYZ(rule, &match);

To fetch the matching side XYZ into the match container structure, to
retrieve the key and the mask with one single call.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-06 10:38:25 -08:00
Jakub Kicinski
9ee963d6a1 net: xdp: allow generic and driver XDP on one interface
Since commit a25717d2b6 ("xdp: support simultaneous driver and
hw XDP attachment") users can load an XDP program for offload and
in native driver mode simultaneously.  Allow a similar mix of
offload and SKB mode/generic XDP.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2019-02-06 15:35:42 +01:00
Johannes Berg
73350424be cfg80211: pmsr: fix abort locking
When we destroy the interface we already hold the wdev->mtx
while calling cfg80211_pmsr_wdev_down(), which assumes this
isn't true and flushes the worker that takes the lock, thus
leading to a deadlock.

Fix this by refactoring the worker and calling its code in
cfg80211_pmsr_wdev_down() directly.

We still need to flush the work later to make sure it's not
still running and will crash, but it will not do anything.

Fixes: 9bb7e0f24e ("cfg80211: add peer measurement with FTM initiator API")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-06 10:38:00 +01:00
Johannes Berg
0acd99282b cfg80211: pmsr: fix MAC address setting
When we *don't* have a MAC address attribute, we shouldn't
try to use this - this was intended to copy the local MAC
address instead, so fix it.

Fixes: 9bb7e0f24e ("cfg80211: add peer measurement with FTM initiator API")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2019-02-06 10:37:42 +01:00
David S. Miller
f09bef61f1 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for net:

1) Use CONFIG_NF_TABLES_INET from seltests, not NF_TABLES_INET.
   From Naresh Kamboju.

2) Add a test to cover masquerading and redirect case, from Florian
   Westphal.

3) Two packets coming from the same socket may race to set up NAT,
   ending up with different tuples and the packet losing race being
   dropped. Update nf_conntrack_tuple_taken() to exercise clash
   resolution for this case. From Martynas Pumputis and Florian
   Westphal.

4) Unbind anonymous sets from the commit and abort path, this fixes
   a splat due to double set list removal/release in case that the
   transaction needs to be aborted.

5) Do not preserve original output interface for packets that are
   redirected in the output chain when ip6_route_me_harder() is
   called. Otherwise packets end up going not going to the loopback
   device. From Eli Cooper.

6) Fix bogus splat in nft_compat with CONFIG_REFCOUNT_FULL=y, this
   also simplifies the existing logic to deal with the list insertions
   of the xtables extensions. From Florian Westphal.

Diffstat look rather larger than usual because of the new selftest, but
Florian and I consider that having tests soon into the tree is good to
improve coverage. If there's a different policy in this regard, please,
let me know.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-05 11:23:23 -08:00
Florian Westphal
947e492c0f netfilter: nft_compat: don't use refcount_inc on newly allocated entry
When I moved the refcount to refcount_t type I missed the fact that
refcount_inc() will result in use-after-free warning with
CONFIG_REFCOUNT_FULL=y builds.

The correct fix would be to init the reference count to 1 at allocation
time, but, unfortunately we cannot do this, as we can't undo that
in case something else fails later in the batch.

So only solution I see is to special-case the 'new entry' condition
and replace refcount_inc() with a "delayed" refcount_set(1) in this case,
as done here.

The .activate callback can be removed to simplify things, we only
need to make sure that deactivate() decrements/unlinks the entry
from the list at end of transaction phase (commit or abort).

Fixes: 12c44aba66 ("netfilter: nft_compat: use refcnt_t type for nft_xt reference count")
Reported-by: Jordan Glover <Golden_Miller83@protonmail.ch>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-05 14:10:33 +01:00
Eli Cooper
15df03c661 netfilter: ipv6: Don't preserve original oif for loopback address
Commit 508b09046c ("netfilter: ipv6: Preserve link scope traffic
original oif") made ip6_route_me_harder() keep the original oif for
link-local and multicast packets. However, it also affected packets
for the loopback address because it used rt6_need_strict().

REDIRECT rules in the OUTPUT chain rewrite the destination to loopback
address; thus its oif should not be preserved. This commit fixes the bug
that redirected local packets are being dropped. Actually the packet was
not exactly dropped; Instead it was sent out to the original oif rather
than lo. When a packet with daddr ::1 is sent to the router, it is
effectively dropped.

Fixes: 508b09046c ("netfilter: ipv6: Preserve link scope traffic original oif")
Signed-off-by: Eli Cooper <elicooper@gmx.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-05 14:10:33 +01:00
Marc Zyngier
c8101f7729 net: dsa: Fix lockdep false positive splat
Creating a macvtap on a DSA-backed interface results in the following
splat when lockdep is enabled:

[   19.638080] IPv6: ADDRCONF(NETDEV_CHANGE): lan0: link becomes ready
[   23.041198] device lan0 entered promiscuous mode
[   23.043445] device eth0 entered promiscuous mode
[   23.049255]
[   23.049557] ============================================
[   23.055021] WARNING: possible recursive locking detected
[   23.060490] 5.0.0-rc3-00013-g56c857a1b8d3 #118 Not tainted
[   23.066132] --------------------------------------------
[   23.071598] ip/2861 is trying to acquire lock:
[   23.076171] 00000000f61990cb (_xmit_ETHER){+...}, at: dev_set_rx_mode+0x1c/0x38
[   23.083693]
[   23.083693] but task is already holding lock:
[   23.089696] 00000000ecf0c3b4 (_xmit_ETHER){+...}, at: dev_uc_add+0x24/0x70
[   23.096774]
[   23.096774] other info that might help us debug this:
[   23.103494]  Possible unsafe locking scenario:
[   23.103494]
[   23.109584]        CPU0
[   23.112093]        ----
[   23.114601]   lock(_xmit_ETHER);
[   23.117917]   lock(_xmit_ETHER);
[   23.121233]
[   23.121233]  *** DEADLOCK ***
[   23.121233]
[   23.127325]  May be due to missing lock nesting notation
[   23.127325]
[   23.134315] 2 locks held by ip/2861:
[   23.137987]  #0: 000000003b766c72 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x338/0x4e0
[   23.146231]  #1: 00000000ecf0c3b4 (_xmit_ETHER){+...}, at: dev_uc_add+0x24/0x70
[   23.153757]
[   23.153757] stack backtrace:
[   23.158243] CPU: 0 PID: 2861 Comm: ip Not tainted 5.0.0-rc3-00013-g56c857a1b8d3 #118
[   23.166212] Hardware name: Globalscale Marvell ESPRESSOBin Board (DT)
[   23.172843] Call trace:
[   23.175358]  dump_backtrace+0x0/0x188
[   23.179116]  show_stack+0x14/0x20
[   23.182524]  dump_stack+0xb4/0xec
[   23.185928]  __lock_acquire+0x123c/0x1860
[   23.190048]  lock_acquire+0xc8/0x248
[   23.193724]  _raw_spin_lock_bh+0x40/0x58
[   23.197755]  dev_set_rx_mode+0x1c/0x38
[   23.201607]  dev_set_promiscuity+0x3c/0x50
[   23.205820]  dsa_slave_change_rx_flags+0x5c/0x70
[   23.210567]  __dev_set_promiscuity+0x148/0x1e0
[   23.215136]  __dev_set_rx_mode+0x74/0x98
[   23.219167]  dev_uc_add+0x54/0x70
[   23.222575]  macvlan_open+0x170/0x1d0
[   23.226336]  __dev_open+0xe0/0x160
[   23.229830]  __dev_change_flags+0x16c/0x1b8
[   23.234132]  dev_change_flags+0x20/0x60
[   23.238074]  do_setlink+0x2d0/0xc50
[   23.241658]  __rtnl_newlink+0x5f8/0x6e8
[   23.245601]  rtnl_newlink+0x50/0x78
[   23.249184]  rtnetlink_rcv_msg+0x360/0x4e0
[   23.253397]  netlink_rcv_skb+0xe8/0x130
[   23.257338]  rtnetlink_rcv+0x14/0x20
[   23.261012]  netlink_unicast+0x190/0x210
[   23.265043]  netlink_sendmsg+0x288/0x350
[   23.269075]  sock_sendmsg+0x18/0x30
[   23.272659]  ___sys_sendmsg+0x29c/0x2c8
[   23.276602]  __sys_sendmsg+0x60/0xb8
[   23.280276]  __arm64_sys_sendmsg+0x1c/0x28
[   23.284488]  el0_svc_common+0xd8/0x138
[   23.288340]  el0_svc_handler+0x24/0x80
[   23.292192]  el0_svc+0x8/0xc

This looks fairly harmless (no actual deadlock occurs), and is
fixed in a similar way to c6894dec8e ("bridge: fix lockdep
addr_list_lock false positive splat") by putting the addr_list_lock
in its own lockdep class.

Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-04 18:30:43 -08:00
Rundong Ge
17ab4f61b8 net: dsa: slave: Don't propagate flag changes on down slave interfaces
The unbalance of master's promiscuity or allmulti will happen after ifdown
and ifup a slave interface which is in a bridge.

When we ifdown a slave interface , both the 'dsa_slave_close' and
'dsa_slave_change_rx_flags' will clear the master's flags. The flags
of master will be decrease twice.
In the other hand, if we ifup the slave interface again, since the
slave's flags were cleared the 'dsa_slave_open' won't set the master's
flag, only 'dsa_slave_change_rx_flags' that triggered by 'br_add_if'
will set the master's flags. The flags of master is increase once.

Only propagating flag changes when a slave interface is up makes
sure this does not happen. The 'vlan_dev_change_rx_flags' had the
same problem and was fixed, and changes here follows that fix.

Fixes: 91da11f870 ("net: Distributed Switch Architecture protocol support")
Signed-off-by: Rundong Ge <rdong.ge@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-04 18:29:35 -08:00
Santosh Shilimkar
fd261ce6a3 rds: rdma: update rdma transport for tos
For RDMA transports, RDS TOS is an extension of IB QoS(Annex A13)
to provide clients the ability to segregate traffic flows for
different type of data. RDMA CM abstract it for ULPs using
rdma_set_service_type(). Internally, each traffic flow is
represented by a connection with all of its independent resources
like that of a normal connection, and is differentiated by
service type. In other words, there can be multiple qp connections
between an IP pair and each supports a unique service type.

The feature has been added from RDSv4.1 onwards and supports
rolling upgrades. RDMA connection metadata also carries the tos
information to set up SL on end to end context. The original
code was developed by Bang Nguyen in downstream kernel back in
2.6.32 kernel days and it has evolved over period of time.

Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
[yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
2019-02-04 14:59:13 -08:00
Santosh Shilimkar
56dc8bce9f rds: add transport specific tos_map hook
RDMA transport maps user tos to underline virtual lanes(VL)
for IB or DSCP values. RDMA CM transport abstract thats for
RDS. TCP transport makes use of default priority 0 and maps
all user tos values to it.

Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
[yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
2019-02-04 14:59:13 -08:00
Santosh Shilimkar
3eb450367d rds: add type of service(tos) infrastructure
RDS Service type (TOS) is user-defined and needs to be configured
via RDS IOCTL interface. It must be set before initiating any
traffic and once set the TOS can not be changed. All out-going
traffic from the socket will be associated with its TOS.

Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
[yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
2019-02-04 14:59:12 -08:00
Santosh Shilimkar
d021fabf52 rds: rdma: add consumer reject
For legacy protocol version incompatibility with non linux RDS,
consumer reject reason being used to convey it to peer. But the
choice of reject reason value as '1' was really poor.

Anyway for interoperability reasons with shipping products,
it needs to be supported. For any future versions, properly
encoded reject reason should to be used.

Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
[yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
2019-02-04 14:59:11 -08:00
Santosh Shilimkar
cdc306a5c9 rds: make v3.1 as compat version
Mark RDSv3.1 as compat version and add v4.1 version macro's.
Subsequent patches enable TOS(Type of Service) feature which is
tied with v4.1 for RDMA transport.

Reviewed-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
[yanjun.zhu@oracle.com: Adapted original patch with ipv6 changes]
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
2019-02-04 14:59:11 -08:00
Florian Westphal
ac02bcf9cc netfilter: ipv6: avoid indirect calls for IPV6=y case
indirect calls are only needed if ipv6 is a module.
Add helpers to abstract the v6ops indirections and use them instead.

fragment, reroute and route_input are kept as indirect calls.
The first two are not not used in hot path and route_input is only
used by bridge netfilter.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04 18:21:12 +01:00
Florian Westphal
960587285a netfilter: nat: remove module dependency on ipv6 core
nf_nat_ipv6 calls two ipv6 core functions, so add those to v6ops to avoid
the module dependency.

This is a prerequisite for merging ipv4 and ipv6 nat implementations.

Add wrappers to avoid the indirection if ipv6 is builtin.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04 18:20:19 +01:00
Petr Machata
c1f7e02979 net: cls_flower: Remove filter from mask before freeing it
In fl_change(), when adding a new rule (i.e. fold == NULL), a driver may
reject the new rule, for example due to resource exhaustion. By that
point, the new rule was already assigned a mask, and it was added to
that mask's hash table. The clean-up path that's invoked as a result of
the rejection however neglects to undo the hash table addition, and
proceeds to free the new rule, thus leaving a dangling pointer in the
hash table.

Fix by removing fnew from the mask's hash table before it is freed.

Fixes: 35cc3cefc4 ("net/sched: cls_flower: Reject duplicated rules also under skip_sw")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-04 09:19:14 -08:00
Ursula Braun
84b799a292 net/smc: correct state change for peer closing
If some kind of closing is received from the peer while still in
state SMC_INIT, it means the peer has had an active connection and
closed the socket quickly before listen_work finished. This should
not result in a shortcut from state SMC_INIT to state SMC_CLOSED.
This patch adds the socket to the accept queue in state
SMC_APPCLOSEWAIT1. The socket reaches state SMC_CLOSED once being
accepted and closed with smc_release().

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-04 09:11:19 -08:00
Ursula Braun
a5e04318c8 net/smc: delete rkey first before switching to unused
Once RMBs are flagged as unused they are candidates for reuse.
Thus the LLC DELETE RKEY operaton should be made before flagging
the RMB as unused.

Fixes: c7674c001b ("net/smc: unregister rkeys of unused buffer")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-04 09:11:19 -08:00
Ursula Braun
b8649efad8 net/smc: fix sender_free computation
In some scenarios a separate consumer cursor update is necessary.
The decision is made in smc_tx_consumer_cursor_update(). The
sender_free computation could be wrong:

The rx confirmed cursor is always smaller than or equal to the
rx producer cursor. The parameters in the smc_curs_diff() call
have to be exchanged, otherwise sender_free might even be negative.

And if more data arrives local_rx_ctrl.prod might be updated, enabling
a cursor difference between local_rx_ctrl.prod and rx confirmed cursor
larger than the RMB size. This case is not covered by smc_curs_diff().
Thus function smc_curs_diff_large() is introduced here.

If a recvmsg() is processed in parallel, local_tx_ctrl.cons might
change during smc_cdc_msg_send. Make sure rx_curs_confirmed is updated
with the actually sent local_tx_ctrl.cons value.

Fixes: e82f2e31f5 ("net/smc: optimize consumer cursor updates")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-04 09:11:19 -08:00
Ursula Braun
ad6f317f72 net/smc: preallocated memory for rdma work requests
The work requests for rdma writes are built in local variables within
function smc_tx_rdma_write(). This violates the rule that the work
request storage has to stay till the work request is confirmed by
a completion queue response.
This patch introduces preallocated memory for these work requests.
The storage is allocated, once a link (and thus a queue pair) is
established.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-04 09:11:19 -08:00
Pablo Neira Ayuso
f6ac858589 netfilter: nf_tables: unbind set in rule from commit path
Anonymous sets that are bound to rules from the same transaction trigger
a kernel splat from the abort path due to double set list removal and
double free.

This patch updates the logic to search for the transaction that is
responsible for creating the set and disable the set list removal and
release, given the rule is now responsible for this. Lookup is reverse
since the transaction that adds the set is likely to be at the tail of
the list.

Moreover, this patch adds the unbind step to deliver the event from the
commit path.  This should not be done from the worker thread, since we
have no guarantees of in-order delivery to the listener.

This patch removes the assumption that both activate and deactivate
callbacks need to be provided.

Fixes: cd5125d8f5 ("netfilter: nf_tables: split set destruction in deactivate and destroy phase")
Reported-by: Mikhail Morfikov <mmorfikov@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04 17:29:17 +01:00
wenxu
a46c52d9f2 netfilter: nft_tunnel: Add NFTA_TUNNEL_MODE options
nft "tunnel" expr match both the tun_info of RX and TX. This patch
provide the NFTA_TUNNEL_MODE to individually match the tun_info of
RX or TX.

Signed-off-by: wenxu <wenxu@ucloud.cn>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04 14:39:58 +01:00
Martynas Pumputis
4e35c1cb94 netfilter: nf_nat: skip nat clash resolution for same-origin entries
It is possible that two concurrent packets originating from the same
socket of a connection-less protocol (e.g. UDP) can end up having
different IP_CT_DIR_REPLY tuples which results in one of the packets
being dropped.

To illustrate this, consider the following simplified scenario:

1. Packet A and B are sent at the same time from two different threads
   by same UDP socket.  No matching conntrack entry exists yet.
   Both packets cause allocation of a new conntrack entry.
2. get_unique_tuple gets called for A.  No clashing entry found.
   conntrack entry for A is added to main conntrack table.
3. get_unique_tuple is called for B and will find that the reply
   tuple of B is already taken by A.
   It will allocate a new UDP source port for B to resolve the clash.
4. conntrack entry for B cannot be added to main conntrack table
   because its ORIGINAL direction is clashing with A and the REPLY
   directions of A and B are not the same anymore due to UDP source
   port reallocation done in step 3.

This patch modifies nf_conntrack_tuple_taken so it doesn't consider
colliding reply tuples if the IP_CT_DIR_ORIGINAL tuples are equal.

[ Florian: simplify patch to not use .allow_clash setting
  and always ignore identical flows ]

Signed-off-by: Martynas Pumputis <martynas@weave.works>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2019-02-04 14:26:11 +01:00
David S. Miller
ff7653f94b net: Fix fall through warning in y2038 tstamp changes.
net/core/sock.c: In function 'sock_setsockopt':
net/core/sock.c:914:3: warning: this statement may fall through [-Wimplicit-fallthrough=]
   sock_set_flag(sk, SOCK_TSTAMP_NEW);
   ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
net/core/sock.c:915:2: note: here
  case SO_TIMESTAMPING_OLD:
  ^~~~

Fixes: 9718475e69 ("socket: Add SO_TIMESTAMPING_NEW")
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 20:25:31 -08:00
Masahiro Yamada
303a339f30 bpfilter: remove extra header search paths for bpfilter_umh
Currently, the header search paths -Itools/include and
-Itools/include/uapi are not used. Let's drop the unused code.

We can remove -I. too by fixing up one C file.

Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 20:11:43 -08:00
Xin Long
cfe4bd7a25 sctp: check and update stream->out_curr when allocating stream_out
Now when using stream reconfig to add out streams, stream->out
will get re-allocated, and all old streams' information will
be copied to the new ones and the old ones will be freed.

So without stream->out_curr updated, next time when trying to
send from stream->out_curr stream, a panic would be caused.

This patch is to check and update stream->out_curr when
allocating stream_out.

v1->v2:
  - define fa_index() to get elem index from stream->out_curr.
v2->v3:
  - repost with no change.

Fixes: 5bbbbe32a4 ("sctp: introduce stream scheduler foundations")
Reported-by: Ying Xu <yinxu@redhat.com>
Reported-by: syzbot+e33a3a138267ca119c7d@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 14:27:47 -08:00
Florian Fainelli
9fb20801da net: Fix ip_mc_{dec,inc}_group allocation context
After 4effd28c12 ("bridge: join all-snoopers multicast address"), I
started seeing the following sleep in atomic warnings:

[   26.763893] BUG: sleeping function called from invalid context at mm/slab.h:421
[   26.771425] in_atomic(): 1, irqs_disabled(): 0, pid: 1658, name: sh
[   26.777855] INFO: lockdep is turned off.
[   26.781916] CPU: 0 PID: 1658 Comm: sh Not tainted 5.0.0-rc4 #20
[   26.787943] Hardware name: BCM97278SV (DT)
[   26.792118] Call trace:
[   26.794645]  dump_backtrace+0x0/0x170
[   26.798391]  show_stack+0x24/0x30
[   26.801787]  dump_stack+0xa4/0xe4
[   26.805182]  ___might_sleep+0x208/0x218
[   26.809102]  __might_sleep+0x78/0x88
[   26.812762]  kmem_cache_alloc_trace+0x64/0x28c
[   26.817301]  igmp_group_dropped+0x150/0x230
[   26.821573]  ip_mc_dec_group+0x1b0/0x1f8
[   26.825585]  br_ip4_multicast_leave_snoopers.isra.11+0x174/0x190
[   26.831704]  br_multicast_toggle+0x78/0xcc
[   26.835887]  store_bridge_parm+0xc4/0xfc
[   26.839894]  multicast_snooping_store+0x3c/0x4c
[   26.844517]  dev_attr_store+0x44/0x5c
[   26.848262]  sysfs_kf_write+0x50/0x68
[   26.852006]  kernfs_fop_write+0x14c/0x1b4
[   26.856102]  __vfs_write+0x60/0x190
[   26.859668]  vfs_write+0xc8/0x168
[   26.863059]  ksys_write+0x70/0xc8
[   26.866449]  __arm64_sys_write+0x24/0x30
[   26.870458]  el0_svc_common+0xa0/0x11c
[   26.874291]  el0_svc_handler+0x38/0x70
[   26.878120]  el0_svc+0x8/0xc

while toggling the bridge's multicast_snooping attribute dynamically.

Pass a gfp_t down to igmpv3_add_delrec(), introduce
__igmp_group_dropped() and introduce __ip_mc_dec_group() to take a gfp_t
argument.

Similarly introduce ____ip_mc_inc_group() and __ip_mc_inc_group() to
allow caller to specify gfp_t.

IPv6 part of the patch appears fine.

Fixes: 4effd28c12 ("bridge: join all-snoopers multicast address")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 12:11:12 -08:00
Jakub Kicinski
bff5731d43 net: devlink: report cell size of shared buffers
Shared buffer allocation is usually done in cell increments.
Drivers will either round up the allocation or refuse the
configuration if it's not an exact multiple of cell size.
Drivers know exactly the cell size of shared buffer, so help
out users by providing this information in dumps.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 11:25:34 -08:00
Deepa Dinamani
a9beb86ae6 sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW
Add new socket timeout options that are y2038 safe.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Cc: ccaulfie@redhat.com
Cc: davem@davemloft.net
Cc: deller@gmx.de
Cc: paulus@samba.org
Cc: ralf@linux-mips.org
Cc: rth@twiddle.net
Cc: cluster-devel@redhat.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 11:17:31 -08:00
Deepa Dinamani
45bdc66159 socket: Rename SO_RCVTIMEO/ SO_SNDTIMEO with _OLD suffixes
SO_RCVTIMEO and SO_SNDTIMEO socket options use struct timeval
as the time format. struct timeval is not y2038 safe.
The subsequent patches in the series add support for new socket
timeout options with _NEW suffix that will use y2038 safe
data structures. Although the existing struct timeval layout
is sufficiently wide to represent timeouts, because of the way
libc will interpret time_t based on user defined flag, these
new flags provide a way of having a structure that is the same
for all architectures consistently.
Rename the existing options with _OLD suffix forms so that the
right option is enabled for userspace applications according
to the architecture and time_t definition of libc.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Cc: ccaulfie@redhat.com
Cc: deller@gmx.de
Cc: paulus@samba.org
Cc: ralf@linux-mips.org
Cc: rth@twiddle.net
Cc: cluster-devel@redhat.com
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-mips@vger.kernel.org
Cc: linux-parisc@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 11:17:31 -08:00
Deepa Dinamani
9718475e69 socket: Add SO_TIMESTAMPING_NEW
Add SO_TIMESTAMPING_NEW variant of socket timestamp options.
This is the y2038 safe versions of the SO_TIMESTAMPING_OLD
for all architectures.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Cc: chris@zankel.net
Cc: fenghua.yu@intel.com
Cc: rth@twiddle.net
Cc: tglx@linutronix.de
Cc: ubraun@linux.ibm.com
Cc: linux-alpha@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-s390@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: sparclinux@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 11:17:31 -08:00
Deepa Dinamani
887feae36a socket: Add SO_TIMESTAMP[NS]_NEW
Add SO_TIMESTAMP_NEW and SO_TIMESTAMPNS_NEW variants of
socket timestamp options.
These are the y2038 safe versions of the SO_TIMESTAMP_OLD
and SO_TIMESTAMPNS_OLD for all architectures.

Note that the format of scm_timestamping.ts[0] is not changed
in this patch.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Cc: jejb@parisc-linux.org
Cc: ralf@linux-mips.org
Cc: rth@twiddle.net
Cc: linux-alpha@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 11:17:31 -08:00
Deepa Dinamani
13c6ee2a92 socket: Use old_timeval types for socket timestamps
As part of y2038 solution, all internal uses of
struct timeval are replaced by struct __kernel_old_timeval
and struct compat_timeval by struct old_timeval32.
Make socket timestamps use these new types.

This is mainly to be able to verify that the kernel build
is y2038 safe when such non y2038 safe types are not
supported anymore.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Cc: isdn@linux-pingi.de
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 11:17:30 -08:00
Deepa Dinamani
7f1bc6e95d sockopt: Rename SO_TIMESTAMP* to SO_TIMESTAMP*_OLD
SO_TIMESTAMP, SO_TIMESTAMPNS and SO_TIMESTAMPING options, the
way they are currently defined, are not y2038 safe.
Subsequent patches in the series add new y2038 safe versions
of these options which provide 64 bit timestamps on all
architectures uniformly.
Hence, rename existing options with OLD tag suffixes.

Also note that kernel will not use the untagged SO_TIMESTAMP*
and SCM_TIMESTAMP* options internally anymore.

Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Cc: deller@gmx.de
Cc: dhowells@redhat.com
Cc: jejb@parisc-linux.org
Cc: ralf@linux-mips.org
Cc: rth@twiddle.net
Cc: linux-afs@lists.infradead.org
Cc: linux-alpha@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-parisc@vger.kernel.org
Cc: linux-rdma@vger.kernel.org
Cc: sparclinux@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 11:17:30 -08:00
Arnd Bergmann
fe0c72f3db socket: move compat timeout handling into sock.c
This is a cleanup to prepare for the addition of 64-bit time_t
in O_SNDTIMEO/O_RCVTIMEO. The existing compat handler seems
unnecessarily complex and error-prone, moving it all into the
main setsockopt()/getsockopt() implementation requires half
as much code and is easier to extend.

32-bit user space can now use old_timeval32 on both 32-bit
and 64-bit machines, while 64-bit code can use
__old_kernel_timeval.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Deepa Dinamani <deepa.kernel@gmail.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 11:17:30 -08:00
Stefano Garzarella
85965487ab vsock/virtio: reset connected sockets on device removal
When the virtio transport device disappear, we should reset all
connected sockets in order to inform the users.

Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 11:06:25 -08:00
Stefano Garzarella
22b5c0b63f vsock/virtio: fix kernel panic after device hot-unplug
virtio_vsock_remove() invokes the vsock_core_exit() also if there
are opened sockets for the AF_VSOCK protocol family. In this way
the vsock "transport" pointer is set to NULL, triggering the
kernel panic at the first socket activity.

This patch move the vsock_core_init()/vsock_core_exit() in the
virtio_vsock respectively in module_init and module_exit functions,
that cannot be invoked until there are open sockets.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1609699
Reported-by: Yan Fu <yafu@redhat.com>
Signed-off-by: Stefano Garzarella <sgarzare@redhat.com>
Acked-by: Stefan Hajnoczi <stefanha@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 11:06:25 -08:00
Edward Chron
1d2f4ebbbe ipv4/igmp: Don't drop IGMP pkt with zeros src addr
Don't drop IGMP packets with a source address of all zeros which are
IGMP proxy reports. This is documented in Section 2.1.1 IGMP
Forwarding Rules of RFC 4541 IGMP and MLD Snooping Switches
Considerations.

Signed-off-by: Edward Chron <echron@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-03 09:54:05 -08:00
David S. Miller
beb73559bf Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Alexei Starovoitov says:

====================
pull-request: bpf-next 2019-02-01

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) introduce bpf_spin_lock, from Alexei.

2) convert xdp samples to libbpf, from Maciej.

3) skip verifier tests for unsupported program/map types, from Stanislav.

4) powerpc64 JIT support for BTF line info, from Sandipan.

5) assorted fixed, from Valdis, Jesper, Jiong.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01 20:12:18 -08:00
Jakub Kicinski
ddb6e99e2d ethtool: add compat for devlink info
If driver did not fill the fw_version field, try to call into
the new devlink get_info op and collect the versions that way.
We assume ethtool was always reporting running versions.

v4:
 - use IS_REACHABLE() to avoid problems with DEVLINK=m (kbuildbot).
v3 (Jiri):
 - do a dump and then parse it instead of special handling;
 - concatenate all versions (well, all that fit :)).

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01 15:30:31 -08:00
Jakub Kicinski
fc6fae7dd9 devlink: add version reporting to devlink info API
ethtool -i has a few fixed-size fields which can be used to report
firmware version and expansion ROM version. Unfortunately, modern
hardware has more firmware components. There is usually some
datapath microcode, management controller, PXE drivers, and a
CPLD load. Running ethtool -i on modern controllers reveals the
fact that vendors cram multiple values into firmware version field.

Here are some examples from systems I could lay my hands on quickly:

tg3:  "FFV20.2.17 bc 5720-v1.39"
i40e: "6.01 0x800034a4 1.1747.0"
nfp:  "0.0.3.5 0.25 sriov-2.1.16 nic"

Add a new devlink API to allow retrieving multiple versions, and
provide user-readable name for those versions.

While at it break down the versions into three categories:
 - fixed - this is the board/fixed component version, usually vendors
           report information like the board version in the PCI VPD,
           but it will benefit from naming and common API as well;
 - running - this is the running firmware version;
 - stored - this is firmware in the flash, after firmware update
            this value will reflect the flashed version, while the
            running version may only be updated after reboot.

v3:
 - add per-type helpers instead of using the special argument (Jiri).
RFCv2:
 - remove the nesting in attr DEVLINK_ATTR_INFO_VERSIONS (now
   versions are mixed with other info attrs)l
 - have the driver report versions from the same callback as
   other info.

Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2019-02-01 15:30:30 -08:00