Commit Graph

53672 Commits

Author SHA1 Message Date
Willem de Bruijn
d0e13a1488 flow_dissector: lookup netns by skb->sk if skb->dev is NULL
BPF flow dissectors are configured per network namespace.
__skb_flow_dissect looks up the netns through dev_net(skb->dev).

In some dissector paths skb->dev is NULL, such as for Unix sockets.
In these cases fall back to looking up the netns by socket.

Analyzing the codepaths leading to __skb_flow_dissect I did not find
a case where both skb->dev and skb->sk are NULL. Warn and fall back to
standard flow dissector if one is found.

Fixes: d58e468b11 ("flow_dissector: implements flow dissector BPF hook")
Reported-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-25 17:31:19 +02:00
Paolo Abeni
ccfec9e5cb ip_tunnel: be careful when accessing the inner header
Cong noted that we need the same checks introduced by commit 76c0ddd8c3
("ip6_tunnel: be careful when accessing the inner header")
even for ipv4 tunnels.

Fixes: c544193214 ("GRE: Refactor GRE tunneling code.")
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-24 12:27:04 -07:00
Vakul Garg
4128c0cfb1 tls: Fixed uninitialised vars warning
In tls_sw_sendmsg() and tls_sw_sendpage(), it is possible that the
uninitialised variable 'ret' gets passed to sk_stream_error(). So
initialise local variable 'ret' to '0. The warnings were detected by
'smatch' tool.

Fixes: a42055e8d2 ("net/tls: Add support for async encryption")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-24 12:24:52 -07:00
Vakul Garg
9932a29ab1 net/tls: Fixed race condition in async encryption
On processors with multi-engine crypto accelerators, it is possible that
multiple records get encrypted in parallel and their encryption
completion is notified to different cpus in multicore processor. This
leads to the situation where tls_encrypt_done() starts executing in
parallel on different cores. In current implementation, encrypted
records are queued to tx_ready_list in tls_encrypt_done(). This requires
addition to linked list 'tx_ready_list' to be protected. As
tls_decrypt_done() could be executing in irq content, it is not possible
to protect linked list addition operation using a lock.

To fix the problem, we remove linked list addition operation from the
irq context. We do tx_ready_list addition/removal operation from
application context only and get rid of possible multiple access to
the linked list. Before starting encryption on the record, we add it to
the tail of tx_ready_list. To prevent tls_tx_records() from transmitting
it, we mark the record with a new flag 'tx_ready' in 'struct tls_rec'.
When record encryption gets completed, tls_encrypt_done() has to only
update the 'tx_ready' flag to true & linked list add operation is not
required.

The changed logic brings some other side benefits. Since the records
are always submitted in tls sequence number order for encryption, the
tx_ready_list always remains sorted and addition of new records to it
does not have to traverse the linked list.

Lastly, we renamed tx_ready_list in 'struct tls_sw_context_tx' to
'tx_list'. This is because now, the some of the records at the tail are
not ready to transmit.

Fixes: a42055e8d2 ("net/tls: Add support for async encryption")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-24 12:24:24 -07:00
Roopa Prabhu
fc6e8073f3 neighbour: send netlink notification if NTF_ROUTER changes
send netlink notification if neigh_update results in NTF_ROUTER
change and if NEIGH_UPDATE_F_ISROUTER is on. Also move the
NTF_ROUTER change function into a helper.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-24 12:21:32 -07:00
Roopa Prabhu
f7aa74e483 neighbour: allow admin to set NTF_ROUTER
This patch allows admin setting of NTF_ROUTER flag
on a neighbour entry. This enables external control
plane (like bgp evpn) to manage neigh entries with
NTF_ROUTER flag.

Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-24 12:21:32 -07:00
Saif Hasan
d8e2262a50 mpls: allow routes on ip6gre devices
Summary:

This appears to be necessary and sufficient change to enable `MPLS` on
`ip6gre` tunnels (RFC4023).

This diff allows IP6GRE devices to be recognized by MPLS kernel module
and hence user can configure interface to accept packets with mpls
headers as well setup mpls routes on them.

Test Plan:

Test plan consists of multiple containers connected via GRE-V6 tunnel.
Then carrying out testing steps as below.

- Carry out necessary sysctl settings on all containers

```
sysctl -w net.mpls.platform_labels=65536
sysctl -w net.mpls.ip_ttl_propagate=1
sysctl -w net.mpls.conf.lo.input=1
```

- Establish IP6GRE tunnels

```
ip -6 tunnel add name if_1_2_1 mode ip6gre \
  local 2401:db00:21:6048:feed:0::1 \
  remote 2401:db00:21:6048:feed:0::2 key 1
ip link set dev if_1_2_1 up
sysctl -w net.mpls.conf.if_1_2_1.input=1
ip -4 addr add 169.254.0.2/31 dev if_1_2_1 scope link

ip -6 tunnel add name if_1_3_1 mode ip6gre \
  local 2401:db00:21:6048:feed:0::1 \
  remote 2401:db00:21:6048:feed:0::3 key 1
ip link set dev if_1_3_1 up
sysctl -w net.mpls.conf.if_1_3_1.input=1
ip -4 addr add 169.254.0.4/31 dev if_1_3_1 scope link
```

- Install MPLS encap rules on node-1 towards node-2

```
ip route add 192.168.0.11/32 nexthop encap mpls 32/64 \
  via inet 169.254.0.3 dev if_1_2_1
```

- Install MPLS forwarding rules on node-2 and node-3
```
// node2
ip -f mpls route add 32 via inet 169.254.0.7 dev if_2_4_1

// node3
ip -f mpls route add 64 via inet 169.254.0.12 dev if_4_3_1
```

- Ping 192.168.0.11 (node4) from 192.168.0.1 (node1) (where routing
  towards 192.168.0.1 is via IP route directly towards node1 from node4)
```
ping 192.168.0.11
```

- tcpdump on interface to capture ping packets wrapped within MPLS
  header which inturn wrapped within IP6GRE header

```
16:43:41.121073 IP6
  2401:db00:21:6048:feed::1 > 2401:db00:21:6048:feed::2:
  DSTOPT GREv0, key=0x1, length 100:
  MPLS (label 32, exp 0, ttl 255) (label 64, exp 0, [S], ttl 255)
  IP 192.168.0.1 > 192.168.0.11:
  ICMP echo request, id 1208, seq 45, length 64

0x0000:  6000 2cdb 006c 3c3f 2401 db00 0021 6048  `.,..l<?$....!`H
0x0010:  feed 0000 0000 0001 2401 db00 0021 6048  ........$....!`H
0x0020:  feed 0000 0000 0002 2f00 0401 0401 0100  ......../.......
0x0030:  2000 8847 0000 0001 0002 00ff 0004 01ff  ...G............
0x0040:  4500 0054 3280 4000 ff01 c7cb c0a8 0001  E..T2.@.........
0x0050:  c0a8 000b 0800 a8d7 04b8 002d 2d3c a05b  ...........--<.[
0x0060:  0000 0000 bcd8 0100 0000 0000 1011 1213  ................
0x0070:  1415 1617 1819 1a1b 1c1d 1e1f 2021 2223  .............!"#
0x0080:  2425 2627 2829 2a2b 2c2d 2e2f 3031 3233  $%&'()*+,-./0123
0x0090:  3435 3637                                4567
```

Signed-off-by: Saif Hasan <has@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-24 12:19:27 -07:00
Eelco Chaudron
28169abadb net/sched: Add hardware specific counters to TC actions
Add additional counters that will store the bytes/packets processed by
hardware. These will be exported through the netlink interface for
displaying by the iproute2 tc tool

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-24 12:18:42 -07:00
Eelco Chaudron
5e111210a4 net/core: Add new basic hardware counter
Add a new hardware specific basic counter, TCA_STATS_BASIC_HW. This can
be used to count packets/bytes processed by hardware offload.

Signed-off-by: Eelco Chaudron <echaudro@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-24 12:18:42 -07:00
Eric Dumazet
ac3d9dd034 netpoll: make ndo_poll_controller() optional
As diagnosed by Song Liu, ndo_poll_controller() can
be very dangerous on loaded hosts, since the cpu
calling ndo_poll_controller() might steal all NAPI
contexts (for all RX/TX queues of the NIC). This capture
can last for unlimited amount of time, since one
cpu is generally not able to drain all the queues under load.

It seems that all networking drivers that do use NAPI
for their TX completions, should not provide a ndo_poll_controller().

NAPI drivers have netpoll support already handled
in core networking stack, since netpoll_poll_dev()
uses poll_napi(dev) to iterate through registered
NAPI contexts for a device.

This patch allows netpoll_poll_dev() to process NAPI
contexts even for drivers not providing ndo_poll_controller(),
allowing for following patches in NAPI drivers.

Also we export netpoll_poll_dev() so that it can be called
by bonding/team drivers in following patches.

Reported-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Tested-by: Song Liu <songliubraving@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-23 21:55:24 -07:00
David S. Miller
16fdf8ba98 rds: Fix build regression.
Use DECLARE_* not DEFINE_*

Fixes: 8360ed6745 ("RDS: IB: Use DEFINE_PER_CPU_SHARED_ALIGNED for rds_ib_stats")
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-23 12:25:15 -07:00
Maciej Żenczykowski
474ff26008 net-ethtool: ETHTOOL_GUFO did not and should not require CAP_NET_ADMIN
So it should not fail with EPERM even though it is no longer implemented...

This is a fix for:
  (userns)$ egrep ^Cap /proc/self/status
  CapInh: 0000003fffffffff
  CapPrm: 0000003fffffffff
  CapEff: 0000003fffffffff
  CapBnd: 0000003fffffffff
  CapAmb: 0000003fffffffff

  (userns)$ tcpdump -i usb_rndis0
  tcpdump: WARNING: usb_rndis0: SIOCETHTOOL(ETHTOOL_GUFO) ioctl failed: Operation not permitted
  Warning: Kernel filter failed: Bad file descriptor
  tcpdump: can't remove kernel filter: Bad file descriptor

With this change it returns EOPNOTSUPP instead of EPERM.

See also https://github.com/the-tcpdump-group/libpcap/issues/689

Fixes: 08a00fea6d "net: Remove references to NETIF_F_UFO from ethtool."
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Maciej Żenczykowski <maze@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-22 17:11:26 -07:00
Nathan Chancellor
5b9b0a80aa decnet: Remove unnecessary check for dev->name
Clang warns that the address of a pointer will always evaluated as true
in a boolean context.

net/decnet/dn_dev.c:1366:10: warning: address of array 'dev->name' will
always evaluate to 'true' [-Wpointer-bool-conversion]
                                dev->name ? dev->name : "???",
                                ~~~~~^~~~ ~
1 warning generated.

Link: https://github.com/ClangBuiltLinux/linux/issues/116
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 19:48:36 -07:00
Peter Oskolkov
8361962392 net/ipfrag: let ip[6]frag_high_thresh in ns be higher than in init_net
Currently, ip[6]frag_high_thresh sysctl values in new namespaces are
hard-limited to those of the root/init ns.

There are at least two use cases when it would be desirable to
set the high_thresh values higher in a child namespace vs the global hard
limit:

- a security/ddos protection policy may lower the thresholds in the
  root/init ns but allow for a special exception in a child namespace
- testing: a test running in a namespace may want to set these
  thresholds higher in its namespace than what is in the root/init ns

The new behavior:

 # ip netns add testns
 # ip netns exec testns bash

 # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
 net.ipv4.ipfrag_high_thresh = 9000000

 # sysctl net.ipv4.ipfrag_high_thresh
 net.ipv4.ipfrag_high_thresh = 9000000

 # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
 net.ipv6.ip6frag_high_thresh = 9000000

 # sysctl net.ipv6.ip6frag_high_thresh
 net.ipv6.ip6frag_high_thresh = 9000000

The old behavior:

 # ip netns add testns
 # ip netns exec testns bash

 # sysctl -w net.ipv4.ipfrag_high_thresh=9000000
 net.ipv4.ipfrag_high_thresh = 9000000

 # sysctl net.ipv4.ipfrag_high_thresh
 net.ipv4.ipfrag_high_thresh = 4194304

 # sysctl -w net.ipv6.ip6frag_high_thresh=9000000
 net.ipv6.ip6frag_high_thresh = 9000000

 # sysctl net.ipv6.ip6frag_high_thresh
 net.ipv6.ip6frag_high_thresh = 4194304

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 19:45:52 -07:00
Peter Oskolkov
2475f59c61 ipv6: discard IP frag queue on more errors
This is similar to how ipv4 now behaves:
commit 0ff89efb52 ("ip: fail fast on IP defrag errors").

Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 19:45:52 -07:00
Nathan Chancellor
8360ed6745 RDS: IB: Use DEFINE_PER_CPU_SHARED_ALIGNED for rds_ib_stats
Clang warns when two declarations' section attributes don't match.

net/rds/ib_stats.c:40:1: warning: section does not match previous
declaration [-Wsection]
DEFINE_PER_CPU_SHARED_ALIGNED(struct rds_ib_statistics, rds_ib_stats);
^
./include/linux/percpu-defs.h:142:2: note: expanded from macro
'DEFINE_PER_CPU_SHARED_ALIGNED'
        DEFINE_PER_CPU_SECTION(type, name,
PER_CPU_SHARED_ALIGNED_SECTION) \
        ^
./include/linux/percpu-defs.h:93:9: note: expanded from macro
'DEFINE_PER_CPU_SECTION'
        extern __PCPU_ATTRS(sec) __typeof__(type) name;
\
               ^
./include/linux/percpu-defs.h:49:26: note: expanded from macro
'__PCPU_ATTRS'
        __percpu __attribute__((section(PER_CPU_BASE_SECTION sec)))
\
                                ^
net/rds/ib.h:446:1: note: previous attribute is here
DECLARE_PER_CPU(struct rds_ib_statistics, rds_ib_stats);
^
./include/linux/percpu-defs.h:111:2: note: expanded from macro
'DECLARE_PER_CPU'
        DECLARE_PER_CPU_SECTION(type, name, "")
        ^
./include/linux/percpu-defs.h:87:9: note: expanded from macro
'DECLARE_PER_CPU_SECTION'
        extern __PCPU_ATTRS(sec) __typeof__(type) name
               ^
./include/linux/percpu-defs.h:49:26: note: expanded from macro
'__PCPU_ATTRS'
        __percpu __attribute__((section(PER_CPU_BASE_SECTION sec)))
\
                                ^
1 warning generated.

The initial definition was added in commit ec16227e14 ("RDS/IB:
Infiniband transport") and the cache aligned definition was added in
commit e6babe4cc4 ("RDS/IB: Stats and sysctls") right after. The
definition probably should have been updated in net/rds/ib.h, which is
what this patch does.

Link: https://github.com/ClangBuiltLinux/linux/issues/114
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 19:44:07 -07:00
Eric Dumazet
075e264fa3 net/ipv4: avoid compile error in fib_info_nh_uses_dev
net/ipv4/fib_frontend.c: In function 'fib_info_nh_uses_dev':
net/ipv4/fib_frontend.c:322:6: error: unused variable 'ret' [-Werror=unused-variable]
cc1: all warnings being treated as errors

Fixes: 78f2756c5f ("net/ipv4: Move device validation to helper")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsahern@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 19:41:30 -07:00
Eric Dumazet
90caf67b01 net_sched: sch_fq: remove dead code dealing with retransmits
With the earliest departure time model, we no longer plan
special casing TCP retransmits. We therefore remove dead
code (since most compilers understood skb_is_retransmit()
was false)

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 19:38:00 -07:00
Eric Dumazet
c092dd5f4a tcp: switch tcp_internal_pacing() to tcp_wstamp_ns
Now TCP keeps track of tcp_wstamp_ns, recording the earliest
departure time of next packet, we can remove duplicate code
from tcp_internal_pacing()

This removes one ktime_get_tai_ns() call, and a divide.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 19:38:00 -07:00
Eric Dumazet
ab408b6dc7 tcp: switch tcp and sch_fq to new earliest departure time model
TCP keeps track of tcp_wstamp_ns by itself, meaning sch_fq
no longer has to do it.

Thanks to this model, TCP can get more accurate RTT samples,
since pacing no longer inflates them.

This has the nice effect of removing some delays caused by FQ
quantum mechanism, causing inflated max/P99 latencies.

Also we might relax TCP Small Queue tight limits in the future,
since this new model allow TCP to build bigger batches, since
sch_fq (or a device with earliest departure time offload) ensure
these packets will be delivered on time.

Note that other protocols are not converted (they will probably
never be) so sch_fq has still support for SO_MAX_PACING_RATE

Tested:

Test showing FQ pacing quantum artifact for low-rate flows,
adding unexpected throttles for RPC flows, inflating max and P99 latencies.

The parameters chosen here are to show what happens typically when
a TCP flow has a reduced pacing rate (this can be caused by a reduced
cwin after few losses, or/and rtt above few ms)

MIBS="MIN_LATENCY,MEAN_LATENCY,MAX_LATENCY,P99_LATENCY,STDDEV_LATENCY"
Before :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
 Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
19,82.78,5279,3825,482.02

After :
$ netperf -H 10.246.7.133 -t TCP_RR -Cc -T6,6 -- -q 2000000 -r 100,100 -o $MIBS
MIGRATED TCP REQUEST/RESPONSE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 10.246.7.133 () port 0 AF_INET : first burst 0 : cpu bind
Minimum Latency Microseconds,Mean Latency Microseconds,Maximum Latency Microseconds,99th Percentile Latency Microseconds,Stddev Latency Microseconds
20,49.94,128,63,3.18

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 19:38:00 -07:00
Eric Dumazet
fd2bca2aa7 tcp: switch internal pacing timer to CLOCK_TAI
Next patch will use tcp_wstamp_ns to feed internal
TCP pacing timer, so switch to CLOCK_TAI to share same base.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 19:37:59 -07:00
Eric Dumazet
d3edd06ea8 tcp: provide earliest departure time in skb->tstamp
Switch internal TCP skb->skb_mstamp to skb->skb_mstamp_ns,
from usec units to nsec units.

Do not clear skb->tstamp before entering IP stacks in TX,
so that qdisc or devices can implement pacing based on the
earliest departure time instead of socket sk->sk_pacing_rate

Packets are fed with tcp_wstamp_ns, and following patch
will update tcp_wstamp_ns when both TCP and sch_fq switch to
the earliest departure time mechanism.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 19:37:59 -07:00
Eric Dumazet
9799ccb0e9 tcp: add tcp_wstamp_ns socket field
TCP will soon provide earliest departure time on TX skbs.
It needs to track this in a new variable.

tcp_mstamp_refresh() needs to update this variable, and
became too big to stay an inline.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 19:37:59 -07:00
Eric Dumazet
142537e419 net_sched: sch_fq: switch to CLOCK_TAI
TCP will soon provide per skb->tstamp with earliest departure time,
so that sch_fq does not have to determine departure time by looking
at socket sk_pacing_rate.

We chose in linux-4.19 CLOCK_TAI as the clock base for transports,
qdiscs, and NIC offloads.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 19:37:59 -07:00
Eric Dumazet
2fd66ffba5 tcp: introduce tcp_skb_timestamp_us() helper
There are few places where TCP reads skb->skb_mstamp expecting
a value in usec unit.

skb->tstamp (aka skb->skb_mstamp) will soon store CLOCK_TAI nsec value.

Add tcp_skb_timestamp_us() to provide proper conversion when needed.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 19:37:59 -07:00
Dan Carpenter
83fe9a9661 devlink: double free in devlink_resource_fill()
Smatch reports that devlink_dpipe_send_and_alloc_skb() frees the skb
on error so this is a double free.  We fixed a bunch of these bugs in
commit 7fe4d6dcbc ("devlink: Remove redundant free on error path") but
we accidentally overlooked this one.

Fixes: d9f9b9a4d0 ("devlink: Add support for resource abstraction")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 19:19:07 -07:00
Vakul Garg
a42055e8d2 net/tls: Add support for async encryption of records for performance
In current implementation, tls records are encrypted & transmitted
serially. Till the time the previously submitted user data is encrypted,
the implementation waits and on finish starts transmitting the record.
This approach of encrypt-one record at a time is inefficient when
asynchronous crypto accelerators are used. For each record, there are
overheads of interrupts, driver softIRQ scheduling etc. Also the crypto
accelerator sits idle most of time while an encrypted record's pages are
handed over to tcp stack for transmission.

This patch enables encryption of multiple records in parallel when an
async capable crypto accelerator is present in system. This is achieved
by allowing the user space application to send more data using sendmsg()
even while previously issued data is being processed by crypto
accelerator. This requires returning the control back to user space
application after submitting encryption request to accelerator. This
also means that zero-copy mode of encryption cannot be used with async
accelerator as we must be done with user space application buffer before
returning from sendmsg().

There can be multiple records in flight to/from the accelerator. Each of
the record is represented by 'struct tls_rec'. This is used to store the
memory pages for the record.

After the records are encrypted, they are added in a linked list called
tx_ready_list which contains encrypted tls records sorted as per tls
sequence number. The records from tx_ready_list are transmitted using a
newly introduced function called tls_tx_records(). The tx_ready_list is
polled for any record ready to be transmitted in sendmsg(), sendpage()
after initiating encryption of new tls records. This achieves parallel
encryption and transmission of records when async accelerator is
present.

There could be situation when crypto accelerator completes encryption
later than polling of tx_ready_list by sendmsg()/sendpage(). Therefore
we need a deferred work context to be able to transmit records from
tx_ready_list. The deferred work context gets scheduled if applications
are not sending much data through the socket. If the applications issue
sendmsg()/sendpage() in quick succession, then the scheduling of
tx_work_handler gets cancelled as the tx_ready_list would be polled from
application's context itself. This saves scheduling overhead of deferred
work.

The patch also brings some side benefit. We are able to get rid of the
concept of CLOSED record. This is because the records once closed are
either encrypted and then placed into tx_ready_list or if encryption
fails, the socket error is set. This simplifies the kernel tls
sendpath. However since tls_device.c is still using macros, accessory
functions for CLOSED records have been retained.

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 19:17:34 -07:00
Jeff Barnhill
86f9bd1ff6 net/ipv6: Display all addresses in output of /proc/net/if_inet6
The backend handling for /proc/net/if_inet6 in addrconf.c doesn't properly
handle starting/stopping the iteration.  The problem is that at some point
during the iteration, an overflow is detected and the process is
subsequently stopped.  The item being shown via seq_printf() when the
overflow occurs is not actually shown, though.  When start() is
subsequently called to resume iterating, it returns the next item, and
thus the item that was being processed when the overflow occurred never
gets printed.

Alter the meaning of the private data member "offset".  Currently, when it
is not 0 (which only happens at the very beginning), "offset" represents
the next hlist item to be printed.  After this change, "offset" always
represents the current item.

This is also consistent with the private data member "bucket", which
represents the current bucket, and also the use of "pos" as defined in
seq_file.txt:
    The pos passed to start() will always be either zero, or the most
    recent pos used in the previous session.

Signed-off-by: Jeff Barnhill <0xeffeff@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 19:10:49 -07:00
Sean Tranchetti
f88b4c01b9 netlabel: check for IPV4MASK in addrinfo_get
netlbl_unlabel_addrinfo_get() assumes that if it finds the
NLBL_UNLABEL_A_IPV4ADDR attribute, it must also have the
NLBL_UNLABEL_A_IPV4MASK attribute as well. However, this is
not necessarily the case as the current checks in
netlbl_unlabel_staticadd() and friends are not sufficent to
enforce this.

If passed a netlink message with NLBL_UNLABEL_A_IPV4ADDR,
NLBL_UNLABEL_A_IPV6ADDR, and NLBL_UNLABEL_A_IPV6MASK attributes,
these functions will all call netlbl_unlabel_addrinfo_get() which
will then attempt dereference NULL when fetching the non-existent
NLBL_UNLABEL_A_IPV4MASK attribute:

Unable to handle kernel NULL pointer dereference at virtual address 0
Process unlab (pid: 31762, stack limit = 0xffffff80502d8000)
Call trace:
	netlbl_unlabel_addrinfo_get+0x44/0xd8
	netlbl_unlabel_staticremovedef+0x98/0xe0
	genl_rcv_msg+0x354/0x388
	netlink_rcv_skb+0xac/0x118
	genl_rcv+0x34/0x48
	netlink_unicast+0x158/0x1f0
	netlink_sendmsg+0x32c/0x338
	sock_sendmsg+0x44/0x60
	___sys_sendmsg+0x1d0/0x2a8
	__sys_sendmsg+0x64/0xb4
	SyS_sendmsg+0x34/0x4c
	el0_svc_naked+0x34/0x38
Code: 51001149 7100113f 540000a0 f9401508 (79400108)
---[ end trace f6438a488e737143 ]---
Kernel panic - not syncing: Fatal exception

Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 18:58:34 -07:00
zhong jiang
f2a2f2166a ipv6: remove redundant null pointer check before kfree_skb
kfree_skb has taken the null pointer into account. hence it is safe
to remove the redundant null pointer check before kfree_skb.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 09:04:37 -07:00
zhong jiang
e3c3215e41 net: nci: remove redundant null pointer check before kfree_skb
kfree_skb has taken the null pointer into account. hence it is safe
to remove the redundant null pointer check before kfree_skb.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 09:04:37 -07:00
zhong jiang
1d08962ff1 ipv4: remove redundant null pointer check before kfree_skb
kfree_skb has taken the null pointer into account. hence it is safe
to remove the redundant null pointer check before kfree_skb.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 09:04:37 -07:00
Vlad Buslov
ec3ed293e7 net_sched: change tcf_del_walker() to take idrinfo->lock
Action API was changed to work with actions and action_idr in concurrency
safe manner, however tcf_del_walker() still uses actions without taking a
reference or idrinfo->lock first, and deletes them directly, disregarding
possible concurrent delete.

Change tcf_del_walker() to take idrinfo->lock while iterating over actions
and use new tcf_idr_release_unsafe() to release them while holding the
lock.

And the blocking function fl_hw_destroy_tmplt() could be called when we
put a filter chain, so defer it to a work queue.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
[xiyou.wangcong@gmail.com: heavily modify the code and changelog]
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-21 08:55:05 -07:00
Florian Westphal
9306425b70 netfilter: ctnetlink: must check mark attributes vs NULL
else we will oops (null deref) when the attributes aren't present.

Also add back the EOPNOTSUPP in case MARK filtering is requested but
kernel doesn't support it.

Fixes: 59c08c69c2 ("netfilter: ctnetlink: Support L3 protocol-filter on flush")
Reported-by: syzbot+e45eda8eda6e93a03959@syzkaller.appspotmail.com
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-21 10:14:46 +02:00
David Ahern
9f18b6b68e netfilter: nft_fib: Convert nft_fib4_eval to new dev helper
Convert nft_fib4_eval to the new device checking helper and
remove the duplicate code.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-20 20:01:53 -07:00
David Ahern
91a178258a netfilter: rpfilter: Convert rpfilter_lookup_reverse to new dev helper
Convert rpfilter_lookup_reverse to the new device checking helper
and remove the duplicate code.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-20 20:01:52 -07:00
David Ahern
78f2756c5f net/ipv4: Move device validation to helper
Move the device matching check in __fib_validate_source to a helper and
export it for use by netfilter modules. Code move only; no functional
change intended.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-20 20:01:52 -07:00
Xin Long
d7ab5cdce5 sctp: update dst pmtu with the correct daddr
When processing pmtu update from an icmp packet, it calls .update_pmtu
with sk instead of skb in sctp_transport_update_pmtu.

However for sctp, the daddr in the transport might be different from
inet_sock->inet_daddr or sk->sk_v6_daddr, which is used to update or
create the route cache. The incorrect daddr will cause a different
route cache created for the path.

So before calling .update_pmtu, inet_sock->inet_daddr/sk->sk_v6_daddr
should be updated with the daddr in the transport, and update it back
after it's done.

The issue has existed since route exceptions introduction.

Fixes: 4895c771c7 ("ipv4: Add FIB nexthop exceptions.")
Reported-by: ian.periam@dialogic.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-20 11:29:30 -07:00
Mauro Carvalho Chehab
c03e2fa753 docs: fix some broken documentation references
Some documentation files received recent changes and are
pointing to wrong places.

Those references can easily fixed with the help of a
script:

	$ ./scripts/documentation-file-ref-check --fix

Signed-off-by: Mauro Carvalho Chehab <mchehab+samsung@kernel.org>
Reviewed-by: Jan Kara <jack@suse.cz>
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
Signed-off-by: Jonathan Corbet <corbet@lwn.net>
2018-09-20 11:09:53 -06:00
zhong jiang
346fa83d10 netfilter: conntrack: get rid of double sizeof
sizeof(sizeof()) is quite strange and does not seem to be what
is wanted here.

The issue is detected with the help of Coccinelle.

Fixes: 3921584674 ("netfilter: conntrack: remove nlattr_size pointer from l4proto trackers")
Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20 18:40:32 +02:00
Stefan Agner
bab4344975 netfilter: nft_osf: use enum nft_data_types for nft_validate_register_store
The function nft_validate_register_store requires a struct of type
struct nft_data_types. NFTA_DATA_VALUE is of type enum
nft_verdict_attributes. Pass the correct enum type.

This fixes a warning seen with Clang:
  net/netfilter/nft_osf.c:52:8: warning: implicit conversion from
    enumeration type 'enum nft_data_attributes' to different enumeration
    type 'enum nft_data_types' [-Wenum-conversion]
                                          NFTA_DATA_VALUE, NFT_OSF_MAXGENRELEN);
                                          ^~~~~~~~~~~~~~~

Fixes: b96af92d6e ("netfilter: nf_tables: implement Passive OS fingerprint module in nft_osf")
Link: https://github.com/ClangBuiltLinux/linux/issues/103
Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20 18:38:40 +02:00
Taehee Yoo
0de22baabc netfilter: nf_tables: use rhashtable_walk_enter instead of rhashtable_walk_init
rhashtable_walk_init() is deprecated and rhashtable_walk_enter() can be
used instead. rhashtable_walk_init() is wrapper function of
rhashtable_walk_enter() so that logic is actually same.
But rhashtable_walk_enter() doesn't return error hence error path
code can be removed.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20 18:26:56 +02:00
Taehee Yoo
f8b0a3ab06 netfilter: nat: remove duplicate skb_is_nonlinear() in __nf_nat_mangle_tcp_packet()
__nf_nat_mangle_tcp_packet() and nf_nat_mangle_udp_packet() call
mangle_contents(). and __nf_nat_mangle_tcp_packet()
and mangle_contents() call skb_is_nonlinear(). so that
skb_is_nonlinear() in __nf_nat_mangle_tcp_packet() is unnecessary.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20 18:26:56 +02:00
David Ahern
a173f066c7 netfilter: bridge: Don't sabotage nf_hook calls from an l3mdev
For starters, the bridge netfilter code registers operations that
are invoked any time nh_hook is called. Specifically, ip_sabotage_in
watches for nested calls for NF_INET_PRE_ROUTING when a bridge is in
the stack.

Packet wise, the bridge netfilter hook runs first. br_nf_pre_routing
allocates nf_bridge, sets in_prerouting to 1 and calls NF_HOOK for
NF_INET_PRE_ROUTING. It's finish function, br_nf_pre_routing_finish,
then resets in_prerouting flag to 0 and the packet continues up the
stack. The packet eventually makes it to the VRF driver and it invokes
nf_hook for NF_INET_PRE_ROUTING in case any rules have been added against
the vrf device.

Because of the registered operations the call to nf_hook causes
ip_sabotage_in to be invoked. That function sees the nf_bridge on the
skb and that in_prerouting is not set. Thinking it is an invalid nested
call it steals (drops) the packet.

Update ip_sabotage_in to recognize that the bridge or one of its upper
devices (e.g., vlan) can be enslaved to a VRF (L3 master device) and
allow the packet to go through the nf_hook a second time.

Fixes: 73e20b761a ("net: vrf: Add support for PREROUTING rules on vrf device")
Reported-by: D'Souza, Nelson <ndsouza@ciena.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20 18:24:14 +02:00
Florian Westphal
93185c80a5 netfilter: conntrack: clamp l4proto array size at largers supported protocol
All higher l4proto numbers are handled by the generic tracker; the
l4proto lookup function already returns generic one in case the l4proto
number exceeds max size.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20 18:08:14 +02:00
Florian Westphal
dd2934a957 netfilter: conntrack: remove l3->l4 mapping information
l4 protocols are demuxed by l3num, l4num pair.

However, almost all l4 trackers are l3 agnostic.

Only exceptions are:
 - gre, icmp (ipv4 only)
 - icmpv6 (ipv6 only)

This commit gets rid of the l3 mapping, l4 trackers can now be looked up
by their IPPROTO_XXX value alone, which gets rid of the additional l3
indirection.

For icmp, ipcmp6 and gre, add a check on state->pf and
return -NF_ACCEPT in case we're asked to track e.g. icmpv6-in-ipv4,
this seems more fitting than using the generic tracker.

Additionally we can kill the 2nd l4proto definitions that were needed
for v4/v6 split -- they are now the same so we can use single l4proto
struct for each protocol, rather than two.

The EXPORT_SYMBOLs can be removed as all these object files are
part of nf_conntrack with no external references.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20 18:07:35 +02:00
Florian Westphal
ca2ca6e1c0 netfilter: conntrack: remove unused proto arg from netns init functions
Its unused, next patch will remove l4proto->l3proto number to simplify
l4 protocol demuxer lookup.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20 18:03:50 +02:00
Florian Westphal
6fe78fa484 netfilter: conntrack: remove error callback and handle icmp from core
icmp(v6) are the only two layer four protocols that need the error()
callback (to handle icmp errors that are related to an established
connections, e.g. packet too big, port unreachable and the like).

Remove the error callback and handle these two special cases from the core.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20 18:02:57 +02:00
Florian Westphal
0150ffbac7 netfilter: conntrack: avoid using ->error callback if possible
The error() handler gets called before allocating or looking up a
connection tracking entry.

We can instead use direct calls from the ->packet() handlers which get
invoked for every packet anyway.

Only exceptions are icmp and icmpv6, these two special cases will be
handled in the next patch.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20 18:02:42 +02:00
Florian Westphal
83d213fd9d netfilter: conntrack: deconstify packet callback skb pointer
Only two protocols need the ->error() function: icmp and icmpv6.
This is because icmp error mssages might be RELATED to an existing
connection (e.g. PMTUD, port unreachable and the like), and their
->error() handlers do this.

The error callback is already optional, so remove it for
udp and call them from ->packet() instead.

As the error() callback can call checksum functions that write to
skb->csum*, the const qualifier has to be removed as well.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20 18:02:22 +02:00
Florian Westphal
9976fc6e6e netfilter: conntrack: remove the l4proto->new() function
->new() gets invoked after ->error() and before ->packet() if
a conntrack lookup has found no result for the tuple.

We can fold it into ->packet() -- the packet() implementations
can check if the conntrack is confirmed (new) or not
(already in hash).

If its unconfirmed, the conntrack isn't in the hash yet so current
skb created a new conntrack entry.

Only relevant side effect -- if packet() doesn't return NF_ACCEPT
but -NF_ACCEPT (or drop), while the conntrack was just created,
then the newly allocated conntrack is freed right away, rather than not
created in the first place.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20 17:57:17 +02:00
Florian Westphal
93e66024b0 netfilter: conntrack: pass nf_hook_state to packet and error handlers
nf_hook_state contains all the hook meta-information: netns, protocol family,
hook location, and so on.

Instead of only passing selected information, pass a pointer to entire
structure.

This will allow to merge the error and the packet handlers and remove
the ->new() function in followup patches.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-20 17:54:37 +02:00
Johannes Berg
56ce3c5a50 smc: generic netlink family should be __ro_after_init
The generic netlink family is only initialized during module init,
so it should be __ro_after_init like all other generic netlink
families.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-20 07:49:55 -07:00
Sean Tranchetti
32bf94fb5c xfrm: validate template mode
XFRM mode parameters passed as part of the user templates
in the IP_XFRM_POLICY are never properly validated. Passing
values other than valid XFRM modes can cause stack-out-of-bounds
reads to occur later in the XFRM processing:

[  140.535608] ================================================================
[  140.543058] BUG: KASAN: stack-out-of-bounds in xfrm_state_find+0x17e4/0x1cc4
[  140.550306] Read of size 4 at addr ffffffc0238a7a58 by task repro/5148
[  140.557369]
[  140.558927] Call trace:
[  140.558936] dump_backtrace+0x0/0x388
[  140.558940] show_stack+0x24/0x30
[  140.558946] __dump_stack+0x24/0x2c
[  140.558949] dump_stack+0x8c/0xd0
[  140.558956] print_address_description+0x74/0x234
[  140.558960] kasan_report+0x240/0x264
[  140.558963] __asan_report_load4_noabort+0x2c/0x38
[  140.558967] xfrm_state_find+0x17e4/0x1cc4
[  140.558971] xfrm_resolve_and_create_bundle+0x40c/0x1fb8
[  140.558975] xfrm_lookup+0x238/0x1444
[  140.558977] xfrm_lookup_route+0x48/0x11c
[  140.558984] ip_route_output_flow+0x88/0xc4
[  140.558991] raw_sendmsg+0xa74/0x266c
[  140.558996] inet_sendmsg+0x258/0x3b0
[  140.559002] sock_sendmsg+0xbc/0xec
[  140.559005] SyS_sendto+0x3a8/0x5a8
[  140.559008] el0_svc_naked+0x34/0x38
[  140.559009]
[  140.592245] page dumped because: kasan: bad access detected
[  140.597981] page_owner info is not active (free page?)
[  140.603267]
[  140.653503] ================================================================

Signed-off-by: Sean Tranchetti <stranche@codeaurora.org>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-09-20 08:30:42 +02:00
Paolo Abeni
76c0ddd8c3 ip6_tunnel: be careful when accessing the inner header
the ip6 tunnel xmit ndo assumes that the processed skb always
contains an ip[v6] header, but syzbot has found a way to send
frames that fall short of this assumption, leading to the following splat:

BUG: KMSAN: uninit-value in ip6ip6_tnl_xmit net/ipv6/ip6_tunnel.c:1307
[inline]
BUG: KMSAN: uninit-value in ip6_tnl_start_xmit+0x7d2/0x1ef0
net/ipv6/ip6_tunnel.c:1390
CPU: 0 PID: 4504 Comm: syz-executor558 Not tainted 4.16.0+ #87
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
Google 01/01/2011
Call Trace:
  __dump_stack lib/dump_stack.c:17 [inline]
  dump_stack+0x185/0x1d0 lib/dump_stack.c:53
  kmsan_report+0x142/0x240 mm/kmsan/kmsan.c:1067
  __msan_warning_32+0x6c/0xb0 mm/kmsan/kmsan_instr.c:683
  ip6ip6_tnl_xmit net/ipv6/ip6_tunnel.c:1307 [inline]
  ip6_tnl_start_xmit+0x7d2/0x1ef0 net/ipv6/ip6_tunnel.c:1390
  __netdev_start_xmit include/linux/netdevice.h:4066 [inline]
  netdev_start_xmit include/linux/netdevice.h:4075 [inline]
  xmit_one net/core/dev.c:3026 [inline]
  dev_hard_start_xmit+0x5f1/0xc70 net/core/dev.c:3042
  __dev_queue_xmit+0x27ee/0x3520 net/core/dev.c:3557
  dev_queue_xmit+0x4b/0x60 net/core/dev.c:3590
  packet_snd net/packet/af_packet.c:2944 [inline]
  packet_sendmsg+0x7c70/0x8a30 net/packet/af_packet.c:2969
  sock_sendmsg_nosec net/socket.c:630 [inline]
  sock_sendmsg net/socket.c:640 [inline]
  ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
  __sys_sendmmsg+0x42d/0x800 net/socket.c:2136
  SYSC_sendmmsg+0xc4/0x110 net/socket.c:2167
  SyS_sendmmsg+0x63/0x90 net/socket.c:2162
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2
RIP: 0033:0x441819
RSP: 002b:00007ffe58ee8268 EFLAGS: 00000213 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 0000000000441819
RDX: 0000000000000002 RSI: 0000000020000100 RDI: 0000000000000003
RBP: 00000000006cd018 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000213 R12: 0000000000402510
R13: 00000000004025a0 R14: 0000000000000000 R15: 0000000000000000

Uninit was created at:
  kmsan_save_stack_with_flags mm/kmsan/kmsan.c:278 [inline]
  kmsan_internal_poison_shadow+0xb8/0x1b0 mm/kmsan/kmsan.c:188
  kmsan_kmalloc+0x94/0x100 mm/kmsan/kmsan.c:314
  kmsan_slab_alloc+0x11/0x20 mm/kmsan/kmsan.c:321
  slab_post_alloc_hook mm/slab.h:445 [inline]
  slab_alloc_node mm/slub.c:2737 [inline]
  __kmalloc_node_track_caller+0xaed/0x11c0 mm/slub.c:4369
  __kmalloc_reserve net/core/skbuff.c:138 [inline]
  __alloc_skb+0x2cf/0x9f0 net/core/skbuff.c:206
  alloc_skb include/linux/skbuff.h:984 [inline]
  alloc_skb_with_frags+0x1d4/0xb20 net/core/skbuff.c:5234
  sock_alloc_send_pskb+0xb56/0x1190 net/core/sock.c:2085
  packet_alloc_skb net/packet/af_packet.c:2803 [inline]
  packet_snd net/packet/af_packet.c:2894 [inline]
  packet_sendmsg+0x6454/0x8a30 net/packet/af_packet.c:2969
  sock_sendmsg_nosec net/socket.c:630 [inline]
  sock_sendmsg net/socket.c:640 [inline]
  ___sys_sendmsg+0xec0/0x1310 net/socket.c:2046
  __sys_sendmmsg+0x42d/0x800 net/socket.c:2136
  SYSC_sendmmsg+0xc4/0x110 net/socket.c:2167
  SyS_sendmmsg+0x63/0x90 net/socket.c:2162
  do_syscall_64+0x309/0x430 arch/x86/entry/common.c:287
  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

This change addresses the issue adding the needed check before
accessing the inner header.

The ipv4 side of the issue is apparently there since the ipv4 over ipv6
initial support, and the ipv6 side predates git history.

Fixes: c4d3efafcc ("[IPV6] IP6TUNNEL: Add support to IPv4 over IPv6 tunnel.")
Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: syzbot+3fde91d4d394747d6db4@syzkaller.appspotmail.com
Tested-by: Alexander Potapenko <glider@google.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-19 21:24:28 -07:00
Robert Shearman
3ede0bbcdf ipv6: Allow the l3mdev to be a loopback
There is no way currently for an IPv6 client connect using a loopback
address in a VRF, whereas for IPv4 the loopback address can be added:

    $ sudo ip addr add dev vrfred 127.0.0.1/8
    $ sudo ip -6 addr add ::1/128 dev vrfred
    RTNETLINK answers: Cannot assign requested address

So allow ::1 to be configured on an L3 master device. In order for
this to be usable ip_route_output_flags needs to not consider ::1 to
be a link scope address (since oif == l3mdev and so it would be
dropped), and ipv6_rcv needs to consider the l3mdev to be a loopback
device so that it doesn't drop the packets.

Signed-off-by: Robert Shearman <rshearma@vyatta.att-mail.com>
Signed-off-by: Mike Manning <mmanning@vyatta.att-mail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-19 21:23:01 -07:00
Heiner Kallweit
124eee3f69 net: linkwatch: add check for netdevice being present to linkwatch_do_dev
When bringing down the netdevice (incl. detaching it) and calling
netif_carrier_off directly or indirectly the latter triggers an
asynchronous linkwatch event.
This linkwatch event eventually may fail to access chip registers in
the ndo_get_stats/ndo_get_stats64 callback because the device isn't
accessible any longer, see call trace in [0].

To prevent this scenario don't check for IFF_UP only, but also make
sure that the netdevice is present.

[0] https://lists.openwall.net/netdev/2018/03/15/62

Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Tested-by: Geert Uytterhoeven <geert+renesas@glider.be>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-19 21:06:46 -07:00
zhong jiang
f195efb47d net: core: Use FIELD_SIZEOF directly instead of reimplementing its function
FIELD_SIZEOF is defined as a macro to calculate the specified value. Therefore,
We prefer to use the macro rather than calculating its value.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-19 20:58:04 -07:00
zhong jiang
cb205a8174 net: sched: Use FIELD_SIZEOF directly instead of reimplementing its function
FIELD_SIZEOF is defined as a macro to calculate the specified value. Therefore,
We prefer to use the macro rather than calculating its value.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-19 20:58:04 -07:00
zhong jiang
86c85e1fc4 net: iucv: Use FIELD_SIZEOF directly instead of reimplementing its function
FIELD_SIZEOF is defined as a macro to calculate the specified value. Therefore,
We prefer to use the macro rather than calculating its value.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-19 20:58:04 -07:00
David S. Miller
18522108d5 This feature/cleanup patchset includes the following patches:
- bump version strings, by Simon Wunderlich
 
  - Inform users about debugfs interface deprecation, by Sven Eckelmann
 
  - Implement tracing, planned to replace debugfs log messages,
    by Sven Eckelmann
 
  - Move OGM rebroadcasts to per interface struct, by Sven Eckelmann
 
  - Enable LockLess TX to increase throughput, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAluiQXoWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoTreD/9jxzBgB7PXGPJieaGDkVrmgEgO
 mC9PHvDjL58iFAUuWA+OYLx5oW1e3u+YrDBJsd/zD4+L7RbyJYoX546A2X2kpccj
 1yiAbLLs6oYk4vnxIWEnaz2pWJXjh+wRADKcnO1LkMnY4MatuLRDPKRib9FRwq7n
 XvOIg0Md6Wzds4d8dM1tBz3KsZJiP4I5NcvhMzYmsFNl51JJpcECG6KweBV3qrmY
 ZpG/vmUo37D0xv4/15TpL5aj6EpC8nxUp8XJqq0zRnvOx/2VgaKz5Qb8dewN/pUt
 rxTVtBlJX/EXCU+m81mteEhGii9DhTzKXhkho/eJKdsxV2EAf1qxr4/Bbz6ciGs4
 SJuAq0q5owtr/hgAihUXAC0fnOZOw1xOjt2B+snRY4VjoPIg8kr8s4GXupqqFCXY
 DwnWSvt37WZg0D3N+jXU2eNdzdZqm/pkU1yvUUpRd5KNFHHvM+yw+BoeEf8rMcuY
 AefU3ZSE6cOt5MKVozDFwxF24hBTd7CNgi4uD+SWOTtUc7Yj0xIJyY7Yle6q5p3f
 wX77RmS2VKLrxyhFbxGNsNDMiVTqAEJQglmuy3BACy7YVJz07lSUSGbhVowLGk74
 tkgOS5hwVcDgq3o8fdVgTGaj2T5WPf1FDFdZ+g4QvqjTKm0pLHWKc8KEsSerUZfw
 Fjm8DrB3KCXvt6uD+g==
 =XZYf
 -----END PGP SIGNATURE-----

Merge tag 'batadv-next-for-davem-20180919' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
This feature/cleanup patchset includes the following patches:

 - bump version strings, by Simon Wunderlich

 - Inform users about debugfs interface deprecation, by Sven Eckelmann

 - Implement tracing, planned to replace debugfs log messages,
   by Sven Eckelmann

 - Move OGM rebroadcasts to per interface struct, by Sven Eckelmann

 - Enable LockLess TX to increase throughput, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-19 20:35:37 -07:00
David S. Miller
aa86b03c3e Here are some batman-adv bugfixes:
- Avoid ELP information leak, by Sven Eckelmann
 
  - Fix sysfs segfault issues, by Sven Eckelmann (2 patches)
 
  - Fix locking when adding entries in various lists,
    by Sven Eckelmann (5 patches)
 
  - Fix refcount if queue_work() fails, by Marek Lindner (2 patches)
 
  - Fixup forgotten version bump, by Sven Eckelmann
 -----BEGIN PGP SIGNATURE-----
 
 iQJKBAABCgA0FiEE1ilQI7G+y+fdhnrfoSvjmEKSnqEFAluiO2gWHHN3QHNpbW9u
 d3VuZGVybGljaC5kZQAKCRChK+OYQpKeoRrcD/9F8CZnQusMuvLe/8sxWiyb70zx
 Fx8dsZS7ppMN4SoQU5dB9NbgDwzDdK1L42Fgzm2CfgeEIYzUOsw6l/+KFuXQSU/t
 Sb/NloyFNQ5nGHO6IcLY8Mqa6bMX/6dlBinRcuTyXOECqqBAaYTf56olNyUty3K6
 9BHfyule+oZtQ5aN6DhOsVB0ATxGoqJK2lbAePj+mzb+5253YykUhjAaEVGM0oNN
 3ULq34dyPxPb2t/z0qNK8CJ2+RXPrAqssEaD0X0awl7spFJYt4jvAXwKvHnR/ryo
 6EDrvnd38bcs+DJrNHhycyk+bITT6njnTv4kJtQuFAd+eap5h7RR5mAElSBgWSAB
 Xy0HjwNY/mb/OMY85/o2EjLOrmr31YpJcLWI6sLqPrUbD4WGhoW1ZB71AQCL3nHL
 6YLln+mYCK00KT5hPU9MzO/3nGyd3uvRjKLLd42chC+YZXxvhWFFEkpsEGYJZq/q
 qduWytS7+W+FSUrDEgTePJh6rdwAbFLSUWqW5bBxA5ATctxxbsslF8IvUJy6fCel
 Sp+iNIkjfp+eB8jFaacrrw4At1YE2nDJlyycMvLQvQOaSCtpGuKiTqsqTHFbgSy4
 Y7QTHNLLbguiASCuhVvP5EYsBm8ZqNUjXMJZCHOCVaXp/cQ3K7Z6yC2pEst2QJ6m
 9lrPOg32Rfo9dRNcTA==
 =B1qZ
 -----END PGP SIGNATURE-----

Merge tag 'batadv-net-for-davem-20180919' of git://git.open-mesh.org/linux-merge

Simon Wunderlich says:

====================
pull request for net: batman-adv 2018-09-19

here are some bugfixes which we would like to see integrated into net.

We forgot to bump the version number in the last round for net-next, so
the belated patch to do that is included - we hope you can adopt it.
This will most likely create a merge conflict later when merging into
net-next with this rounds net-next patchset, but net-next should keep
the 2018.4 version[1].

[1] resolution:

--- a/net/batman-adv/main.h
+++ b/net/batman-adv/main.h
@@ -25,11 +25,7 @@
 #define BATADV_DRIVER_DEVICE "batman-adv"

 #ifndef BATADV_SOURCE_VERSION
-<<<<<<<
-#define BATADV_SOURCE_VERSION "2018.3"
-=======
 #define BATADV_SOURCE_VERSION "2018.4"
->>>>>>>
 #endif

 /* B.A.T.M.A.N. parameters */

Please pull or let me know of any problem!

Here are some batman-adv bugfixes:

 - Avoid ELP information leak, by Sven Eckelmann

 - Fix sysfs segfault issues, by Sven Eckelmann (2 patches)

 - Fix locking when adding entries in various lists,
   by Sven Eckelmann (5 patches)

 - Fix refcount if queue_work() fails, by Marek Lindner (2 patches)

 - Fixup forgotten version bump, by Sven Eckelmann
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-19 20:35:14 -07:00
Wei Wang
ce7ea4af08 ipv6: fix memory leak on dst->_metrics
When dst->_metrics and f6i->fib6_metrics share the same memory, both
take reference count on the dst_metrics structure. However, when dst is
destroyed, ip6_dst_destroy() only invokes dst_destroy_metrics_generic()
which does not take care of READONLY metrics and does not release refcnt.
This causes memory leak.
Similar to ipv4 logic, the fix is to properly release refcnt and free
the memory space pointed by dst->_metrics if refcnt becomes 0.

Fixes: 93531c6743 ("net/ipv6: separate handling of FIB entries from dst based routes")
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-18 20:17:01 -07:00
Wei Wang
8675860592 Revert "ipv6: fix double refcount of fib6_metrics"
This reverts commit e70a3aad44.

This change causes use-after-free on dst->_metrics.
The crash trace looks like this:
[   97.763269] BUG: KASAN: use-after-free in ip6_mtu+0x116/0x140
[   97.769038] Read of size 4 at addr ffff881781d2cf84 by task svw_NetThreadEv/8801

[   97.777954] CPU: 76 PID: 8801 Comm: svw_NetThreadEv Not tainted 4.15.0-smp-DEV #11
[   97.777956] Hardware name: Default string Default string/Indus_QC_02, BIOS 5.46.4 03/29/2018
[   97.777957] Call Trace:
[   97.777971]  [<ffffffff895709db>] dump_stack+0x4d/0x72
[   97.777985]  [<ffffffff881651df>] print_address_description+0x6f/0x260
[   97.777997]  [<ffffffff88165747>] kasan_report+0x257/0x370
[   97.778001]  [<ffffffff894488e6>] ? ip6_mtu+0x116/0x140
[   97.778004]  [<ffffffff881658b9>] __asan_report_load4_noabort+0x19/0x20
[   97.778008]  [<ffffffff894488e6>] ip6_mtu+0x116/0x140
[   97.778013]  [<ffffffff892bb91e>] tcp_current_mss+0x12e/0x280
[   97.778016]  [<ffffffff892bb7f0>] ? tcp_mtu_to_mss+0x2d0/0x2d0
[   97.778022]  [<ffffffff887b45b8>] ? depot_save_stack+0x138/0x4a0
[   97.778037]  [<ffffffff87c38985>] ? __mmdrop+0x145/0x1f0
[   97.778040]  [<ffffffff881643b1>] ? save_stack+0xb1/0xd0
[   97.778046]  [<ffffffff89264c82>] tcp_send_mss+0x22/0x220
[   97.778059]  [<ffffffff89273a49>] tcp_sendmsg_locked+0x4f9/0x39f0
[   97.778062]  [<ffffffff881642b4>] ? kasan_check_write+0x14/0x20
[   97.778066]  [<ffffffff89273550>] ? tcp_sendpage+0x60/0x60
[   97.778070]  [<ffffffff881cb359>] ? rw_copy_check_uvector+0x69/0x280
[   97.778075]  [<ffffffff8873c65f>] ? import_iovec+0x9f/0x430
[   97.778078]  [<ffffffff88164be7>] ? kasan_slab_free+0x87/0xc0
[   97.778082]  [<ffffffff8873c5c0>] ? memzero_page+0x140/0x140
[   97.778085]  [<ffffffff881642b4>] ? kasan_check_write+0x14/0x20
[   97.778088]  [<ffffffff89276f6c>] tcp_sendmsg+0x2c/0x50
[   97.778092]  [<ffffffff89276f6c>] ? tcp_sendmsg+0x2c/0x50
[   97.778098]  [<ffffffff89352d43>] inet_sendmsg+0x103/0x480
[   97.778102]  [<ffffffff89352c40>] ? inet_gso_segment+0x15b0/0x15b0
[   97.778105]  [<ffffffff890294da>] sock_sendmsg+0xba/0xf0
[   97.778108]  [<ffffffff8902ab6a>] ___sys_sendmsg+0x6ca/0x8e0
[   97.778113]  [<ffffffff87dccac1>] ? hrtimer_try_to_cancel+0x71/0x3b0
[   97.778116]  [<ffffffff8902a4a0>] ? copy_msghdr_from_user+0x3d0/0x3d0
[   97.778119]  [<ffffffff881646d1>] ? memset+0x31/0x40
[   97.778123]  [<ffffffff87a0cff5>] ? schedule_hrtimeout_range_clock+0x165/0x380
[   97.778127]  [<ffffffff87a0ce90>] ? hrtimer_nanosleep_restart+0x250/0x250
[   97.778130]  [<ffffffff87dcc700>] ? __hrtimer_init+0x180/0x180
[   97.778133]  [<ffffffff87dd1f82>] ? ktime_get_ts64+0x172/0x200
[   97.778137]  [<ffffffff8822b8ec>] ? __fget_light+0x8c/0x2f0
[   97.778141]  [<ffffffff8902d5c6>] __sys_sendmsg+0xe6/0x190
[   97.778144]  [<ffffffff8902d5c6>] ? __sys_sendmsg+0xe6/0x190
[   97.778147]  [<ffffffff8902d4e0>] ? SyS_shutdown+0x20/0x20
[   97.778152]  [<ffffffff87cd4370>] ? wake_up_q+0xe0/0xe0
[   97.778155]  [<ffffffff8902d670>] ? __sys_sendmsg+0x190/0x190
[   97.778158]  [<ffffffff8902d683>] SyS_sendmsg+0x13/0x20
[   97.778162]  [<ffffffff87a1600c>] do_syscall_64+0x2ac/0x430
[   97.778166]  [<ffffffff87c17515>] ? do_page_fault+0x35/0x3d0
[   97.778171]  [<ffffffff8960131f>] ? page_fault+0x2f/0x50
[   97.778174]  [<ffffffff89600071>] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[   97.778177] RIP: 0033:0x7f83fa36000d
[   97.778178] RSP: 002b:00007f83ef9229e0 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[   97.778180] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f83fa36000d
[   97.778182] RDX: 0000000000004000 RSI: 00007f83ef922f00 RDI: 0000000000000036
[   97.778183] RBP: 00007f83ef923040 R08: 00007f83ef9231f8 R09: 00007f83ef923168
[   97.778184] R10: 0000000000000000 R11: 0000000000000293 R12: 00007f83f69c5b40
[   97.778185] R13: 000000000000001c R14: 0000000000000001 R15: 0000000000004000

[   97.779684] Allocated by task 5919:
[   97.783185]  save_stack+0x46/0xd0
[   97.783187]  kasan_kmalloc+0xad/0xe0
[   97.783189]  kmem_cache_alloc_trace+0xdf/0x580
[   97.783190]  ip6_convert_metrics.isra.79+0x7e/0x190
[   97.783192]  ip6_route_info_create+0x60a/0x2480
[   97.783193]  ip6_route_add+0x1d/0x80
[   97.783195]  inet6_rtm_newroute+0xdd/0xf0
[   97.783198]  rtnetlink_rcv_msg+0x641/0xb10
[   97.783200]  netlink_rcv_skb+0x27b/0x3e0
[   97.783202]  rtnetlink_rcv+0x15/0x20
[   97.783203]  netlink_unicast+0x4be/0x720
[   97.783204]  netlink_sendmsg+0x7bc/0xbf0
[   97.783205]  sock_sendmsg+0xba/0xf0
[   97.783207]  ___sys_sendmsg+0x6ca/0x8e0
[   97.783208]  __sys_sendmsg+0xe6/0x190
[   97.783209]  SyS_sendmsg+0x13/0x20
[   97.783211]  do_syscall_64+0x2ac/0x430
[   97.783213]  entry_SYSCALL_64_after_hwframe+0x3d/0xa2

[   97.784709] Freed by task 0:
[   97.785056] knetbase: Error: /proc/sys/net/core/txcs_enable does not exist
[   97.794497]  save_stack+0x46/0xd0
[   97.794499]  kasan_slab_free+0x71/0xc0
[   97.794500]  kfree+0x7c/0xf0
[   97.794501]  fib6_info_destroy_rcu+0x24f/0x310
[   97.794504]  rcu_process_callbacks+0x38b/0x1730
[   97.794506]  __do_softirq+0x1c8/0x5d0

Reported-by: John Sperbeck <jsperbeck@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-18 20:17:01 -07:00
YueHaibing
381897798a net/smc: fix sizeof to int comparison
Comparing an int to a size, which is unsigned, causes the int to become
unsigned, giving the wrong result. kernel_sendmsg can return a negative
error code.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-18 20:11:43 -07:00
Karsten Graul
71d117f527 net/smc: no urgent data check for listen sockets
Don't check a listen socket for pending urgent data in smc_poll().

Signed-off-by: Karsten Graul <kgraul@linux.ibm.com>
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-18 20:11:43 -07:00
Ursula Braun
dd65d87a6a net/smc: enable fallback for connection abort in state INIT
If a linkgroup is terminated abnormally already due to failing
LLC CONFIRM LINK or LLC ADD LINK, fallback to TCP is still possible.
In this case do not switch to state SMC_PEERABORTWAIT and do not set
sk_err.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-18 20:11:43 -07:00
Ursula Braun
1ca52fcfac net/smc: remove duplicate mutex_unlock
For a failing smc_listen_rdma_finish() smc_listen_decline() is
called. If fallback is possible, the new socket is already enqueued
to be accepted in smc_listen_decline(). Avoid enqueuing a second time
afterwards in this case, otherwise the smc_create_lgr_pending lock
is released twice:
[  373.463976] WARNING: bad unlock balance detected!
[  373.463978] 4.18.0-rc7+ #123 Tainted: G           O
[  373.463979] -------------------------------------
[  373.463980] kworker/1:1/30 is trying to release lock (smc_create_lgr_pending) at:
[  373.463990] [<000003ff801205fc>] smc_listen_work+0x22c/0x5d0 [smc]
[  373.463991] but there are no more locks to release!
[  373.463991]
other info that might help us debug this:
[  373.463993] 2 locks held by kworker/1:1/30:
[  373.463994]  #0: 00000000772cbaed ((wq_completion)"events"){+.+.}, at: process_one_work+0x1ec/0x6b0
[  373.464000]  #1: 000000003ad0894a ((work_completion)(&new_smc->smc_listen_work)){+.+.}, at: process_one_work+0x1ec/0x6b0
[  373.464003]
stack backtrace:
[  373.464005] CPU: 1 PID: 30 Comm: kworker/1:1 Kdump: loaded Tainted: G           O      4.18.0-rc7uschi+ #123
[  373.464007] Hardware name: IBM 2827 H43 738 (LPAR)
[  373.464010] Workqueue: events smc_listen_work [smc]
[  373.464011] Call Trace:
[  373.464015] ([<0000000000114100>] show_stack+0x60/0xd8)
[  373.464019]  [<0000000000a8c9bc>] dump_stack+0x9c/0xd8
[  373.464021]  [<00000000001dcaf8>] print_unlock_imbalance_bug+0xf8/0x108
[  373.464022]  [<00000000001e045c>] lock_release+0x114/0x4f8
[  373.464025]  [<0000000000aa87fa>] __mutex_unlock_slowpath+0x4a/0x300
[  373.464027]  [<000003ff801205fc>] smc_listen_work+0x22c/0x5d0 [smc]
[  373.464029]  [<0000000000197a68>] process_one_work+0x2a8/0x6b0
[  373.464030]  [<0000000000197ec2>] worker_thread+0x52/0x410
[  373.464033]  [<000000000019fd0e>] kthread+0x15e/0x178
[  373.464035]  [<0000000000aaf58a>] kernel_thread_starter+0x6/0xc
[  373.464052]  [<0000000000aaf584>] kernel_thread_starter+0x0/0xc
[  373.464054] INFO: lockdep is turned off.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-18 20:11:43 -07:00
Ursula Braun
648a5a7aed net/smc: fix non-blocking connect problem
In state SMC_INIT smc_poll() delegates polling to the internal
CLC socket. This means, once the connect worker has finished
its kernel_connect() step, the poll wake-up may occur. This is not
intended. The wake-up should occur from the wake up call in
smc_connect_work() after __smc_connect() has finished.
Thus in state SMC_INIT this patch now calls sock_poll_wait() on the
main SMC socket.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-18 20:11:43 -07:00
Suren Baghdasaryan
674d9de02a NFC: Fix possible memory corruption when handling SHDLC I-Frame commands
When handling SHDLC I-Frame commands "pipe" field used for indexing
into an array should be checked before usage. If left unchecked it
might access memory outside of the array of size NFC_HCI_MAX_PIPES(127).

Malformed NFC HCI frames could be injected by a malicious NFC device
communicating with the device being attacked (remote attack vector),
or even by an attacker with physical access to the I2C bus such that
they could influence the data transfers on that bus (local attack vector).
skb->data is controlled by the attacker and has only been sanitized in
the most trivial ways (CRC check), therefore we can consider the
create_info struct and all of its members to tainted. 'create_info->pipe'
with max value of 255 (uint8) is used to take an offset of the
hdev->pipes array of 127 elements which can lead to OOB write.

Cc: Samuel Ortiz <sameo@linux.intel.com>
Cc: Allen Pais <allen.pais@oracle.com>
Cc: "David S. Miller" <davem@davemloft.net>
Suggested-by: Kevin Deus <kdeus@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Kees Cook <keescook@chromium.org>
Cc: stable <stable@vger.kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-18 19:55:01 -07:00
David S. Miller
e366fa4350 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
Two new tls tests added in parallel in both net and net-next.

Used Stephen Rothwell's linux-next resolution.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-18 09:33:27 -07:00
Peter Oskolkov
30bfd93062 net/ipv6: do not copy dst flags on rt init
DST_NOCOUNT in dst_entry::flags tracks whether the entry counts
toward route cache size (net->ipv6.sysctl.ip6_rt_max_size).

If the flag is NOT set, dst_ops::pcpuc_entries counter is incremented
in dist_init() and decremented in dst_destroy().

This flag is tied to allocation/deallocation of dst_entry and
should not be copied from another dst/route. Otherwise it can happen
that dst_ops::pcpuc_entries counter grows until no new routes can
be allocated because the counter reached ip6_rt_max_size due to
DST_NOCOUNT not set and thus no counter decrements on gc-ed routes.

Fixes: 3b6761d18b ("net/ipv6: Move dst flags to booleans in fib entries")
Cc: David Ahern <dsahern@gmail.com>
Acked-by: Wei Wang <weiwan@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17 19:42:14 -07:00
Stefan Nuernberger
076ed3da0c net/ipv4: defensive cipso option parsing
commit 40413955ee ("Cipso: cipso_v4_optptr enter infinite loop") fixed
a possible infinite loop in the IP option parsing of CIPSO. The fix
assumes that ip_options_compile filtered out all zero length options and
that no other one-byte options beside IPOPT_END and IPOPT_NOOP exist.
While this assumption currently holds true, add explicit checks for zero
length and invalid length options to be safe for the future. Even though
ip_options_compile should have validated the options, the introduction of
new one-byte options can still confuse this code without the additional
checks.

Signed-off-by: Stefan Nuernberger <snu@amazon.com>
Cc: David Woodhouse <dwmw@amazon.co.uk>
Cc: Simon Veith <sveith@amazon.de>
Cc: stable@vger.kernel.org
Acked-by: Paul Moore <paul@paul-moore.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17 19:37:46 -07:00
Colin Ian King
5d288b8865 net: caif: remove redundant null check on frontpkt
It is impossible for frontpkt to be null at the point of the null
check because it has been assigned from rearpkt and there is no
way rearpkt can be null at the point of the assignment because
of the sanity checking and exit paths taken previously. Remove
the redundant null check.

Detected by CoverityScan, CID#114434 ("Logically dead code")

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17 18:49:57 -07:00
David S. Miller
3275b4df3c Revert "kcm: remove any offset before parsing messages"
This reverts commit 072222b488.

I just read that this causes regressions.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17 18:43:42 -07:00
Dominique Martinet
072222b488 kcm: remove any offset before parsing messages
The current code assumes kcm users know they need to look for the
strparser offset within their bpf program, which is not documented
anywhere and examples laying around do not do.

The actual recv function does handle the offset well, so we can create a
temporary clone of the skb and pull that one up as required for parsing.

The pull itself has a cost if we are pulling beyond the head data,
measured to 2-3% latency in a noisy VM with a local client stressing
that path. The clone's impact seemed too small to measure.

This bug can be exhibited easily by implementing a "trivial" kcm parser
taking the first bytes as size, and on the client sending at least two
such packets in a single write().

Note that bpf sockmap has the same problem, both for parse and for recv,
so it would pulling twice or a real pull within the strparser logic if
anyone cares about that.

Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17 18:42:39 -07:00
zhong jiang
b458925ed5 net: dsa: remove redundant null pointer check before put_device
put_device has taken the null pinter check into account. So it is
safe to remove the duplicated check before put_device.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17 08:20:11 -07:00
Zhu Yanjun
1635bb548f net: rds: use memset to optimize the recv
The function rds_inc_init is in recv process. To use memset can optimize
the function rds_inc_init.
The test result:

     Before:
     1) + 24.950 us   |        rds_inc_init [rds]();
     After:
     1) + 10.990 us   |        rds_inc_init [rds]();

Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17 08:19:51 -07:00
Hauke Mehrtens
711ddb625c net: dsa: tag_gswip: Add gswip to dsa_tag_protocol_to_str()
The gswip tag was missing in the dsa_tag_protocol_to_str() function, add it.

Fixes: 7969119293 ("net: dsa: Add Lantiq / Intel GSWIP tag support")
Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17 08:12:11 -07:00
Daniel Borkmann
50c6b58a81 tls: fix currently broken MSG_PEEK behavior
In kTLS MSG_PEEK behavior is currently failing, strace example:

  [pid  2430] socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
  [pid  2430] socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 4
  [pid  2430] bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
  [pid  2430] listen(4, 10)               = 0
  [pid  2430] getsockname(4, {sa_family=AF_INET, sin_port=htons(38855), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
  [pid  2430] connect(3, {sa_family=AF_INET, sin_port=htons(38855), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
  [pid  2430] setsockopt(3, SOL_TCP, 0x1f /* TCP_??? */, [7564404], 4) = 0
  [pid  2430] setsockopt(3, 0x11a /* SOL_?? */, 1, "\3\0033\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40) = 0
  [pid  2430] accept(4, {sa_family=AF_INET, sin_port=htons(49636), sin_addr=inet_addr("127.0.0.1")}, [16]) = 5
  [pid  2430] setsockopt(5, SOL_TCP, 0x1f /* TCP_??? */, [7564404], 4) = 0
  [pid  2430] setsockopt(5, 0x11a /* SOL_?? */, 2, "\3\0033\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40) = 0
  [pid  2430] close(4)                    = 0
  [pid  2430] sendto(3, "test_read_peek", 14, 0, NULL, 0) = 14
  [pid  2430] sendto(3, "_mult_recs\0", 11, 0, NULL, 0) = 11
  [pid  2430] recvfrom(5, "test_read_peektest_read_peektest"..., 64, MSG_PEEK, NULL, NULL) = 64

As can be seen from strace, there are two TLS records sent,
i) 'test_read_peek' and ii) '_mult_recs\0' where we end up
peeking 'test_read_peektest_read_peektest'. This is clearly
wrong, and what happens is that given peek cannot call into
tls_sw_advance_skb() to unpause strparser and proceed with
the next skb, we end up looping over the current one, copying
the 'test_read_peek' over and over into the user provided
buffer.

Here, we can only peek into the currently held skb (current,
full TLS record) as otherwise we would end up having to hold
all the original skb(s) (depending on the peek depth) in a
separate queue when unpausing strparser to process next
records, minimally intrusive is to return only up to the
current record's size (which likely was what c46234ebb4
("tls: RX path for ktls") originally intended as well). Thus,
after patch we properly peek the first record:

  [pid  2046] wait4(2075,  <unfinished ...>
  [pid  2075] socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 3
  [pid  2075] socket(AF_INET, SOCK_STREAM, IPPROTO_IP) = 4
  [pid  2075] bind(4, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
  [pid  2075] listen(4, 10)               = 0
  [pid  2075] getsockname(4, {sa_family=AF_INET, sin_port=htons(55115), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
  [pid  2075] connect(3, {sa_family=AF_INET, sin_port=htons(55115), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
  [pid  2075] setsockopt(3, SOL_TCP, 0x1f /* TCP_??? */, [7564404], 4) = 0
  [pid  2075] setsockopt(3, 0x11a /* SOL_?? */, 1, "\3\0033\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40) = 0
  [pid  2075] accept(4, {sa_family=AF_INET, sin_port=htons(45732), sin_addr=inet_addr("127.0.0.1")}, [16]) = 5
  [pid  2075] setsockopt(5, SOL_TCP, 0x1f /* TCP_??? */, [7564404], 4) = 0
  [pid  2075] setsockopt(5, 0x11a /* SOL_?? */, 2, "\3\0033\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 40) = 0
  [pid  2075] close(4)                    = 0
  [pid  2075] sendto(3, "test_read_peek", 14, 0, NULL, 0) = 14
  [pid  2075] sendto(3, "_mult_recs\0", 11, 0, NULL, 0) = 11
  [pid  2075] recvfrom(5, "test_read_peek", 64, MSG_PEEK, NULL, NULL) = 14

Fixes: c46234ebb4 ("tls: RX path for ktls")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17 08:03:09 -07:00
John Fastabend
7a3dd8c897 tls: async support causes out-of-bounds access in crypto APIs
When async support was added it needed to access the sk from the async
callback to report errors up the stack. The patch tried to use space
after the aead request struct by directly setting the reqsize field in
aead_request. This is an internal field that should not be used
outside the crypto APIs. It is used by the crypto code to define extra
space for private structures used in the crypto context. Users of the
API then use crypto_aead_reqsize() and add the returned amount of
bytes to the end of the request memory allocation before posting the
request to encrypt/decrypt APIs.

So this breaks (with general protection fault and KASAN error, if
enabled) because the request sent to decrypt is shorter than required
causing the crypto API out-of-bounds errors. Also it seems unlikely the
sk is even valid by the time it gets to the callback because of memset
in crypto layer.

Anyways, fix this by holding the sk in the skb->sk field when the
callback is set up and because the skb is already passed through to
the callback handler via void* we can access it in the handler. Then
in the handler we need to be careful to NULL the pointer again before
kfree_skb. I added comments on both the setup (in tls_do_decryption)
and when we clear it from the crypto callback handler
tls_decrypt_done(). After this selftests pass again and fixes KASAN
errors/warnings.

Fixes: 94524d8fc9 ("net/tls: Add support for async decryption of tls records")
Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Reviewed-by: Vakul Garg <Vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17 08:01:36 -07:00
Eric Dumazet
bbd6528d28 ipv6: fix possible use-after-free in ip6_xmit()
In the unlikely case ip6_xmit() has to call skb_realloc_headroom(),
we need to call skb_set_owner_w() before consuming original skb,
otherwise we risk a use-after-free.

Bring IPv6 in line with what we do in IPv4 to fix this.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-17 07:56:02 -07:00
Taehee Yoo
c8204cab9c netfilter: nat: remove unnecessary rcu_read_lock in nf_nat_redirect_ipv{4/6}
nf_nat_redirect_ipv4() and nf_nat_redirect_ipv6() are only called by
netfilter hook point. so that rcu_read_lock and rcu_read_unlock() are
unnecessary.

Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-17 16:11:14 +02:00
Pablo Neira Ayuso
4430b897a2 netfilter: cttimeout: remove superfluous check on layer 4 netlink functions
We assume they are always set accordingly since a874752a10
("netfilter: conntrack: timeout interface depend on
CONFIG_NF_CONNTRACK_TIMEOUT"), so we can get rid of this checks.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-17 16:11:14 +02:00
Florian Westphal
7052ba4080 netfilter: nf_nat_ipv4: remove obsolete EXPORT_SYMBOL
There are no external callers anymore, previous change just
forgot to also remove the EXPORT_SYMBOL().

Fixes: 9971a514ed ("netfilter: nf_nat: add nat type hooks to nat core")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-17 16:11:13 +02:00
Florian Westphal
70c0eb1ca0 netfilter: xtables: avoid BUG_ON
I see no reason for them, label or timer cannot be NULL, and if they
were, we'll crash with null deref anyway.

For skb_header_pointer failure, just set hotdrop to true and toss
such packet.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-17 16:11:12 +02:00
Florian Westphal
fa5950e498 netfilter: nf_tables: avoid BUG_ON usage
None of these spots really needs to crash the kernel.
In one two cases we can jsut report error to userspace, in the other
cases we can just use WARN_ON (and leak memory instead).

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-17 16:11:12 +02:00
Pablo Neira Ayuso
0d704967f4 netfilter: xt_cgroup: shrink size of v2 path
cgroup v2 path field is PATH_MAX which is too large, this is placing too
much pressure on memory allocation for people with many rules doing
cgroup v1 classid matching, side effects of this are bug reports like:

https://bugzilla.kernel.org/show_bug.cgi?id=200639

This patch registers a new revision that shrinks the cgroup path to 512
bytes, which is the same approach we follow in similar extensions that
have a path field.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
Acked-by: Tejun Heo <tj@kernel.org>
2018-09-17 16:11:03 +02:00
Kristian Evensen
59c08c69c2 netfilter: ctnetlink: Support L3 protocol-filter on flush
The same connection mark can be set on flows belonging to different
address families. This commit adds support for filtering on the L3
protocol when flushing connection track entries. If no protocol is
specified, then all L3 protocols match.

In order to avoid code duplication and a redundant check, the protocol
comparison in ctnetlink_dump_table() has been removed. Instead, a filter
is created if the GET-message triggering the dump contains an address
family. ctnetlink_filter_match() is then used to compare the L3
protocols.

Signed-off-by: Kristian Evensen <kristian.evensen@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-17 12:04:14 +02:00
Florian Westphal
6c47260250 netfilter: nf_tables: add xfrm expression
supports fetching saddr/daddr of tunnel mode states, request id and spi.
If direction is 'in', use inbound skb secpath, else dst->xfrm.

Joint work with Máté Eckl.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-17 11:40:08 +02:00
Florian Westphal
2953d80ff0 netfilter: remove obsolete need_conntrack stub
as of a0ae2562c6 ("netfilter: conntrack: remove l3proto
abstraction") there are no users anymore.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-17 11:40:07 +02:00
Florian Westphal
0935d55884 netfilter: nf_tables: asynchronous release
Release the committed transaction log from a work queue, moving
expensive synchronize_rcu out of the locked section and providing
opportunity to batch this.

On my test machine this cuts runtime of nft-test.py in half.
Based on earlier patch from Pablo Neira Ayuso.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-17 11:40:07 +02:00
Florian Westphal
0ef235c717 netfilter: nf_tables: warn when expr implements only one of activate/deactivate
->destroy is only allowed to free data, or do other cleanups that do not
have side effects on other state, such as visibility to other netlink
requests.

Such things need to be done in ->deactivate.
As a transaction can fail, we need to make sure we can undo such
operations, therefore ->activate() has to be provided too.

So print a warning and refuse registration if expr->ops provides
only one of the two operations.

v2: fix nft_expr_check_ops to not repeat same check twice (Jones Desougi)

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-17 11:40:06 +02:00
Florian Westphal
cd5125d8f5 netfilter: nf_tables: split set destruction in deactivate and destroy phase
Splits unbind_set into destroy_set and unbinding operation.

Unbinding removes set from lists (so new transaction would not
find it anymore) but keeps memory allocated (so packet path continues
to work).

Rebind function is added to allow unrolling in case transaction
that wants to remove set is aborted.

Destroy function is added to free the memory, but this could occur
outside of transaction in the future.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-17 11:29:49 +02:00
Florian Westphal
02b408fae3 netfilter: nf_tables: rt: allow checking if dst has xfrm attached
Useful e.g. to avoid NATting inner headers of to-be-encrypted packets.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-17 11:29:49 +02:00
David S. Miller
0376d5dce0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-09-16

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix end boundary calculation in BTF for the type section, from Martin.

2) Fix and revert subtraction of pointers that was accidentally allowed
   for unprivileged programs, from Alexei.

3) Fix bpf_msg_pull_data() helper by using __GFP_COMP in order to avoid
   a warning in linearizing sg pages into a single one for large allocs,
   from Tushar.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-16 17:47:03 -07:00
Haishuang Yan
a82738adff ip6_gre: simplify gre header parsing in ip6gre_err
Same as ip_gre, use gre_parse_header to parse gre header in gre error
handler code.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-16 15:32:59 -07:00
Haishuang Yan
b0350d51f0 ip_gre: fix parsing gre header in ipgre_err
gre_parse_header stops parsing when csum_err is encountered, which means
tpi->key is undefined and ip_tunnel_lookup will return NULL improperly.

This patch introduce a NULL pointer as csum_err parameter. Even when
csum_err is encountered, it won't return error and continue parsing gre
header as expected.

Fixes: 9f57c67c37 ("gre: Remove support for sharing GRE protocol hook.")
Reported-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-16 15:32:59 -07:00
Davide Caratti
2d550dbad8 net/sched: act_police: don't use spinlock in the data path
use RCU instead of spinlocks, to protect concurrent read/write on
act_police configuration. This reduces the effects of contention in the
data path, in case multiple readers are present.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-16 15:30:22 -07:00
Davide Caratti
93be42f917 net/sched: act_police: use per-cpu counters
use per-CPU counters, instead of sharing a single set of stats with all
cores. This removes the need of using spinlock when statistics are read
or updated.

Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-16 15:30:22 -07:00
Paolo Abeni
eb63f2964d udp6: add missing checks on edumux packet processing
Currently the UDPv6 early demux rx code path lacks some mandatory
checks, already implemented into the normal RX code path - namely
the checksum conversion and no_check6_rx check.

Similar to the previous commit, we move the common processing to
an UDPv6 specific helper and call it from both edemux code path
and normal code path. In respect to the UDPv4, we need to add an
explicit check for non zero csum according to no_check6_rx value.

Reported-by: Jianlin Shi <jishi@redhat.com>
Suggested-by: Xin Long <lucien.xin@gmail.com>
Fixes: c9f2c1ae12 ("udp6: fix socket leak on early demux")
Fixes: 2abb7cdc0d ("udp: Add support for doing checksum unnecessary conversion")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-16 15:27:44 -07:00
Paolo Abeni
2b5a921740 udp4: fix IP_CMSG_CHECKSUM for connected sockets
commit 2abb7cdc0d ("udp: Add support for doing checksum
unnecessary conversion") left out the early demux path for
connected sockets. As a result IP_CMSG_CHECKSUM gives wrong
values for such socket when GRO is not enabled/available.

This change addresses the issue by moving the csum conversion to a
common helper and using such helper in both the default and the
early demux rx path.

Fixes: 2abb7cdc0d ("udp: Add support for doing checksum unnecessary conversion")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-16 15:27:44 -07:00
Sven Eckelmann
a7ea49afcc batman-adv: Enable LockLess TX for softif
The batadv interfaces are virtual interfaces which just tunnel the traffic
over other ethernet compatible interfaces. It doesn't need serialization
during the tx phase and is using RCU for most of its internal
datastructures. Since it doesn't have actual queues which could be locked
independently, the throughput gets significantly reduced by the extra lock
in the core net code.

8 parallel TCP connections forwarded by an IPQ4019 based hardware over
5GHz could reach:

* without LLTX: 349 Mibit/s
* with LLTX:    563 Mibit/s

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-15 09:51:10 +02:00
Sven Eckelmann
dee222c7b2 batman-adv: Move OGM rebroadcast stats to orig_ifinfo
B.A.T.M.A.N. IV requires the number of rebroadcast from a neighboring
originator. These statistics are gathered per interface which transmitted
the OGM (and then received it again). Since an originator is not interface
specific, a resizable array was used in each originator.

This resizable array had an entry for each interface and had to be resizes
(for all OGMs) when the number of active interface was modified. This could
cause problems when a large number of interface is added and not enough
continuous memory is available to allocate the array.

There is already a per interface originator structure "batadv_orig_ifinfo"
which can be used to store this information.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-15 09:51:10 +02:00
Sven Eckelmann
d331a73994 batman-adv: Provide debug messages as trace events
A private debug logging infrastructure is currently provided via
$debug_fs/batman_adv/*/log when CONFIG_BATMAN_ADV_DEBUG is enabled. This is
not well integrated in the rest of the tracing infrastructure of the
kernel.

Other components (like mac80211 or ath10k) allow to gather the debug
messages using generic trace events which are better integrated. This makes
it possible to interact with them using the existing userspace tools.

The tracepoint batadv:batadv_dbg will now be available when
CONFIG_BATMAN_ADV_DEBUG and CONFIG_BATMAN_ADV_TRACING is activated. The log
level mask is still used for filtering as usual.

A full system trace for offline parsing can be created (and read) using:

  $ batctl ll all
  $ trace-cmd record -e batadv:batadv_dbg
  $ trace-cmd report

The same can also be done without recording to a file

  $ batctl ll all
  $ trace-cmd stream -e batadv:batadv_dbg

The trace infrastructure is especially helpful when tracing processes:

  $ batctl ll all
  $ ./tools/perf/perf trace --event "batadv:*" batctl p 10.204.32.1
     0.000 batadv:batadv_dbg:batman_adv bat0 Parsing outgoing ARP REQUEST
     0.045 batadv:batadv_dbg:batman_adv bat0 ARP MSG = [src: a2:64:14:53:f8:22-10.204.32.185 dst: 00:00:00:00:00:00-10.204.32.1]
     0.067 batadv:batadv_dbg:batman_adv bat0 Entry updated: 10.204.32.185 a2:64:14:53:f8:22 (vid: -1)
     0.099 batadv:batadv_dbg:batman_adv bat0 batadv_dat_select_candidates(): IP=10.204.32.1 hash(IP)=48902
     0.757 batadv:batadv_dbg:batman_adv bat0 dat_select_candidates() 0: selected fe:2c:91:68:29:2b addr=48977 dist=65460
     1.178 batadv:batadv_dbg:batman_adv bat0 dat_select_candidates() 1: selected fe:81:ab:c5:e3:03 addr=49181 dist=65256
     1.809 batadv:batadv_dbg:batman_adv bat0 dat_select_candidates() 2: selected 66:25:a7:48:37:fb addr=49328 dist=65109
     1.828 batadv:batadv_dbg:batman_adv bat0 DHT_SEND for 10.204.32.1

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-15 09:51:08 +02:00
Petar Penkov
d58e468b11 flow_dissector: implements flow dissector BPF hook
Adds a hook for programs of type BPF_PROG_TYPE_FLOW_DISSECTOR and
attach type BPF_FLOW_DISSECTOR that is executed in the flow dissector
path. The BPF program is per-network namespace.

Signed-off-by: Petar Penkov <ppenkov@google.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-09-14 12:04:33 -07:00
Sven Eckelmann
dabeb13eee batman-adv: Increase version number to 2018.3
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-14 17:59:20 +02:00
Davide Caratti
34043d250f net/sched: act_sample: fix NULL dereference in the data path
Matteo reported the following splat, testing the datapath of TC 'sample':

 BUG: KASAN: null-ptr-deref in tcf_sample_act+0xc4/0x310
 Read of size 8 at addr 0000000000000000 by task nc/433

 CPU: 0 PID: 433 Comm: nc Not tainted 4.19.0-rc3-kvm #17
 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS ?-20180531_142017-buildhw-08.phx2.fedoraproject.org-1.fc28 04/01/2014
 Call Trace:
  kasan_report.cold.6+0x6c/0x2fa
  tcf_sample_act+0xc4/0x310
  ? dev_hard_start_xmit+0x117/0x180
  tcf_action_exec+0xa3/0x160
  tcf_classify+0xdd/0x1d0
  htb_enqueue+0x18e/0x6b0
  ? deref_stack_reg+0x7a/0xb0
  ? htb_delete+0x4b0/0x4b0
  ? unwind_next_frame+0x819/0x8f0
  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
  __dev_queue_xmit+0x722/0xca0
  ? unwind_get_return_address_ptr+0x50/0x50
  ? netdev_pick_tx+0xe0/0xe0
  ? save_stack+0x8c/0xb0
  ? kasan_kmalloc+0xbe/0xd0
  ? __kmalloc_track_caller+0xe4/0x1c0
  ? __kmalloc_reserve.isra.45+0x24/0x70
  ? __alloc_skb+0xdd/0x2e0
  ? sk_stream_alloc_skb+0x91/0x3b0
  ? tcp_sendmsg_locked+0x71b/0x15a0
  ? tcp_sendmsg+0x22/0x40
  ? __sys_sendto+0x1b0/0x250
  ? __x64_sys_sendto+0x6f/0x80
  ? do_syscall_64+0x5d/0x150
  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ? __sys_sendto+0x1b0/0x250
  ? __x64_sys_sendto+0x6f/0x80
  ? do_syscall_64+0x5d/0x150
  ? entry_SYSCALL_64_after_hwframe+0x44/0xa9
  ip_finish_output2+0x495/0x590
  ? ip_copy_metadata+0x2e0/0x2e0
  ? skb_gso_validate_network_len+0x6f/0x110
  ? ip_finish_output+0x174/0x280
  __tcp_transmit_skb+0xb17/0x12b0
  ? __tcp_select_window+0x380/0x380
  tcp_write_xmit+0x913/0x1de0
  ? __sk_mem_schedule+0x50/0x80
  tcp_sendmsg_locked+0x49d/0x15a0
  ? tcp_rcv_established+0x8da/0xa30
  ? tcp_set_state+0x220/0x220
  ? clear_user+0x1f/0x50
  ? iov_iter_zero+0x1ae/0x590
  ? __fget_light+0xa0/0xe0
  tcp_sendmsg+0x22/0x40
  __sys_sendto+0x1b0/0x250
  ? __ia32_sys_getpeername+0x40/0x40
  ? _copy_to_user+0x58/0x70
  ? poll_select_copy_remaining+0x176/0x200
  ? __pollwait+0x1c0/0x1c0
  ? ktime_get_ts64+0x11f/0x140
  ? kern_select+0x108/0x150
  ? core_sys_select+0x360/0x360
  ? vfs_read+0x127/0x150
  ? kernel_write+0x90/0x90
  __x64_sys_sendto+0x6f/0x80
  do_syscall_64+0x5d/0x150
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7fefef2b129d
 Code: ff ff ff ff eb b6 0f 1f 80 00 00 00 00 48 8d 05 51 37 0c 00 41 89 ca 8b 00 85 c0 75 20 45 31 c9 45 31 c0 b8 2c 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 6b f3 c3 66 0f 1f 84 00 00 00 00 00 41 56 41
 RSP: 002b:00007fff2f5350c8 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
 RAX: ffffffffffffffda RBX: 000056118d60c120 RCX: 00007fefef2b129d
 RDX: 0000000000002000 RSI: 000056118d629320 RDI: 0000000000000003
 RBP: 000056118d530370 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000002000
 R13: 000056118d5c2a10 R14: 000056118d5c2a10 R15: 000056118d5303b8

tcf_sample_act() tried to update its per-cpu stats, but tcf_sample_init()
forgot to allocate them, because tcf_idr_create() was called with a wrong
value of 'cpustats'. Setting it to true proved to fix the reported crash.

Reported-by: Matteo Croce <mcroce@redhat.com>
Fixes: 65a206c01e ("net/sched: Change act_api and act_xxx modules to use IDR")
Fixes: 5c5670fae4 ("net/sched: Introduce sample tc action")
Tested-by: Matteo Croce <mcroce@redhat.com>
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-14 08:46:28 -07:00
Sven Eckelmann
00caf6a2b3 batman-adv: Mark debugfs functionality as deprecated
CONFIG_BATMAN_ADV_DEBUGFS is disabled by default because debugfs is not
supported for batman-adv interfaces in any non-default netns. Any remaining
users of this interface should still be informed about the deprecation and
the generic netlink alternative.

Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-14 10:50:26 +02:00
Simon Wunderlich
138c72efbd batman-adv: Start new development cycle
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-14 10:50:26 +02:00
Johannes Berg
1cebf8f143 socket: fix struct ifreq size in compat ioctl
As reported by Reobert O'Callahan, since Viro's commit to kill
dev_ifsioc() we attempt to copy too much data in compat mode,
which may lead to EFAULT when the 32-bit version of struct ifreq
sits at/near the end of a page boundary, and the next page isn't
mapped.

Fix this by passing the approprate compat/non-compat size to copy
and using that, as before the dev_ifsioc() removal. This works
because only the embedded "struct ifmap" has different size, and
this is only used in SIOCGIFMAP/SIOCSIFMAP which has a different
handler. All other parts of the union are naturally compatible.

This fixes https://bugzilla.kernel.org/show_bug.cgi?id=199469.

Fixes: bf4405737f ("kill dev_ifsioc()")
Reported-by: Robert O'Callahan <robert@ocallahan.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 16:01:06 -07:00
Gustavo A. R. Silva
f91845da9f pktgen: Fix fall-through annotation
Replace "fallthru" with a proper "fall through" annotation.

This fix is part of the ongoing efforts to enabling
-Wimplicit-fallthrough

Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 15:36:41 -07:00
Toke Høiland-Jørgensen
c56cae23c6 gso_segment: Reset skb->mac_len after modifying network header
When splitting a GSO segment that consists of encapsulated packets, the
skb->mac_len of the segments can end up being set wrong, causing packet
drops in particular when using act_mirred and ifb interfaces in
combination with a qdisc that splits GSO packets.

This happens because at the time skb_segment() is called, network_header
will point to the inner header, throwing off the calculation in
skb_reset_mac_len(). The network_header is subsequently adjust by the
outer IP gso_segment handlers, but they don't set the mac_len.

Fix this by adding skb_reset_mac_len() calls to both the IPv4 and IPv6
gso_segment handlers, after they modify the network_header.

Many thanks to Eric Dumazet for his help in identifying the cause of
the bug.

Acked-by: Dave Taht <dave.taht@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 12:09:32 -07:00
Toke Høiland-Jørgensen
50c12f7401 gso_segment: Reset skb->mac_len after modifying network header
When splitting a GSO segment that consists of encapsulated packets, the
skb->mac_len of the segments can end up being set wrong, causing packet
drops in particular when using act_mirred and ifb interfaces in
combination with a qdisc that splits GSO packets.

This happens because at the time skb_segment() is called, network_header
will point to the inner header, throwing off the calculation in
skb_reset_mac_len(). The network_header is subsequently adjust by the
outer IP gso_segment handlers, but they don't set the mac_len.

Fix this by adding skb_reset_mac_len() calls to both the IPv4 and IPv6
gso_segment handlers, after they modify the network_header.

Many thanks to Eric Dumazet for his help in identifying the cause of
the bug.

Acked-by: Dave Taht <dave.taht@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 12:08:40 -07:00
David S. Miller
bffb9b51a2 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth
Johan Hedberg says:

====================
pull request: bluetooth 2018-09-13

A few Bluetooth fixes for the 4.19-rc series:

 - Fixed rw_semaphore leak in hci_ldisc
 - Fixed local Out-of-Band pairing data handling

Let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 12:06:12 -07:00
Sabrina Dubroca
c844eb46b7 tls: clear key material from kernel memory when do_tls_setsockopt_conf fails
Fixes: 3c4d755915 ("tls: kernel TLS support")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 12:03:47 -07:00
Sabrina Dubroca
86029d10af tls: zero the crypto information from tls_context before freeing
This contains key material in crypto_send_aes_gcm_128 and
crypto_recv_aes_gcm_128.

Introduce union tls_crypto_context, and replace the two identical
unions directly embedded in struct tls_context with it. We can then
use this union to clean up the memory in the new tls_ctx_free()
function.

Fixes: 3c4d755915 ("tls: kernel TLS support")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 12:03:47 -07:00
Sabrina Dubroca
7cba09c6d5 tls: don't copy the key out of tls12_crypto_info_aes_gcm_128
There's no need to copy the key to an on-stack buffer before calling
crypto_aead_setkey().

Fixes: 3c4d755915 ("tls: kernel TLS support")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 12:03:47 -07:00
Vasily Khoruzhick
f0e0d04413 neighbour: confirm neigh entries when ARP packet is received
Update 'confirmed' timestamp when ARP packet is received. It shouldn't
affect locktime logic and anyway entry can be confirmed by any higher-layer
protocol. Thus it makes sense to confirm it when ARP packet is received.

Fixes: 77d7123342 ("neighbour: update neigh timestamps iff update is effective")
Signed-off-by: Vasily Khoruzhick <vasilykh@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 12:01:29 -07:00
Roopa Prabhu
56a49d7048 net: rtnl_configure_link: fix dev flags changes arg to __dev_notify_flags
This fix addresses https://bugzilla.kernel.org/show_bug.cgi?id=201071

Commit 5025f7f7d5 wrongly relied on __dev_change_flags to notify users of
dev flag changes in the case when dev->rtnl_link_state = RTNL_LINK_INITIALIZED.
Fix it by indicating flag changes explicitly to __dev_notify_flags.

Fixes: 5025f7f7d5 ("rtnetlink: add rtnl_link_state check in rtnl_configure_link")
Reported-By: Liam mcbirnie <liam.mcbirnie@boeing.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 11:01:32 -07:00
Cong Wang
f5b9bac745 net_sched: notify filter deletion when deleting a chain
When we delete a chain of filters, we need to notify
user-space we are deleting each filters in this chain
too.

Fixes: 32a4f5ecd7 ("net: sched: introduce chain object to uapi")
Cc: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 09:07:40 -07:00
Cong Wang
9708d2b5b7 llc: avoid blocking in llc_sap_close()
llc_sap_close() is called by llc_sap_put() which
could be called in BH context in llc_rcv(). We can't
block in BH.

There is no reason to block it here, kfree_rcu() should
be sufficient.

Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 09:04:58 -07:00
Xin Long
22d0bd82cc ipv6: use rt6_info members when dst is set in rt6_fill_node
In inet6_rtm_getroute, since Commit 93531c6743 ("net/ipv6: separate
handling of FIB entries from dst based routes"), it has used rt->from
to dump route info instead of rt.

However for some route like cache, some of its information like flags
or gateway is not the same as that of the 'from' one. It caused 'ip
route get' to dump the wrong route information.

In Jianlin's testing, the output information even lost the expiration
time for a pmtu route cache due to the wrong fib6_flags.

So change to use rt6_info members for dst addr, src addr, flags and
gateway when it tries to dump a route entry without fibmatch set.

v1->v2:
  - not use rt6i_prefsrc.
  - also fix the gw dump issue.

Fixes: 93531c6743 ("net/ipv6: separate handling of FIB entries from dst based routes")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 08:20:37 -07:00
Andre Naujoks
15033f0457 ipv6: Add sockopt IPV6_MULTICAST_ALL analogue to IP_MULTICAST_ALL
The socket option will be enabled by default to ensure current behaviour
is not changed. This is the same for the IPv4 version.

A socket bound to in6addr_any and a specific port will receive all traffic
on that port. Analogue to IP_MULTICAST_ALL, disable this behaviour, if
one or more multicast groups were joined (using said socket) and only
pass on multicast traffic from groups, which were explicitly joined via
this socket.

Without this option disabled a socket (system even) joined to multiple
multicast groups is very hard to get right. Filtering by destination
address has to take place in user space to avoid receiving multicast
traffic from other multicast groups, which might have traffic on the same
port.

The extension of the IP_MULTICAST_ALL socketoption to just apply to ipv6,
too, is not done to avoid changing the behaviour of current applications.

Signed-off-by: Andre Naujoks <nautsch2@gmail.com>
Acked-By: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 08:17:27 -07:00
Hauke Mehrtens
7969119293 net: dsa: Add Lantiq / Intel GSWIP tag support
This handles the tag added by the PMAC on the VRX200 SoC line.

The GSWIP uses internally a GSWIP special tag which is located after the
Ethernet header. The PMAC which connects the GSWIP to the CPU converts
this special tag used by the GSWIP into the PMAC special tag which is
added in front of the Ethernet header.

This was tested with GSWIP 2.1 found in the VRX200 SoCs, other GSWIP
versions use slightly different PMAC special tags.

Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-13 08:14:33 -07:00
David S. Miller
aaf9253025 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-09-12 22:22:42 -07:00
Nikolay Aleksandrov
435f2e7cc0 net: bridge: add support for sticky fdb entries
Add support for entries which are "sticky", i.e. will not change their port
if they show up from a different one. A new ndm flag is introduced for that
purpose - NTF_STICKY. We allow to set it only to non-local entries.

Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-12 20:30:03 -07:00
Tushar Dave
4c3d795cb0 bpf: use __GFP_COMP while allocating page
Helper bpg_msg_pull_data() can allocate multiple pages while
linearizing multiple scatterlist elements into one shared page.
However, if the shared page has size > PAGE_SIZE, using
copy_page_to_iter() causes below warning.

e.g.
[ 6367.019832] WARNING: CPU: 2 PID: 7410 at lib/iov_iter.c:825
page_copy_sane.part.8+0x0/0x8

To avoid above warning, use __GFP_COMP while allocating multiple
contiguous pages.

Fixes: 015632bb30 ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-12 23:47:28 +02:00
Cong Wang
12a78b026f tipc: check return value of __tipc_dump_start()
When __tipc_dump_start() fails with running out of memory,
we have no reason to continue, especially we should avoid
calling tipc_dump_done().

Fixes: 8f5c5fcf35 ("tipc: call start and done ops directly in __tipc_nl_compat_dumpit()")
Reported-and-tested-by: syzbot+3f8324abccfbf8c74a9f@syzkaller.appspotmail.com
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-12 13:15:04 -07:00
Cong Wang
cc4dfb7f70 rds: fix two RCU related problems
When a rds sock is bound, it is inserted into the bind_hash_table
which is protected by RCU. But when releasing rds sock, after it
is removed from this hash table, it is freed immediately without
respecting RCU grace period. This could cause some use-after-free
as reported by syzbot.

Mark the rds sock with SOCK_RCU_FREE before inserting it into the
bind_hash_table, so that it would be always freed after a RCU grace
period.

The other problem is in rds_find_bound(), the rds sock could be
freed in between rhashtable_lookup_fast() and rds_sock_addref(),
so we need to extend RCU read lock protection in rds_find_bound()
to close this race condition.

Reported-and-tested-by: syzbot+8967084bcac563795dc6@syzkaller.appspotmail.com
Reported-by: syzbot+93a5839deb355537440f@syzkaller.appspotmail.com
Cc: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Cc: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Cc: rds-devel@oss.oracle.com
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oarcle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-12 00:09:19 -07:00
Li RongQing
0041195d55 netlink: remove hash::nelems check in netlink_insert
The type of hash::nelems has been changed from size_t to atom_t
which in fact is int, so not need to check if BITS_PER_LONG, that
is bit number of size_t, is bigger than 32

and rht_grow_above_max() will be called to check if hashtable is
too big, ensure it can not bigger than 1<<31

Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-12 00:08:44 -07:00
Vakul Garg
150085791a net/tls: Fixed return value when tls_complete_pending_work() fails
In tls_sw_sendmsg() and tls_sw_sendpage(), the variable 'ret' has
been set to return value of tls_complete_pending_work(). This allows
return of proper error code if tls_complete_pending_work() fails.

Fixes: 3c4d755915 ("tls: kernel TLS support")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-12 00:04:28 -07:00
Haishuang Yan
51dc63e391 erspan: fix error handling for erspan tunnel
When processing icmp unreachable message for erspan tunnel, tunnel id
should be erspan_net_id instead of ipgre_net_id.

Fixes: 84e54fe0a5 ("gre: introduce native tunnel support for ERSPAN")
Cc: William Tu <u9012063@gmail.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-11 23:50:54 -07:00
Haishuang Yan
5a64506b5c erspan: return PACKET_REJECT when the appropriate tunnel is not found
If erspan tunnel hasn't been established, we'd better send icmp port
unreachable message after receive erspan packets.

Fixes: 84e54fe0a5 ("gre: introduce native tunnel support for ERSPAN")
Cc: William Tu <u9012063@gmail.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Acked-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-11 23:50:53 -07:00
Willem de Bruijn
0297c1c2ea tcp: rate limit synflood warnings further
Convert pr_info to net_info_ratelimited to limit the total number of
synflood warnings.

Commit 946cedccbd ("tcp: Change possible SYN flooding messages")
rate limits synflood warnings to one per listener.

Workloads that open many listener sockets can still see a high rate of
log messages. Syzkaller is one frequent example.

Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-11 23:34:20 -07:00
David S. Miller
4ecdf77091 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patchset contains Netfilter fixes for you net tree:

1) Remove duplicated include at the end of UDP conntrack, from Yue Haibing.

2) Restore conntrack dependency on xt_cluster, from Martin Willi.

3) Fix splat with GSO skbs from the checksum target, from Florian Westphal.

4) Rework ct timeout support, the template strategy to attach custom timeouts
   is not correct since it will not work in conjunction with conntrack zones
   and we have a possible free after use when removing the rule due to missing
   refcounting. To fix these problems, do not use conntrack template at all
   and set custom timeout on the already valid conntrack object. This
   fix comes with a preparation patch to simplify timeout adjustment by
   initializating the first position of the timeout array for all of the
   existing trackers. Patchset from Florian Westphal.

5) Fix missing dependency on from IPv4 chain NAT type, from Florian.

6) Release chain reference counter from the flush path, from Taehee Yoo.

7) After flushing an iptables ruleset, conntrack hooks are unregistered
   and entries are left stale to be cleaned up by the timeout garbage
   collector. No TCP tracking is done on established flows by this time.
   If ruleset is reloaded, then hooks are registered again and TCP
   tracking is restored, which considers packets to be invalid. Clear
   window tracking to exercise TCP flow pickup from the middle given that
   history is lost for us. Again from Florian.

8) Fix crash from netlink interface with CONFIG_NF_CONNTRACK_TIMEOUT=y
   and CONFIG_NF_CT_NETLINK_TIMEOUT=n.

9) Broken CT target due to returning incorrect type from
   ctnl_timeout_find_get().

10) Solve conntrack clash on NF_REPEAT verdicts too, from Michal Vaner.

11) Missing conversion of hashlimit sysctl interface to new API, from
    Cong Wang.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-11 21:17:30 -07:00
Anders Roxell
1edb6e035e net/core/filter: fix unused-variable warning
Building with CONFIG_INET=n will show the warning below:
net/core/filter.c: In function ‘____bpf_getsockopt’:
net/core/filter.c:4048:19: warning: unused variable ‘tp’ [-Wunused-variable]
  struct tcp_sock *tp;
                   ^~
net/core/filter.c:4046:31: warning: unused variable ‘icsk’ [-Wunused-variable]
  struct inet_connection_sock *icsk;
                               ^~~~
Move the variable declarations inside the {} block where they are used.

Fixes: 1e215300f1 ("bpf: add TCP_SAVE_SYN/TCP_SAVED_SYN options for bpf_(set|get)sockopt")
Signed-off-by: Anders Roxell <anders.roxell@linaro.org>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-09-11 14:54:52 -07:00
Matias Karhumaa
4ba5175f2c Bluetooth: Use correct tfm to generate OOB data
In case local OOB data was generated and other device initiated pairing
claiming that it has got OOB data, following crash occurred:

[  222.847853] general protection fault: 0000 [#1] SMP PTI
[  222.848025] CPU: 1 PID: 42 Comm: kworker/u5:0 Tainted: G         C        4.18.0-custom #4
[  222.848158] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[  222.848307] Workqueue: hci0 hci_rx_work [bluetooth]
[  222.848416] RIP: 0010:compute_ecdh_secret+0x5a/0x270 [bluetooth]
[  222.848540] Code: 0c af f5 48 8b 3d 46 de f0 f6 ba 40 00 00 00 be c0 00 60 00 e8 b7 7b c5 f5 48 85 c0 0f 84 ea 01 00 00 48 89 c3 e8 16 0c af f5 <49> 8b 47 38 be c0 00 60 00 8b 78 f8 48 83 c7 48 e8 51 84 c5 f5 48
[  222.848914] RSP: 0018:ffffb1664087fbc0 EFLAGS: 00010293
[  222.849021] RAX: ffff8a5750d7dc00 RBX: ffff8a5671096780 RCX: ffffffffc08bc32a
[  222.849111] RDX: 0000000000000000 RSI: 00000000006000c0 RDI: ffff8a5752003800
[  222.849192] RBP: ffffb1664087fc60 R08: ffff8a57525280a0 R09: ffff8a5752003800
[  222.849269] R10: ffffb1664087fc70 R11: 0000000000000093 R12: ffff8a5674396e00
[  222.849350] R13: ffff8a574c2e79aa R14: ffff8a574c2e796a R15: 020e0e100d010101
[  222.849429] FS:  0000000000000000(0000) GS:ffff8a5752500000(0000) knlGS:0000000000000000
[  222.849518] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  222.849586] CR2: 000055856016a038 CR3: 0000000110d2c005 CR4: 00000000000606e0
[  222.849671] Call Trace:
[  222.849745]  ? sc_send_public_key+0x110/0x2a0 [bluetooth]
[  222.849825]  ? sc_send_public_key+0x115/0x2a0 [bluetooth]
[  222.849925]  smp_recv_cb+0x959/0x2490 [bluetooth]
[  222.850023]  ? _cond_resched+0x19/0x40
[  222.850105]  ? mutex_lock+0x12/0x40
[  222.850202]  l2cap_recv_frame+0x109d/0x3420 [bluetooth]
[  222.850315]  ? l2cap_recv_frame+0x109d/0x3420 [bluetooth]
[  222.850426]  ? __switch_to_asm+0x34/0x70
[  222.850515]  ? __switch_to_asm+0x40/0x70
[  222.850625]  ? __switch_to_asm+0x34/0x70
[  222.850724]  ? __switch_to_asm+0x40/0x70
[  222.850786]  ? __switch_to_asm+0x34/0x70
[  222.850846]  ? __switch_to_asm+0x40/0x70
[  222.852581]  ? __switch_to_asm+0x34/0x70
[  222.854976]  ? __switch_to_asm+0x40/0x70
[  222.857475]  ? __switch_to_asm+0x40/0x70
[  222.859775]  ? __switch_to_asm+0x34/0x70
[  222.861218]  ? __switch_to_asm+0x40/0x70
[  222.862327]  ? __switch_to_asm+0x34/0x70
[  222.863758]  l2cap_recv_acldata+0x266/0x3c0 [bluetooth]
[  222.865122]  hci_rx_work+0x1c9/0x430 [bluetooth]
[  222.867144]  process_one_work+0x210/0x4c0
[  222.868248]  worker_thread+0x41/0x4d0
[  222.869420]  kthread+0x141/0x160
[  222.870694]  ? process_one_work+0x4c0/0x4c0
[  222.871668]  ? kthread_create_worker_on_cpu+0x90/0x90
[  222.872896]  ret_from_fork+0x35/0x40
[  222.874132] Modules linked in: algif_hash algif_skcipher af_alg rfcomm bnep btusb btrtl btbcm btintel snd_intel8x0 cmac intel_rapl_perf vboxvideo(C) snd_ac97_codec bluetooth ac97_bus joydev ttm snd_pcm ecdh_generic drm_kms_helper snd_timer snd input_leds drm serio_raw fb_sys_fops soundcore syscopyarea sysfillrect sysimgblt mac_hid sch_fq_codel ib_iser rdma_cm iw_cm ib_cm ib_core iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi ip_tables x_tables autofs4 btrfs zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq libcrc32c raid1 raid0 multipath linear hid_generic usbhid hid crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd glue_helper ahci psmouse libahci i2c_piix4 video e1000 pata_acpi
[  222.883153] fbcon_switch: detected unhandled fb_set_par error, error code -16
[  222.886774] fbcon_switch: detected unhandled fb_set_par error, error code -16
[  222.890503] ---[ end trace 6504aa7a777b5316 ]---
[  222.890541] RIP: 0010:compute_ecdh_secret+0x5a/0x270 [bluetooth]
[  222.890551] Code: 0c af f5 48 8b 3d 46 de f0 f6 ba 40 00 00 00 be c0 00 60 00 e8 b7 7b c5 f5 48 85 c0 0f 84 ea 01 00 00 48 89 c3 e8 16 0c af f5 <49> 8b 47 38 be c0 00 60 00 8b 78 f8 48 83 c7 48 e8 51 84 c5 f5 48
[  222.890555] RSP: 0018:ffffb1664087fbc0 EFLAGS: 00010293
[  222.890561] RAX: ffff8a5750d7dc00 RBX: ffff8a5671096780 RCX: ffffffffc08bc32a
[  222.890565] RDX: 0000000000000000 RSI: 00000000006000c0 RDI: ffff8a5752003800
[  222.890571] RBP: ffffb1664087fc60 R08: ffff8a57525280a0 R09: ffff8a5752003800
[  222.890576] R10: ffffb1664087fc70 R11: 0000000000000093 R12: ffff8a5674396e00
[  222.890581] R13: ffff8a574c2e79aa R14: ffff8a574c2e796a R15: 020e0e100d010101
[  222.890586] FS:  0000000000000000(0000) GS:ffff8a5752500000(0000) knlGS:0000000000000000
[  222.890591] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  222.890594] CR2: 000055856016a038 CR3: 0000000110d2c005 CR4: 00000000000606e0

This commit fixes a bug where invalid pointer to crypto tfm was used for
SMP SC ECDH calculation when OOB was in use. Solution is to use same
crypto tfm than when generating OOB material on generate_oob() function.

This bug was introduced in commit c0153b0b90 ("Bluetooth: let the crypto
subsystem generate the ecc privkey"). Bug was found by fuzzing kernel SMP
implementation using Synopsys Defensics.

Signed-off-by: Matias Karhumaa <matias.karhumaa@gmail.com>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-09-11 13:33:57 +02:00
Johan Hedberg
94f14e4728 Bluetooth: SMP: Fix trying to use non-existent local OOB data
A remote device may claim that it has received our OOB data, even
though we never geneated it. Add a new flag to track whether we
actually have OOB data, and ignore the remote peer's flag if haven't
generated OOB data.

Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-09-11 13:33:57 +02:00
Steffen Klassert
9e14379378 xfrm: Fix NULL pointer dereference when skb_dst_force clears the dst_entry.
Since commit 222d7dbd25 ("net: prevent dst uses after free")
skb_dst_force() might clear the dst_entry attached to the skb.
The xfrm code don't expect this to happen, so we crash with
a NULL pointer dereference in this case. Fix it by checking
skb_dst(skb) for NULL after skb_dst_force() and drop the packet
in cast the dst_entry was cleared.

Fixes: 222d7dbd25 ("net: prevent dst uses after free")
Reported-by: Tobias Hommel <netdev-list@genoetigt.de>
Reported-by: Kristian Evensen <kristian.evensen@gmail.com>
Reported-by: Wolfgang Walter <linux@stwm.de>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-09-11 11:28:25 +02:00
Cong Wang
1286df269f netfilter: xt_hashlimit: use s->file instead of s->private
After switching to the new procfs API, it is supposed to
retrieve the private pointer from PDE_DATA(file_inode(s->file)),
s->private is no longer referred.

Fixes: 1cd6718272 ("netfilter/x_tables: switch to proc_create_seq_private")
Reported-by: Sami Farin <hvtaifwkbgefbaei@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Christoph Hellwig <hch@lst.de>
Tested-by: Sami Farin <hvtaifwkbgefbaei@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-11 01:35:32 +02:00
Michal 'vorner' Vaner
ad18d7bf68 netfilter: nfnetlink_queue: Solve the NFQUEUE/conntrack clash for NF_REPEAT
NF_REPEAT places the packet at the beginning of the iptables chain
instead of accepting or rejecting it right away. The packet however will
reach the end of the chain and continue to the end of iptables
eventually, so it needs the same handling as NF_ACCEPT and NF_DROP.

Fixes: 368982cd7d ("netfilter: nfnetlink_queue: resolve clash for unconfirmed conntracks")
Signed-off-by: Michal 'vorner' Vaner <michal.vaner@avast.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-11 01:31:47 +02:00
Pablo Neira Ayuso
99e25d071f netfilter: cttimeout: ctnl_timeout_find_get() returns incorrect pointer to type
Compiler did not catch incorrect typing in the rcu hook assignment.

 % nfct add timeout test-tcp inet tcp established 100 close 10 close_wait 10
 % iptables -I OUTPUT -t raw -p tcp -j CT --timeout test-tcp
 dmesg - xt_CT: Timeout policy `test-tcp' can only be used by L3 protocol number 25000

The CT target bails out with incorrect layer 3 protocol number.

Fixes: 6c1fd7dc48 ("netfilter: cttimeout: decouple timeout policy from nfnetlink_cttimeout object")
Reported-by: Harsha Sharma <harshasharmaiitr@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-11 01:31:10 +02:00
Pablo Neira Ayuso
a874752a10 netfilter: conntrack: timeout interface depend on CONFIG_NF_CONNTRACK_TIMEOUT
Now that cttimeout support for nft_ct is in place, these should depend
on CONFIG_NF_CONNTRACK_TIMEOUT otherwise we can crash when dumping the
policy if this option is not enabled.

[   71.600121] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[...]
[   71.600141] CPU: 3 PID: 7612 Comm: nft Not tainted 4.18.0+ #246
[...]
[   71.600188] Call Trace:
[   71.600201]  ? nft_ct_timeout_obj_dump+0xc6/0xf0 [nft_ct]

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-11 01:30:25 +02:00
Florian Westphal
f94e63801a netfilter: conntrack: reset tcp maxwin on re-register
Doug Smythies says:
  Sometimes it is desirable to temporarily disable, or clear,
  the iptables rule set on a computer being controlled via a
  secure shell session (SSH). While unwise on an internet facing
  computer, I also do it often on non-internet accessible computers
  while testing. Recently, this has become problematic, with the
  SSH session being dropped upon re-load of the rule set.

The problem is that when all rules are deleted, conntrack hooks get
unregistered.

In case the rules are re-added later, its possible that tcp window
has moved far enough so that all packets are considered invalid (out of
window) until entry expires (which can take forever, default
established timeout is 5 days).

Fix this by clearing maxwin of existing tcp connections on register.

v2: don't touch entries on hook removal.
v3: remove obsolete expiry check.

Reported-by: Doug Smythies <dsmythies@telus.net>
Fixes: 4d3a57f23d ("netfilter: conntrack: do not enable connection tracking unless needed")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-09-11 01:29:24 +02:00
Cong Wang
11957be20f htb: use anonymous union for simplicity
cl->leaf.q is slightly more readable than cl->un.leaf.q.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-10 10:39:57 -07:00
Cong Wang
8ecc7c8a1c net_sched: remove redundant qdisc lock classes
We no longer take any spinlock on RX path for ingress qdisc,
so this lockdep annotation is no longer needed.

Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-10 10:39:38 -07:00
Vlad Buslov
86c55361e5 net: sched: cls_flower: dump offload count value
Change flower in_hw_count type to fixed-size u32 and dump it as
TCA_FLOWER_IN_HW_COUNT. This change is necessary to properly test shared
blocks and re-offload functionality.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-10 10:35:15 -07:00
David S. Miller
992cba7e27 net: Add and use skb_list_del_init().
It documents what is happening, and eliminates the spurious list
pointer poisoning.

In the long term, in order to get proper list head debugging, we
might want to use the list poison value as the indicator that
an SKB is a singleton and not on a list.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-10 10:06:54 -07:00
David S. Miller
a8305bff68 net: Add and use skb_mark_not_on_list().
An SKB is not on a list if skb->next is NULL.

Codify this convention into a helper function and use it
where we are dequeueing an SKB and need to mark it as such.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-10 10:06:54 -07:00
David S. Miller
1181d629cc sctp: Use skb_queue_is_first().
Instead of direct skb_queue_head pointer accesses.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-10 10:06:53 -07:00
David S. Miller
7957a9dea8 mac80211: Don't access sk_queue_head->next directly.
Use __skb_peek() instead.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-10 10:06:53 -07:00
David S. Miller
596977300a sch_netem: Move private queue handler to generic location.
By hand copies of SKB list handlers do not belong in individual packet
schedulers.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-10 10:06:53 -07:00
David S. Miller
aea890b8b2 sch_htb: Remove local SKB queue handling code.
Instead, adjust __qdisc_enqueue_tail() such that HTB can use it
instead.

The only other caller of __qdisc_enqueue_tail() is
qdisc_enqueue_tail() so we can move the backlog and return value
handling (which HTB doesn't need/want) to the latter.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-10 10:06:52 -07:00
David Ahern
0153167aeb net/ipv6: Remove rt6i_prefsrc
After the conversion to fib6_info, rt6i_prefsrc has a single user that
reads the value and otherwise it is only set. The one reader can be
converted to use rt->from so rt6i_prefsrc can be removed, reducing
rt6_info by another 20 bytes.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-10 10:02:25 -07:00
Al Viro
702ec3072a hidp: fix compat_ioctl
1) no point putting it into fs/compat_ioctl.c when you handle it in
your ->compat_ioctl() anyway.
2) HIDPCONNADD is *not* COMPATIBLE_IOCTL() stuff at all - it does
layout massage (pointer-chasing there)
3) use compat_ptr()

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-10 12:41:07 -04:00
Al Viro
535221481a hidp: constify hidp_connection_add()
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-10 12:41:07 -04:00
Al Viro
89c0c24b4f cmtp: fix compat_ioctl
Use compat_ptr().  And don't mess with fs/compat_ioctl.c

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-10 12:41:06 -04:00
Al Viro
cc04f6e242 bnep: fix compat_ioctl
use compat_ptr() properly and don't bother with fs/compat_ioctl.c -
it's all handled in ->compat_ioctl() anyway.

Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
2018-09-10 12:41:06 -04:00
Yuan-Chi Pang
c420551057 mac80211: fix TX status reporting for ieee80211s
TX status reporting to ieee80211s is through ieee80211s_update_metric.
There are two problems about ieee80211s_update_metric:

1. The purpose is to estimate the fail probability
to a specific link. No need to restrict to data frame.

2. Current implementation does not work if wireless driver does not
pass tx_status with skb.

Fix this by removing ieee80211_is_data condition, passing
ieee80211_tx_status directly to ieee80211s_update_metric, and
putting it in both __ieee80211_tx_status and ieee80211_tx_status_ext.

Signed-off-by: Yuan-Chi Pang <fu3mo6goo@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-10 09:14:28 +02:00
Johannes Berg
cb59bc14e8 mac80211: TDLS: fix skb queue/priority assignment
If the TDLS setup happens over a connection to an AP that
doesn't have QoS, we nevertheless assign a non-zero TID
(skb->priority) and queue mapping, which may confuse us or
drivers later.

Fix it by just assigning the special skb->priority and then
using ieee80211_select_queue() just like other data frames
would go through.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-10 09:13:32 +02:00
Jouni Malinen
119f94a6fe cfg80211: Address some corner cases in scan result channel updating
cfg80211_get_bss_channel() is used to update the RX channel based on the
available frame payload information (channel number from DSSS Parameter
Set element or HT Operation element). This is needed on 2.4 GHz channels
where frames may be received on neighboring channels due to overlapping
frequency range.

This might of some use on the 5 GHz band in some corner cases, but
things are more complex there since there is no n:1 or 1:n mapping
between channel numbers and frequencies due to multiple different
starting frequencies in different operating classes. This could result
in ieee80211_channel_to_frequency() returning incorrect frequency and
ieee80211_get_channel() returning incorrect channel information (or
indication of no match). In the previous implementation, this could
result in some scan results being dropped completely, e.g., for the 4.9
GHz channels. That prevented connection to such BSSs.

Fix this by using the driver-provided channel pointer if
ieee80211_get_channel() does not find matching channel data for the
channel number in the frame payload and if the scan is done with 5 MHz
or 10 MHz channel bandwidth. While doing this, also add comments
describing what the function is trying to achieve to make it easier to
understand what happens here and why.

Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-10 09:13:09 +02:00
Taehee Yoo
5d407b071d ip: frags: fix crash in ip_do_fragment()
A kernel crash occurrs when defragmented packet is fragmented
in ip_do_fragment().
In defragment routine, skb_orphan() is called and
skb->ip_defrag_offset is set. but skb->sk and
skb->ip_defrag_offset are same union member. so that
frag->sk is not NULL.
Hence crash occurrs in skb->sk check routine in ip_do_fragment() when
defragmented packet is fragmented.

test commands:
   %iptables -t nat -I POSTROUTING -j MASQUERADE
   %hping3 192.168.4.2 -s 1000 -p 2000 -d 60000

splat looks like:
[  261.069429] kernel BUG at net/ipv4/ip_output.c:636!
[  261.075753] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[  261.083854] CPU: 1 PID: 1349 Comm: hping3 Not tainted 4.19.0-rc2+ #3
[  261.100977] RIP: 0010:ip_do_fragment+0x1613/0x2600
[  261.106945] Code: e8 e2 38 e3 fe 4c 8b 44 24 18 48 8b 74 24 08 e9 92 f6 ff ff 80 3c 02 00 0f 85 da 07 00 00 48 8b b5 d0 00 00 00 e9 25 f6 ff ff <0f> 0b 0f 0b 44 8b 54 24 58 4c 8b 4c 24 18 4c 8b 5c 24 60 4c 8b 6c
[  261.127015] RSP: 0018:ffff8801031cf2c0 EFLAGS: 00010202
[  261.134156] RAX: 1ffff1002297537b RBX: ffffed0020639e6e RCX: 0000000000000004
[  261.142156] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880114ba9bd8
[  261.150157] RBP: ffff880114ba8a40 R08: ffffed0022975395 R09: ffffed0022975395
[  261.158157] R10: 0000000000000001 R11: ffffed0022975394 R12: ffff880114ba9ca4
[  261.166159] R13: 0000000000000010 R14: ffff880114ba9bc0 R15: dffffc0000000000
[  261.174169] FS:  00007fbae2199700(0000) GS:ffff88011b400000(0000) knlGS:0000000000000000
[  261.183012] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  261.189013] CR2: 00005579244fe000 CR3: 0000000119bf4000 CR4: 00000000001006e0
[  261.198158] Call Trace:
[  261.199018]  ? dst_output+0x180/0x180
[  261.205011]  ? save_trace+0x300/0x300
[  261.209018]  ? ip_copy_metadata+0xb00/0xb00
[  261.213034]  ? sched_clock_local+0xd4/0x140
[  261.218158]  ? kill_l4proto+0x120/0x120 [nf_conntrack]
[  261.223014]  ? rt_cpu_seq_stop+0x10/0x10
[  261.227014]  ? find_held_lock+0x39/0x1c0
[  261.233008]  ip_finish_output+0x51d/0xb50
[  261.237006]  ? ip_fragment.constprop.56+0x220/0x220
[  261.243011]  ? nf_ct_l4proto_register_one+0x5b0/0x5b0 [nf_conntrack]
[  261.250152]  ? rcu_is_watching+0x77/0x120
[  261.255010]  ? nf_nat_ipv4_out+0x1e/0x2b0 [nf_nat_ipv4]
[  261.261033]  ? nf_hook_slow+0xb1/0x160
[  261.265007]  ip_output+0x1c7/0x710
[  261.269005]  ? ip_mc_output+0x13f0/0x13f0
[  261.273002]  ? __local_bh_enable_ip+0xe9/0x1b0
[  261.278152]  ? ip_fragment.constprop.56+0x220/0x220
[  261.282996]  ? nf_hook_slow+0xb1/0x160
[  261.287007]  raw_sendmsg+0x21f9/0x4420
[  261.291008]  ? dst_output+0x180/0x180
[  261.297003]  ? sched_clock_cpu+0x126/0x170
[  261.301003]  ? find_held_lock+0x39/0x1c0
[  261.306155]  ? stop_critical_timings+0x420/0x420
[  261.311004]  ? check_flags.part.36+0x450/0x450
[  261.315005]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.320995]  ? _raw_spin_unlock_irq+0x29/0x40
[  261.326142]  ? cyc2ns_read_end+0x10/0x10
[  261.330139]  ? raw_bind+0x280/0x280
[  261.334138]  ? sched_clock_cpu+0x126/0x170
[  261.338995]  ? check_flags.part.36+0x450/0x450
[  261.342991]  ? __lock_acquire+0x4500/0x4500
[  261.348994]  ? inet_sendmsg+0x11c/0x500
[  261.352989]  ? dst_output+0x180/0x180
[  261.357012]  inet_sendmsg+0x11c/0x500
[ ... ]

v2:
 - clear skb->sk at reassembly routine.(Eric Dumarzet)

Fixes: fa0f527358 ("ip: use rb trees for IP frag queue.")
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-09 14:50:56 -07:00
Vakul Garg
52ea992cfa net/tls: Set count of SG entries if sk_alloc_sg returns -ENOSPC
tls_sw_sendmsg() allocates plaintext and encrypted SG entries using
function sk_alloc_sg(). In case the number of SG entries hit
MAX_SKB_FRAGS, sk_alloc_sg() returns -ENOSPC and sets the variable for
current SG index to '0'. This leads to calling of function
tls_push_record() with 'sg_encrypted_num_elem = 0' and later causes
kernel crash. To fix this, set the number of SG elements to the number
of elements in plaintext/encrypted SG arrays in case sk_alloc_sg()
returns -ENOSPC.

Fixes: 3c4d755915 ("tls: kernel TLS support")
Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-09 08:10:01 -07:00
Vlad Buslov
f20a4d0117 net: sched: act_nat: remove dependency on rtnl lock
According to the new locking rule, we have to take tcf_lock for both
->init() and ->dump(), as RTNL will be removed.

Use tcf spinlock to protect private nat action data from concurrent
modification during dump. (nat init already uses tcf spinlock when changing
action state)

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-08 10:18:25 -07:00
Vlad Buslov
6d7a8df6df net: sched: act_skbedit: remove dependency on rtnl lock
According to the new locking rule, we have to take tcf_lock for both
->init() and ->dump(), as RTNL will be removed.

Use tcf lock to protect skbedit action struct private data from concurrent
modification in init and dump. Use rcu swap operation to reassign params
pointer under protection of tcf lock. (old params value is not used by
init, so there is no need of standalone rcu dereference step)

Remove rtnl lock assertion that is no longer required.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-08 10:17:35 -07:00
Vincent Whitchurch
5cf4a8532c tcp: really ignore MSG_ZEROCOPY if no SO_ZEROCOPY
According to the documentation in msg_zerocopy.rst, the SO_ZEROCOPY
flag was introduced because send(2) ignores unknown message flags and
any legacy application which was accidentally passing the equivalent of
MSG_ZEROCOPY earlier should not see any new behaviour.

Before commit f214f915e7 ("tcp: enable MSG_ZEROCOPY"), a send(2) call
which passed the equivalent of MSG_ZEROCOPY without setting SO_ZEROCOPY
would succeed.  However, after that commit, it fails with -ENOBUFS.  So
it appears that the SO_ZEROCOPY flag fails to fulfill its intended
purpose.  Fix it.

Fixes: f214f915e7 ("tcp: enable MSG_ZEROCOPY")
Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Acked-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-07 23:11:06 -07:00
Cong Wang
a162c35114 net_sched: properly cancel netlink dump on failure
When nla_put*() fails after nla_nest_start(), we need
to call nla_nest_cancel() to cancel the message, otherwise
we end up calling nla_nest_end() like a success.

Fixes: 0ed5269f9e ("net/sched: add tunnel option support to act_tunnel_key")
Cc: Davide Caratti <dcaratti@redhat.com>
Cc: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-07 23:05:07 -07:00
Florian Fainelli
98cdb48071 net: dsa: Expose tagging protocol to user-space
There is no way for user-space to know what a given DSA network device's
tagging protocol is. Expose this information through a dsa/tagging
attribute which reflects the tagging protocol currently in use.

This is helpful for configuration (e.g: none behaves dramatically
different wrt. bridges) as well as for packet capture tools when there
is not a proper Ethernet type available.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-07 14:49:55 -07:00
Dominique Martinet
ef5305f1f7 9p: p9dirent_read: check network-provided name length
strcpy to dirent->d_name could overflow the buffer, use strscpy to check
the provided string length and error out if the size was too big.

While we are here, make the function return an error when the pdu
parsing failed, instead of returning the pdu offset as if it had been a
success...

Link: http://lkml.kernel.org/r/1536339057-21974-4-git-send-email-asmadeus@codewreck.org
Addresses-Coverity-ID: 139133 ("Copy into fixed size buffer")
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:52:29 +09:00
Dominique Martinet
473c7dd1d7 9p/rdma: remove useless check in cm_event_handler
the client c is always dereferenced to get the rdma struct, so c has to
be a valid pointer at this point.
Gcc would optimize that away but let's make coverity happy...

Link: http://lkml.kernel.org/r/1536339057-21974-3-git-send-email-asmadeus@codewreck.org
Addresses-Coverity-ID: 102778 ("Dereference before null check")
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:52:22 +09:00
Gertjan Halkes
2803cf4379 9p: do not trust pdu content for stat item size
v9fs_dir_readdir() could deadloop if a struct was sent with a size set
to -2

Link: http://lkml.kernel.org/r/1536134432-11997-1-git-send-email-asmadeus@codewreck.org
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=88021
Signed-off-by: Gertjan Halkes <gertjan@google.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:40:06 +09:00
Tomas Bortoli
6d35190f39 9p: Rename req to rreq in trans_fd
In struct p9_conn, rename req to rreq as it is used by the read routine.

Link: http://lkml.kernel.org/r/20180903160321.2181-1-tomasbortoli@gmail.com
Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
Suggested-by: Jun Piao <piaojun@huawei.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:40:04 +09:00
Dominique Martinet
8b894adb2b 9p/rdma: do not disconnect on down_interruptible EAGAIN
9p/rdma would sometimes drop the connection and display errors in
recv_done when the user does ^C.
The errors were caused by recv buffers that were posted at the time
of disconnect, and we just do not want to disconnect when
down_interruptible is... interrupted.

Link: http://lkml.kernel.org/r/1535625307-18019-1-git-send-email-asmadeus@codewreck.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:39:47 +09:00
Tomas Bortoli
728356dede 9p: Add refcount to p9_req_t
To avoid use-after-free(s), use a refcount to keep track of the
usable references to any instantiated struct p9_req_t.

This commit adds p9_req_put(), p9_req_get() and p9_req_try_get() as
wrappers to kref_put(), kref_get() and kref_get_unless_zero().
These are used by the client and the transports to keep track of
valid requests' references.

p9_free_req() is added back and used as callback by kref_put().

Add SLAB_TYPESAFE_BY_RCU as it ensures that the memory freed by
kmem_cache_free() will not be reused for another type until the rcu
synchronisation period is over, so an address gotten under rcu read
lock is safe to inc_ref() without corrupting random memory while
the lock is held.

Link: http://lkml.kernel.org/r/1535626341-20693-1-git-send-email-asmadeus@codewreck.org
Co-developed-by: Dominique Martinet <dominique.martinet@cea.fr>
Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
Reported-by: syzbot+467050c1ce275af2a5b8@syzkaller.appspotmail.com
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:39:47 +09:00
Tomas Bortoli
43cbcbee99 9p: rename p9_free_req() function
In sight of the next patch to add a refcount in p9_req_t, rename
the p9_free_req() function in p9_release_req().

In the next patch the actual kfree will be moved to another function.

Link: http://lkml.kernel.org/r/20180811144254.23665-1-tomasbortoli@gmail.com
Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
Acked-by: Jun Piao <piaojun@huawei.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-09-08 01:39:47 +09:00
Dominique Martinet
91a76be37f 9p: add a per-client fcall kmem_cache
Having a specific cache for the fcall allocations helps speed up
end-to-end latency.

The caches will automatically be merged if there are multiple caches
of items with the same size so we do not need to try to share a cache
between different clients of the same size.

Since the msize is negotiated with the server, only allocate the cache
after that negotiation has happened - previous allocations or
allocations of different sizes (e.g. zero-copy fcall) are made with
kmalloc directly.

Some figures on two beefy VMs with Connect-IB (sriov) / trans=rdma,
with ior running 32 processes in parallel doing small 32 bytes IOs:
 - no alloc (4.18-rc7 request cache): 65.4k req/s
 - non-power of two alloc, no patch: 61.6k req/s
 - power of two alloc, no patch: 62.2k req/s
 - non-power of two alloc, with patch: 64.7k req/s
 - power of two alloc, with patch: 65.1k req/s

Link: http://lkml.kernel.org/r/1532943263-24378-2-git-send-email-asmadeus@codewreck.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Acked-by: Jun Piao <piaojun@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Greg Kurz <groug@kaod.org>
2018-09-08 01:39:47 +09:00
Dominique Martinet
523adb6cc1 9p: embed fcall in req to round down buffer allocs
'msize' is often a power of two, or at least page-aligned, so avoiding
an overhead of two dozen bytes for each allocation will help the
allocator do its work and reduce memory fragmentation.

Link: http://lkml.kernel.org/r/1533825236-22896-1-git-send-email-asmadeus@codewreck.org
Suggested-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Reviewed-by: Greg Kurz <groug@kaod.org>
Acked-by: Jun Piao <piaojun@huawei.com>
Cc: Matthew Wilcox <willy@infradead.org>
2018-09-08 01:39:45 +09:00
Marek Lindner
4c4af69008 batman-adv: fix hardif_neigh refcount on queue_work() failure
The hardif_neigh refcounter is to be decreased by the queued work and
currently is never decreased if the queue_work() call fails.
Fix by checking the queue_work() return value and decrease refcount
if necessary.

Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-07 08:14:26 +02:00
Marek Lindner
5af96b9c59 batman-adv: fix backbone_gw refcount on queue_work() failure
The backbone_gw refcounter is to be decreased by the queued work and
currently is never decreased if the queue_work() call fails.
Fix by checking the queue_work() return value and decrease refcount
if necessary.

Signed-off-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-07 08:14:26 +02:00
Jesper Dangaard Brouer
47b123ed9e xdp: split code for map vs non-map redirect
The compiler does an efficient job of inlining static C functions.
Perf top clearly shows that almost everything gets inlined into the
function call xdp_do_redirect.

The function xdp_do_redirect end-up containing and interleaving the
map and non-map redirect code.  This is sub-optimal, as it would be
strange for an XDP program to use both types of redirect in the same
program. The two use-cases are separate, and interleaving the code
just cause more instruction-cache pressure.

I would like to stress (again) that the non-map variant bpf_redirect
is very slow compared to the bpf_redirect_map variant, approx half the
speed.  Measured with driver i40e the difference is:

- map     redirect: 13,250,350 pps
- non-map redirect:  7,491,425 pps

For this reason, the function name of the non-map variant of redirect
have been called xdp_do_redirect_slow.  This hopefully gives a hint
when using perf, that this is not the optimal XDP redirect operating mode.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-09-06 22:34:08 -07:00
Jesper Dangaard Brouer
2a68d85fe1 xdp: explicit inline __xdp_map_lookup_elem
The compiler chooses to not-inline the function __xdp_map_lookup_elem,
because it can see that it is used by both Generic-XDP and native-XDP
do redirect calls (xdp_do_generic_redirect_map and xdp_do_redirect_map).

The compiler cannot know that this is a bad choice, as it cannot know
that a net device cannot run both XDP modes (Generic or Native) at the
same time.  Thus, mark this function inline, even-though we normally
leave this up-to the compiler.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-09-06 22:34:08 -07:00
Jesper Dangaard Brouer
e1302542e3 xdp: unlikely instrumentation for xdp map redirect
Notice the compiler generated ASM code layout was suboptimal.  It
assumed map enqueue errors as the likely case, which is shouldn't.
It assumed that xdp_do_flush_map() was a likely case, due to maps
changing between packets, which should be very unlikely.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-09-06 22:34:08 -07:00
Cong Wang
8f5c5fcf35 tipc: call start and done ops directly in __tipc_nl_compat_dumpit()
__tipc_nl_compat_dumpit() uses a netlink_callback on stack,
so the only way to align it with other ->dumpit() call path
is calling tipc_dump_start() and tipc_dump_done() directly
inside it. Otherwise ->dumpit() would always get NULL from
cb->args[].

But tipc_dump_start() uses sock_net(cb->skb->sk) to retrieve
net pointer, the cb->skb here doesn't set skb->sk, the net pointer
is saved in msg->net instead, so introduce a helper function
__tipc_dump_start() to pass in msg->net.

Ying pointed out cb->args[0...3] are already used by other
callbacks on this call path, so we can't use cb->args[0] any
more, use cb->args[4] instead.

Fixes: 9a07efa9ae ("tipc: switch to rhashtable iterator")
Reported-and-tested-by: syzbot+e93a2c41f91b8e2c7d9b@syzkaller.appspotmail.com
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-06 21:49:18 -07:00
Yi-Hung Wei
fa642f0883 openvswitch: Derive IP protocol number for IPv6 later frags
Currently, OVS only parses the IP protocol number for the first
IPv6 fragment, but sets the IP protocol number for the later fragments
to be NEXTHDF_FRAGMENT.  This patch tries to derive the IP protocol
number for the IPV6 later frags so that we can match that.

Signed-off-by: Yi-Hung Wei <yihung.wei@gmail.com>
Acked-by: Pravin B Shelar <pshelar@ovn.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-06 21:47:49 -07:00
Sven Eckelmann
ae3cdc97dc batman-adv: Prevent duplicated tvlv handler
The function batadv_tvlv_handler_register is responsible for adding new
tvlv_handler to the handler_list. It first checks whether the entry
already is in the list or not. If it is, then the creation of a new entry
is aborted.

But the lock for the list is only held when the list is really modified.
This could lead to duplicated entries because another context could create
an entry with the same key between the check and the list manipulation.

The check and the manipulation of the list must therefore be in the same
locked code section.

Fixes: ef26157747 ("batman-adv: tvlv - basic infrastructure")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-06 14:37:13 +02:00
Sven Eckelmann
e7136e48ff batman-adv: Prevent duplicated global TT entry
The function batadv_tt_global_orig_entry_add is responsible for adding new
tt_orig_list_entry to the orig_list. It first checks whether the entry
already is in the list or not. If it is, then the creation of a new entry
is aborted.

But the lock for the list is only held when the list is really modified.
This could lead to duplicated entries because another context could create
an entry with the same key between the check and the list manipulation.

The check and the manipulation of the list must therefore be in the same
locked code section.

Fixes: d657e621a0 ("batman-adv: add reference counting for type batadv_tt_orig_list_entry")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-06 14:37:12 +02:00
Sven Eckelmann
94cb82f594 batman-adv: Prevent duplicated softif_vlan entry
The function batadv_softif_vlan_get is responsible for adding new
softif_vlan to the softif_vlan_list. It first checks whether the entry
already is in the list or not. If it is, then the creation of a new entry
is aborted.

But the lock for the list is only held when the list is really modified.
This could lead to duplicated entries because another context could create
an entry with the same key between the check and the list manipulation.

The check and the manipulation of the list must therefore be in the same
locked code section.

Fixes: 5d2c05b213 ("batman-adv: add per VLAN interface attribute framework")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-06 14:37:05 +02:00
Sven Eckelmann
fa122fec86 batman-adv: Prevent duplicated nc_node entry
The function batadv_nc_get_nc_node is responsible for adding new nc_nodes
to the in_coding_list and out_coding_list. It first checks whether the
entry already is in the list or not. If it is, then the creation of a new
entry is aborted.

But the lock for the list is only held when the list is really modified.
This could lead to duplicated entries because another context could create
an entry with the same key between the check and the list manipulation.

The check and the manipulation of the list must therefore be in the same
locked code section.

Fixes: d56b1705e2 ("batman-adv: network coding - detect coding nodes and remove these after timeout")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-06 13:55:58 +02:00
Sven Eckelmann
dff9bc42ab batman-adv: Prevent duplicated gateway_node entry
The function batadv_gw_node_add is responsible for adding new gw_node to
the gateway_list. It is expecting that the caller already checked that
there is not already an entry with the same key or not.

But the lock for the list is only held when the list is really modified.
This could lead to duplicated entries because another context could create
an entry with the same key between the check and the list manipulation.

The check and the manipulation of the list must therefore be in the same
locked code section.

Fixes: c6c8fea297 ("net: Add batman-adv meshing protocol")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-06 13:55:20 +02:00
Sven Eckelmann
a25bab9d72 batman-adv: Fix segfault when writing to sysfs elp_interval
The per hardif sysfs file "batman_adv/elp_interval" is using the generic
functions to store/show uint values. The helper __batadv_store_uint_attr
requires the softif net_device as parameter to print the resulting change
as info text when the users writes to this file. It uses the helper
function batadv_info to add it at the same time to the kernel ring buffer
and to the batman-adv debug log (when CONFIG_BATMAN_ADV_DEBUG is enabled).

The function batadv_info requires as first parameter the batman-adv softif
net_device. This parameter is then used to find the private buffer which
contains the debug log for this batman-adv interface. But
batadv_store_throughput_override used as first argument the slave
net_device. This slave device doesn't have the batadv_priv private data
which is access by batadv_info.

Writing to this file with CONFIG_BATMAN_ADV_DEBUG enabled can either lead
to a segfault or to memory corruption.

Fixes: 0744ff8fa8 ("batman-adv: Add hard_iface specific sysfs wrapper macros for UINT")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-06 13:54:48 +02:00
Sven Eckelmann
b9fd14c208 batman-adv: Fix segfault when writing to throughput_override
The per hardif sysfs file "batman_adv/throughput_override" prints the
resulting change as info text when the users writes to this file. It uses
the helper function batadv_info to add it at the same time to the kernel
ring buffer and to the batman-adv debug log (when CONFIG_BATMAN_ADV_DEBUG
is enabled).

The function batadv_info requires as first parameter the batman-adv softif
net_device. This parameter is then used to find the private buffer which
contains the debug log for this batman-adv interface. But
batadv_store_throughput_override used as first argument the slave
net_device. This slave device doesn't have the batadv_priv private data
which is access by batadv_info.

Writing to this file with CONFIG_BATMAN_ADV_DEBUG enabled can either lead
to a segfault or to memory corruption.

Fixes: 0b5ecc6811 ("batman-adv: add throughput override attribute to hard_ifaces")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Marek Lindner <mareklindner@neomailbox.ch>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-06 13:54:48 +02:00
Sven Eckelmann
88d0895d0e batman-adv: Avoid probe ELP information leak
The probe ELPs for WiFi interfaces are expanded to contain at least
BATADV_ELP_MIN_PROBE_SIZE bytes. This is usually a lot more than the
number of bytes which the template ELP packet requires.

These extra padding bytes were not initialized and thus could contain data
which were previously stored at the same location. It is therefore required
to set it to some predefined or random values to avoid leaking private
information from the system transmitting these kind of packets.

Fixes: e4623c913508 ("batman-adv: Avoid probe ELP information leak")
Signed-off-by: Sven Eckelmann <sven@narfation.org>
Acked-by: Antonio Quartulli <a@unstable.cc>
Signed-off-by: Simon Wunderlich <sw@simonwunderlich.de>
2018-09-06 13:54:47 +02:00
Julian Wiedmann
b7f4156554 net/iucv: declare iucv_path_table_empty() as static
Fixes a compile warning.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-05 22:32:22 -07:00
Julian Wiedmann
b2f543949a net/af_iucv: fix skb handling on HiperTransport xmit error
When sending an skb, afiucv_hs_send() bails out on various error
conditions. But currently the caller has no way of telling whether the
skb was freed or not - resulting in potentially either
a) leaked skbs from iucv_send_ctrl(), or
b) double-free's from iucv_sock_sendmsg().

As dev_queue_xmit() will always consume the skb (even on error), be
consistent and also free the skb from all other error paths. This way
callers no longer need to care about managing the skb.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-05 22:32:22 -07:00
Julian Wiedmann
222440996d net/af_iucv: drop inbound packets with invalid flags
Inbound packets may have any combination of flag bits set in their iucv
header. If we don't know how to handle a specific combination, drop the
skb instead of leaking it.

To clarify what error is returned in this case, replace the hard-coded
0 with the corresponding macro.

Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-05 22:32:21 -07:00
Christian Brauner
203651b665 ipv6: add inet6_fill_args
inet6_fill_if{addr,mcaddr, acaddr}() already took 6 arguments which
meant the 7th argument would need to be pushed onto the stack on x86.
Add a new struct inet6_fill_args which holds common information passed
to inet6_fill_if{addr,mcaddr, acaddr}() and shortens the functions to
three pointer arguments.

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-05 22:27:11 -07:00
Christian Brauner
978a46fa6c ipv4: add inet_fill_args
inet_fill_ifaddr() already took 6 arguments which meant the 7th argument
would need to be pushed onto the stack on x86.
Add a new struct inet_fill_args which holds common information passed
to inet_fill_ifaddr() and shortens the function to three pointer arguments.

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Kirill Tkhai <ktkhai@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-05 22:27:11 -07:00
Christian Brauner
7e4a8d5a93 rtnetlink: s/IFLA_IF_NETNSID/IFLA_TARGET_NETNSID/g
IFLA_TARGET_NETNSID is the new alias for IFLA_IF_NETNSID. This commit
replaces all occurrences of IFLA_IF_NETNSID with the new alias to
indicate that this identifier is the preferred one.

Signed-off-by: Christian Brauner <christian@brauner.io>
Cc: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Cc: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-05 22:27:11 -07:00
Christian Brauner
87ccbb1f94 rtnetlink: move type calculation out of loop
I don't see how the type - which is one of
RTM_{GETADDR,GETROUTE,GETNETCONF} - can change. So do the message type
calculation once before entering the for loop.

Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-05 22:27:11 -07:00
Christian Brauner
6ecf4c37eb ipv6: enable IFA_TARGET_NETNSID for RTM_GETADDR
- Backwards Compatibility:
  If userspace wants to determine whether ipv6 RTM_GETADDR requests
  support the new IFA_TARGET_NETNSID property it should verify that the
  reply includes the IFA_TARGET_NETNSID property. If it does not
  userspace should assume that IFA_TARGET_NETNSID is not supported for
  ipv6 RTM_GETADDR requests on this kernel.
- From what I gather from current userspace tools that make use of
  RTM_GETADDR requests some of them pass down struct ifinfomsg when they
  should actually pass down struct ifaddrmsg. To not break existing
  tools that pass down the wrong struct we will do the same as for
  RTM_GETLINK | NLM_F_DUMP requests and not error out when the
  nlmsg_parse() fails.

- Security:
  Callers must have CAP_NET_ADMIN in the owning user namespace of the
  target network namespace.

Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-05 22:27:11 -07:00
Christian Brauner
d38071455f ipv4: enable IFA_TARGET_NETNSID for RTM_GETADDR
- Backwards Compatibility:
  If userspace wants to determine whether ipv4 RTM_GETADDR requests
  support the new IFA_TARGET_NETNSID property it should verify that the
  reply includes the IFA_TARGET_NETNSID property. If it does not
  userspace should assume that IFA_TARGET_NETNSID is not supported for
  ipv4 RTM_GETADDR requests on this kernel.
- From what I gather from current userspace tools that make use of
  RTM_GETADDR requests some of them pass down struct ifinfomsg when they
  should actually pass down struct ifaddrmsg. To not break existing
  tools that pass down the wrong struct we will do the same as for
  RTM_GETLINK | NLM_F_DUMP requests and not error out when the
  nlmsg_parse() fails.

- Security:
  Callers must have CAP_NET_ADMIN in the owning user namespace of the
  target network namespace.

Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-05 22:27:11 -07:00
Christian Brauner
c383edc424 rtnetlink: add rtnl_get_net_ns_capable()
get_target_net() will be used in follow-up patches in ipv{4,6} codepaths to
retrieve network namespaces based on network namespace identifiers. So
remove the static declaration and export in the rtnetlink header. Also,
rename it to rtnl_get_net_ns_capable() to make it obvious what this
function is doing.
Export rtnl_get_net_ns_capable() so it can be used when ipv6 is built as
a module.

Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-05 22:27:11 -07:00
Davide Caratti
ee28bb56ac net/sched: fix memory leak in act_tunnel_key_init()
If users try to install act_tunnel_key 'set' rules with duplicate values
of 'index', the tunnel metadata are allocated, but never released. Then,
kmemleak complains as follows:

 # tc a a a tunnel_key set src_ip 1.1.1.1 dst_ip 2.2.2.2 id 42 index 111
 # echo clear > /sys/kernel/debug/kmemleak
 # tc a a a tunnel_key set src_ip 1.1.1.1 dst_ip 2.2.2.2 id 42 index 111
 Error: TC IDR already exists.
 We have an error talking to the kernel
 # echo scan > /sys/kernel/debug/kmemleak
 # cat /sys/kernel/debug/kmemleak
 unreferenced object 0xffff8800574e6c80 (size 256):
   comm "tc", pid 5617, jiffies 4298118009 (age 57.990s)
   hex dump (first 32 bytes):
     00 00 00 00 00 00 00 00 00 1c e8 b0 ff ff ff ff  ................
     81 24 c2 ad ff ff ff ff 00 00 00 00 00 00 00 00  .$..............
   backtrace:
     [<00000000b7afbf4e>] tunnel_key_init+0x8a5/0x1800 [act_tunnel_key]
     [<000000007d98fccd>] tcf_action_init_1+0x698/0xac0
     [<0000000099b8f7cc>] tcf_action_init+0x15c/0x590
     [<00000000dc60eebe>] tc_ctl_action+0x336/0x5c2
     [<000000002f5a2f7d>] rtnetlink_rcv_msg+0x357/0x8e0
     [<000000000bfe7575>] netlink_rcv_skb+0x124/0x350
     [<00000000edab656f>] netlink_unicast+0x40f/0x5d0
     [<00000000b322cdcb>] netlink_sendmsg+0x6e8/0xba0
     [<0000000063d9d490>] sock_sendmsg+0xb3/0xf0
     [<00000000f0d3315a>] ___sys_sendmsg+0x654/0x960
     [<00000000c06cbd42>] __sys_sendmsg+0xd3/0x170
     [<00000000ce72e4b0>] do_syscall_64+0xa5/0x470
     [<000000005caa2d97>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
     [<00000000fac1b476>] 0xffffffffffffffff

This problem theoretically happens also in case users attempt to setup a
geneve rule having wrong configuration data, or when the kernel fails to
allocate 'params_new'. Ensure that tunnel_key_init() releases the tunnel
metadata also in the above conditions.

Addresses-Coverity-ID: 1373974 ("Resource leak")
Fixes: d0f6dd8a91 ("net/sched: Introduce act_tunnel_key")
Fixes: 0ed5269f9e ("net/sched: add tunnel option support to act_tunnel_key")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-05 22:18:54 -07:00
Cong Wang
0a3b8b2b21 tipc: orphan sock in tipc_release()
Before we unlock the sock in tipc_release(), we have to
detach sk->sk_socket from sk, otherwise a parallel
tipc_sk_fill_sock_diag() could stil read it after we
free this socket.

Fixes: c30b70deb5 ("tipc: implement socket diagnostics for AF_TIPC")
Reported-and-tested-by: syzbot+48804b87c16588ad491d@syzkaller.appspotmail.com
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-05 22:14:00 -07:00
Dmitry Safonov
428f944bd5 netlink: Make groups check less stupid in netlink_bind()
As Linus noted, the test for 0 is needless, groups type can follow the
usual kernel style and 8*sizeof(unsigned long) is BITS_PER_LONG:

> The code [..] isn't technically incorrect...
> But it is stupid.
> Why stupid? Because the test for 0 is pointless.
>
> Just doing
>        if (nlk->ngroups < 8*sizeof(groups))
>                groups &= (1UL << nlk->ngroups) - 1;
>
> would have been fine and more understandable, since the "mask by shift
> count" already does the right thing for a ngroups value of 0. Now that
> test for zero makes me go "what's special about zero?". It turns out
> that the answer to that is "nothing".
[..]
> The type of "groups" is kind of silly too.
>
> Yeah, "long unsigned int" isn't _technically_ wrong. But we normally
> call that type "unsigned long".

Cleanup my piece of pointlessness.

Cc: "David S. Miller" <davem@davemloft.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: netdev@vger.kernel.org
Fairly-blamed-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-05 22:11:33 -07:00
Vincent Whitchurch
fa788d986a packet: add sockopt to ignore outgoing packets
Currently, the only way to ignore outgoing packets on a packet socket is
via the BPF filter.  With MSG_ZEROCOPY, packets that are looped into
AF_PACKET are copied in dev_queue_xmit_nit(), and this copy happens even
if the filter run from packet_rcv() would reject them.  So the presence
of a packet socket on the interface takes away the benefits of
MSG_ZEROCOPY, even if the packet socket is not interested in outgoing
packets.  (Even when MSG_ZEROCOPY is not used, the skb is unnecessarily
cloned, but the cost for that is much lower.)

Add a socket option to allow AF_PACKET sockets to ignore outgoing
packets to solve this.  Note that the *BSDs already have something
similar: BIOCSSEESENT/BIOCSDIRECTION and BIOCSDIRFILT.

The first intended user is lldpd.

Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-05 22:09:37 -07:00
Bob Copeland
6eae4a6c2b mac80211: fix pending queue hang due to TX_DROP
In our environment running lots of mesh nodes, we are seeing the
pending queue hang periodically, with the debugfs queues file showing
lines such as:

    00: 0x00000000/348

i.e. there are a large number of frames but no stop reason set.

One way this could happen is if queue processing from the pending
tasklet exited early without processing all frames, and without having
some future event (incoming frame, stop reason flag, ...) to reschedule
it.

Exactly this can occur today if ieee80211_tx() returns false due to
packet drops or power-save buffering in the tx handlers.  In the
past, this function would return true in such cases, and the change
to false doesn't seem to be intentional.  Fix this case by reverting
to the previous behavior.

Fixes: bb42f2d13f ("mac80211: Move reorder-sensitive TX handlers to after TXQ dequeue")
Signed-off-by: Bob Copeland <bobcopeland@fb.com>
Acked-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 12:54:27 +02:00
Stanislaw Gruszka
014f5a250f cfg80211: validate wmm rule when setting
Add validation check for wmm rule when copy rules from fwdb and print
error when rule is invalid.

Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:16:59 +02:00
Gustavo A. R. Silva
40b5a0f8c6 mac80211: remove unnecessary NULL check
Both old and new cannot be NULL at the same time, hence checking
new when old is not NULL is unnecessary.

Also, notice that new is being dereferenced before it is checked:

	idx = new->conf.keyidx;

The above triggers a static code analysis warning.

Address this by removing the NULL check on new and adding a code
comment based on the following piece of code:

387        /* caller must provide at least one old/new */
388        if (WARN_ON(!new && !old))
389                return 0;

Addresses-Coverity-ID: 1473176 ("Dereference before null check")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:15:48 +02:00
Sara Sharon
9739fe29a2 mac80211: add an option for drivers to check if packets can be aggregated
Some hardwares have limitations on the packets' type in AMSDU.
Add an optional driver callback to determine if two skbs can
be used in the same AMSDU or not.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:11:50 +02:00
Sara Sharon
edba6bdad6 mac80211: allow AMSDU size limitation per-TID
Some drivers may have AMSDU size limitation per TID, due to
HW constrains. Add an option to set this limit.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:10:26 +02:00
Sara Sharon
0eeb2b674f mac80211: add an option for station management TXQ
We have a TXQ abstraction for non-data packets that need
powersave buffering. Since the AP cannot sleep, in case
of station we can use this TXQ for all management frames,
regardless if they are bufferable. Add HW flag to allow
that.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:10:11 +02:00
Shaul Triebitz
add7453ad6 wireless: align to draft 11ax D3.0
Align to new 11ax draft D3.0.  Change/add new MAC and PHY capabilities
and update drivers' 11ax capabilities and mac80211's debugfs
accordingly.

Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:09:50 +02:00
Naftali Goldstein
77cbbc35a4 mac80211: fix saving a few HE values
After masking the he_oper_params, to get the requested values as
integers one must rshift and not lshift.  Fix that by using the
le32_get_bits() macro.

Fixes: 41cbb0f5a2 ("mac80211: add support for HE")
Signed-off-by: Naftali Goldstein <naftali.goldstein@intel.com>
[converted to use le32_get_bits()]
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:08:42 +02:00
Shaul Triebitz
c3d1f87528 mac80211: support reporting 0-length PSDU in radiotap
For certain sounding frames, it may be useful to report them
to userspace even though they don't have a PSDU in order to
determine the PHY parameters (e.g. VHT rate/stream config.)
Add support for this to mac80211.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:08:25 +02:00
Andrei Otcheretianski
24f33e64fc cfg80211: reg: Init wiphy_idx in regulatory_hint_core()
Core regulatory hints didn't set wiphy_idx to WIPHY_IDX_INVALID. Since
the regulatory request is zeroed, wiphy_idx was always implicitly set to
0. This resulted in updating only phy #0.
Fix that.

Fixes: 806a9e3967 ("cfg80211: make regulatory_request use wiphy_idx instead of wiphy")
Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
[add fixes tag]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:07:53 +02:00
Andrei Otcheretianski
94a5b3acd0 mac80211: Don't wake up from PS for offchannel TX
Otherwise the offchannel frame might be queued due to
IEEE80211_QUEUE_STOP_REASON_PS and later dropped (in
ieee80211_tx_frags()).  Anyway, it doesn't make much sense to wake up
the device during ROC.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:05:02 +02:00
Andrei Otcheretianski
8682250b3c mac80211: Always report TX status
If a frame is dropped for any reason, mac80211 wouldn't report the TX
status back to user space.

As the user space may rely on the TX_STATUS to kick its state
machines, resends etc, it's better to just report this frame as not
acked instead.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:04:51 +02:00
Alexander Wetzel
62872a9b9a mac80211: Fix PTK rekey freezes and clear text leak
Rekeying PTK keys without "Extended Key ID for Individually Addressed
Frames" did use a procedure not suitable to replace in-use keys and
could caused the following issues:

 1) Freeze caused by incoming frames:
    If the local STA installed the key prior to the remote STA we still
    had the old key active in the hardware when mac80211 switched over
    to the new key.
    Therefore there was a window where the card could hand over frames
    decoded with the old key to mac80211 and bump the new PN (IV) value
    to an incorrect high number. When it happened the local replay
    detection silently started to drop all frames sent with the new key.

 2) Freeze caused by outgoing frames:
    If mac80211 was providing the PN (IV) and handed over a clear text
    frame for encryption to the hardware prior to a key change the
    driver/card could have processed the queued frame after switching
    to the new key. This bumped the PN value on the remote STA to an
    incorrect high number, tricking the remote STA to discard all frames
    we sent later.

 3) Freeze caused by RX aggregation reorder buffer:
    An aggregation session started with the old key and ending after the
    switch to the new key also bumped the PN to an incorrect high number,
    freezing the connection quite similar to 1).

 4) Freeze caused by repeating lost frames in an aggregation session:
    A driver could repeat a lost frame and encrypt it with the new key
    while in a TX aggregation session without updating the PN for the
    new key. This also could freeze connections similar to 2).

 5) Clear text leak:
    Removing encryption offload from the card cleared the encryption
    offload flag only after the card had deleted the key and we did not
    stop TX during the rekey. The driver/card could therefore get
    unencrypted frames from mac80211 while no longer be instructed to
    encrypt them.

To prevent those issues the key install logic has been changed:
 - Mac80211 divers known to be able to rekey PTK0 keys have to set
   @NL80211_EXT_FEATURE_CAN_REPLACE_PTK0,
 - mac80211 stops queuing frames depending on the key during the replace
 - the key is first replaced in the hardware and after that in mac80211
 - and mac80211 stops/blocks new aggregation sessions during the rekey.

For drivers not setting
@NL80211_EXT_FEATURE_CAN_REPLACE_PTK0 the user space must avoid PTK
rekeys if "Extended Key ID for Individually Addressed Frames" is not
being used. Rekeys for mac80211 drivers without this flag will generate a
warning and use an extra call to ieee80211_flush_queues() to both
highlight and try to prevent the issues with not updated drivers.

The core of the fix changes the key install procedure from:
 - atomic switch over to the new key in mac80211
 - remove the old key in the hardware (stops encryption offloading, fall
   back to software encryption with a potential clear text packet leak
   in between)
 - delete the inactive old key in mac80211
 - enable hardware encryption offloading for the new key
to:
 - if it's a PTK mark the old key as tainted to drop TX frames with the
   outgoing key
 - replace the key in hardware with the new one
 - atomic switch over to the new (not marked as tainted) key in
   mac80211 (which also resumes TX)
 - delete the inactive old key in mac80211

With the new sequence the hardware will be unable to decrypt frames
encrypted with the old key prior to switching to the new key in mac80211
and thus prevent PNs from packets decrypted with the old key to be
accounted against the new key.

For that to work the drivers have to provide a clear boundary.
Mac80211 drivers setting @NL80211_EXT_FEATURE_CAN_REPLACE_PTK0 confirm
to provide it and mac80211 will then be able to correctly rekey in-use
PTK keys with those drivers.

The mac80211 requirements for drivers to set the flag have been added to
the "Hardware crypto acceleration" documentation section. It drills down
to:
The drivers must not hand over frames decrypted with the old key to
mac80211 once the call to set_key() with %DISABLE_KEY has been
completed. It's allowed to either drop or continue to use the old key
for any outgoing frames which are already in the queues, but it must not
send out any of them unencrypted or encrypted with the new key.

Even with the new boundary in place aggregation sessions with the
reorder buffer are problematic:
RX aggregation session started prior and completed after the rekey could
still dump frames received with the old key at mac80211 after it
switched over to the new key. This is side stepped by stopping all (RX
and TX) aggregation sessions when replacing a PTK key and hardware key
offloading.
Stopping TX aggregation sessions avoids the need to get
the PNs (IVs) updated in frames prepared for the old key and
(re)transmitted after the switch to the new key. As a bonus it improves
the compatibility when the remote STA is not handling rekeys as it
should.

When using software crypto aggregation sessions are not stopped.
Mac80211 won't be able to decode the dangerous frames and discard them
without special handling.

Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de>
[trim overly long rekey warning]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:03:17 +02:00
Shaul Triebitz
d1332e7be2 mac80211: support radiotap L-SIG data
As before with HE, the data needs to be provided by the
driver in the skb head, since there's not enough space
in the skb CB.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:03:15 +02:00
Wen Gong
70e53669c4 mac80211: Store sk_pacing_shift in ieee80211_hw
Make it possibly for drivers to adjust the default skb_pacing_shift
by storing it in the hardware struct.

Signed-off-by: Wen Gong <wgong@codeaurora.org>
[adjust commit log, move & adjust comment]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:03:15 +02:00
Johannes Berg
e80d642552 mac80211: copy VHT EXT NSS BW Support/Capable data to station
When taking VHT capabilities for a station, copy the new
fields if we support them as a transmitter. Also adjust
the maximum bandwidth the station supports appropriately.

Also, since it was missing, copy tx_highest and rx_highest.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:03:14 +02:00
Johannes Berg
7eb26df297 mac80211: add ability to parse CCFS2
With newer VHT implementations, it's necessary to look at the
HT operation's CCFS2 field to identify the actual bandwidth
used.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:03:14 +02:00
Johannes Berg
09b4a4faf9 mac80211: introduce capability flags for VHT EXT NSS support
Depending on whether or not rate control supports selecting
rates depending on the bandwidth, we can use VHT extended
NSS support. In essence, this is dot11VHTExtendedNSSBWCapable
from the spec, since depending on that we'll need to parse
the bandwidth.

If needed, also set/clear the VHT Capability Element bit for
this capability so that we don't advertise it erroneously or
don't advertise it when we actually use it.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:03:14 +02:00
Johannes Berg
b0aa75f0b1 ieee80211: add new VHT capability fields/parsing
IEEE 802.11-2016 extended the VHT capability fields to allow
indicating the number of spatial streams depending on the
actually used bandwidth, add support for decoding this.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:03:14 +02:00
Shaul Triebitz
34fb190ec0 mac80211: in AP mode, set bss_conf::he_supported
In AP mode, If AP advertises HE capabilities, set to true
bss_conf::he_supported so that the Driver knows about it.

Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:03:13 +02:00
Shaul Triebitz
244eb9ae79 cfg80211: add he_capabilities (ext) IE to AP settings
Same as for HT and VHT.
This helps the lower level to know whether the AP supports HE.

Signed-off-by: Shaul Triebitz <shaul.triebitz@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:03:13 +02:00
Johannes Berg
adf8ed01e4 mac80211: add an optional TXQ for other PS-buffered frames
Some drivers may want to also use the TXQ abstraction with
non-data packets that need powersave buffering, so add a
hardware flag to allow this.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:03:13 +02:00
Johannes Berg
331aead520 mac80211: use le16_encode_bits() instead of open-coding
Instead of open-coding the equivalent of le16_encode_bits(),
just use that.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:03:13 +02:00
Johannes Berg
6f527287e0 mac80211: remove pointless 'params' NULL checks
These checks aren't necessary, cfg80211 never passes NULL.
Some static checkers complain about the missing checks on
the next line, but really the NULL checks are unnecessary.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:03:13 +02:00
Ido Yariv
80aaa9c164 mac80211: Add he_capa debugfs entry
Export HE capabilities information via debugfs, similar to HT & VHT.

Signed-off-by: Ido Yariv <idox.yariv@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-05 10:03:13 +02:00
David S. Miller
36302685f5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net 2018-09-04 21:33:03 -07:00
Linus Torvalds
28619527b8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Must perform TXQ teardown before unregistering interfaces in
    mac80211, from Toke Høiland-Jørgensen.

 2) Don't allow creating mac80211_hwsim with less than one channel, from
    Johannes Berg.

 3) Division by zero in cfg80211, fix from Johannes Berg.

 4) Fix endian issue in tipc, from Haiqing Bai.

 5) BPF sockmap use-after-free fixes from Daniel Borkmann.

 6) Spectre-v1 in mac80211_hwsim, from Jinbum Park.

 7) Missing rhashtable_walk_exit() in tipc, from Cong Wang.

 8) Revert kvzalloc() conversion of AF_PACKET, it breaks mmap() when
    kvzalloc() tries to use kmalloc() pages. From Eric Dumazet.

 9) Fix deadlock in hv_netvsc, from Dexuan Cui.

10) Do not restart timewait timer on RST, from Florian Westphal.

11) Fix double lwstate refcount grab in ipv6, from Alexey Kodanev.

12) Unsolicit report count handling is off-by-one, fix from Hangbin Liu.

13) Sleep-in-atomic in cadence driver, from Jia-Ju Bai.

14) Respect ttl-inherit in ip6 tunnel driver, from Hangbin Liu.

15) Use-after-free in act_ife, fix from Cong Wang.

16) Missing hold to meta module in act_ife, from Vlad Buslov.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (91 commits)
  net: phy: sfp: Handle unimplemented hwmon limits and alarms
  net: sched: action_ife: take reference to meta module
  act_ife: fix a potential use-after-free
  net/mlx5: Fix SQ offset in QPs with small RQ
  tipc: correct spelling errors for tipc_topsrv_queue_evt() comments
  tipc: correct spelling errors for struct tipc_bc_base's comment
  bnxt_en: Do not adjust max_cp_rings by the ones used by RDMA.
  bnxt_en: Clean up unused functions.
  bnxt_en: Fix firmware signaled resource change logic in open.
  sctp: not traverse asoc trans list if non-ipv6 trans exists for ipv6_flowlabel
  sctp: fix invalid reference to the index variable of the iterator
  net/ibm/emac: wrong emac_calc_base call was used by typo
  net: sched: null actions array pointer before releasing action
  vhost: fix VHOST_GET_BACKEND_FEATURES ioctl request definition
  r8169: add support for NCube 8168 network card
  ip6_tunnel: respect ttl inherit for ip6tnl
  mac80211: shorten the IBSS debug messages
  mac80211: don't Tx a deauth frame if the AP forbade Tx
  mac80211: Fix station bandwidth setting after channel switch
  mac80211: fix a race between restart and CSA flows
  ...
2018-09-04 12:45:11 -07:00
Vlad Buslov
84cb8eb26c net: sched: action_ife: take reference to meta module
Recent refactoring of add_metainfo() caused use_all_metadata() to add
metainfo to ife action metalist without taking reference to module. This
causes warning in module_put called from ife action cleanup function.

Implement add_metainfo_and_get_ops() function that returns with reference
to module taken if metainfo was added successfully, and call it from
use_all_metadata(), instead of calling __add_metainfo() directly.

Example warning:

[  646.344393] WARNING: CPU: 1 PID: 2278 at kernel/module.c:1139 module_put+0x1cb/0x230
[  646.352437] Modules linked in: act_meta_skbtcindex act_meta_mark act_meta_skbprio act_ife ife veth nfsv3 nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c tun ebtable_filter ebtables ip6table_filter ip6_tables bridge stp llc mlx5_ib ib_uverbs ib_core intel_rapl sb_edac x86_pkg_temp_thermal mlx5_core coretemp kvm_intel kvm nfsd igb irqbypass crct10dif_pclmul devlink crc32_pclmul mei_me joydev ses crc32c_intel enclosure auth_rpcgss i2c_algo_bit ioatdma ptp mei pps_core ghash_clmulni_intel iTCO_wdt iTCO_vendor_support pcspkr dca ipmi_ssif lpc_ich target_core_mod i2c_i801 ipmi_si ipmi_devintf pcc_cpufreq wmi ipmi_msghandler nfs_acl lockd acpi_pad acpi_power_meter grace sunrpc mpt3sas raid_class scsi_transport_sas
[  646.425631] CPU: 1 PID: 2278 Comm: tc Not tainted 4.19.0-rc1+ #799
[  646.432187] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[  646.440595] RIP: 0010:module_put+0x1cb/0x230
[  646.445238] Code: f3 66 94 02 e8 26 ff fa ff 85 c0 74 11 0f b6 1d 51 30 94 02 80 fb 01 77 60 83 e3 01 74 13 65 ff 0d 3a 83 db 73 e9 2b ff ff ff <0f> 0b e9 00 ff ff ff e8 59 01 fb ff 85 c0 75 e4 48 c7 c2 20 62 6b
[  646.464997] RSP: 0018:ffff880354d37068 EFLAGS: 00010286
[  646.470599] RAX: 0000000000000000 RBX: ffffffffc0a52518 RCX: ffffffff8c2668db
[  646.478118] RDX: 0000000000000003 RSI: dffffc0000000000 RDI: ffffffffc0a52518
[  646.485641] RBP: ffffffffc0a52180 R08: fffffbfff814a4a4 R09: fffffbfff814a4a3
[  646.493164] R10: ffffffffc0a5251b R11: fffffbfff814a4a4 R12: 1ffff1006a9a6e0d
[  646.500687] R13: 00000000ffffffff R14: ffff880362bab890 R15: dead000000000100
[  646.508213] FS:  00007f4164c99800(0000) GS:ffff88036fe40000(0000) knlGS:0000000000000000
[  646.516961] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  646.523080] CR2: 00007f41638b8420 CR3: 0000000351df0004 CR4: 00000000001606e0
[  646.530595] Call Trace:
[  646.533408]  ? find_symbol_in_section+0x260/0x260
[  646.538509]  tcf_ife_cleanup+0x11b/0x200 [act_ife]
[  646.543695]  tcf_action_cleanup+0x29/0xa0
[  646.548078]  __tcf_action_put+0x5a/0xb0
[  646.552289]  ? nla_put+0x65/0xe0
[  646.555889]  __tcf_idr_release+0x48/0x60
[  646.560187]  tcf_generic_walker+0x448/0x6b0
[  646.564764]  ? tcf_action_dump_1+0x450/0x450
[  646.569411]  ? __lock_is_held+0x84/0x110
[  646.573720]  ? tcf_ife_walker+0x10c/0x20f [act_ife]
[  646.578982]  tca_action_gd+0x972/0xc40
[  646.583129]  ? tca_get_fill.constprop.17+0x250/0x250
[  646.588471]  ? mark_lock+0xcf/0x980
[  646.592324]  ? check_chain_key+0x140/0x1f0
[  646.596832]  ? debug_show_all_locks+0x240/0x240
[  646.601839]  ? memset+0x1f/0x40
[  646.605350]  ? nla_parse+0xca/0x1a0
[  646.609217]  tc_ctl_action+0x215/0x230
[  646.613339]  ? tcf_action_add+0x220/0x220
[  646.617748]  rtnetlink_rcv_msg+0x56a/0x6d0
[  646.622227]  ? rtnl_fdb_del+0x3f0/0x3f0
[  646.626466]  netlink_rcv_skb+0x18d/0x200
[  646.630752]  ? rtnl_fdb_del+0x3f0/0x3f0
[  646.634959]  ? netlink_ack+0x500/0x500
[  646.639106]  netlink_unicast+0x2d0/0x370
[  646.643409]  ? netlink_attachskb+0x340/0x340
[  646.648050]  ? _copy_from_iter_full+0xe9/0x3e0
[  646.652870]  ? import_iovec+0x11e/0x1c0
[  646.657083]  netlink_sendmsg+0x3b9/0x6a0
[  646.661388]  ? netlink_unicast+0x370/0x370
[  646.665877]  ? netlink_unicast+0x370/0x370
[  646.670351]  sock_sendmsg+0x6b/0x80
[  646.674212]  ___sys_sendmsg+0x4a1/0x520
[  646.678443]  ? copy_msghdr_from_user+0x210/0x210
[  646.683463]  ? lock_downgrade+0x320/0x320
[  646.687849]  ? debug_show_all_locks+0x240/0x240
[  646.692760]  ? do_raw_spin_unlock+0xa2/0x130
[  646.697418]  ? _raw_spin_unlock+0x24/0x30
[  646.701798]  ? __handle_mm_fault+0x1819/0x1c10
[  646.706619]  ? __pmd_alloc+0x320/0x320
[  646.710738]  ? debug_show_all_locks+0x240/0x240
[  646.715649]  ? restore_nameidata+0x7b/0xa0
[  646.720117]  ? check_chain_key+0x140/0x1f0
[  646.724590]  ? check_chain_key+0x140/0x1f0
[  646.729070]  ? __fget_light+0xbc/0xd0
[  646.733121]  ? __sys_sendmsg+0xd7/0x150
[  646.737329]  __sys_sendmsg+0xd7/0x150
[  646.741359]  ? __ia32_sys_shutdown+0x30/0x30
[  646.746003]  ? up_read+0x53/0x90
[  646.749601]  ? __do_page_fault+0x484/0x780
[  646.754105]  ? do_syscall_64+0x1e/0x2c0
[  646.758320]  do_syscall_64+0x72/0x2c0
[  646.762353]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  646.767776] RIP: 0033:0x7f4163872150
[  646.771713] Code: 8b 15 3c 7d 2b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb cd 66 0f 1f 44 00 00 83 3d b9 d5 2b 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be cd 00 00 48 89 04 24
[  646.791474] RSP: 002b:00007ffdef7d6b58 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  646.799721] RAX: ffffffffffffffda RBX: 0000000000000024 RCX: 00007f4163872150
[  646.807240] RDX: 0000000000000000 RSI: 00007ffdef7d6bd0 RDI: 0000000000000003
[  646.814760] RBP: 000000005b8b9482 R08: 0000000000000001 R09: 0000000000000000
[  646.822286] R10: 00000000000005e7 R11: 0000000000000246 R12: 00007ffdef7dad20
[  646.829807] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000679bc0
[  646.837360] irq event stamp: 6083
[  646.841043] hardirqs last  enabled at (6081): [<ffffffff8c220a7d>] __call_rcu+0x17d/0x500
[  646.849882] hardirqs last disabled at (6083): [<ffffffff8c004f06>] trace_hardirqs_off_thunk+0x1a/0x1c
[  646.859775] softirqs last  enabled at (5968): [<ffffffff8d4004a1>] __do_softirq+0x4a1/0x6ee
[  646.868784] softirqs last disabled at (6082): [<ffffffffc0a78759>] tcf_ife_cleanup+0x39/0x200 [act_ife]
[  646.878845] ---[ end trace b1b8c12ffe51e657 ]---

Fixes: 5ffe57da29 ("act_ife: fix a potential deadlock")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-04 12:20:21 -07:00
Cong Wang
6d784f1625 act_ife: fix a potential use-after-free
Immediately after module_put(), user could delete this
module, so e->ops could be already freed before we call
e->ops->release().

Fix this by moving module_put() after ops->release().

Fixes: ef6980b6be ("introduce IFE action")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-04 12:18:25 -07:00
Sowmini Varadhan
782710e333 xfrm: reset crypto_done when iterating over multiple input xfrms
We only support one offloaded xfrm (we do not have devices that
can handle more than one offload), so reset crypto_done in
xfrm_input() when iterating over multiple transforms in xfrm_input,
so that we can invoke the appropriate x->type->input for the
non-offloaded transforms

Fixes: d77e38e612 ("xfrm: Add an IPsec hardware offloading API")
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-09-04 10:27:07 +02:00
Sowmini Varadhan
bfc0698beb xfrm: reset transport header back to network header after all input transforms ahave been applied
A policy may have been set up with multiple transforms (e.g., ESP
and ipcomp). In this situation, the ingress IPsec processing
iterates in xfrm_input() and applies each transform in turn,
processing the nexthdr to find any additional xfrm that may apply.

This patch resets the transport header back to network header
only after the last transformation so that subsequent xfrms
can find the correct transport header.

Fixes: 7785bba299 ("esp: Add a software GRO codepath")
Suggested-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-09-04 10:26:30 +02:00
David S. Miller
fc3e3bf55f Here are quite a large number of fixes, notably:
* various A-MSDU building fixes (currently only affects mt76)
  * syzkaller & spectre fixes in hwsim
  * TXQ vs. teardown fix that was causing crashes
  * embed WMM info in reg rule, bad code here had been causing crashes
  * one compilation issue with fix from Arnd (rfkill-gpio includes)
  * fixes for a race and bad data during/after channel switch
  * nl80211: a validation fix, attribute type & unit fixes
 along with other small fixes.
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAluNJXcACgkQB8qZga/f
 l8Qvfw//dBwlhMII862Evk4M8OzhdHfkJ4Kp/d2C476whbEySU/jRIIeetmVpXYV
 5cfStTxBpGkwMj5PXy3DaA2PO++L5qaApDJfHc8DNWNmvt9rRRJul1zP05HjZRxW
 G7aFCFRWVK0dlmVP9GC/b20KyUvz4OpklBnxylkIrx0FCkw5bAHs1SsjGZCg/6Tm
 008DAhFz3Ds6hNLxwricvrk5oQ6eC1cDfDd4Rtk3jCYQ4t7KFn5gFoKzKldfLdWe
 TFTpVQ26XAGzn9QVXzAiXN4ZNpUpZrFXosC7cn5Ugiyic4YtnHxS2wVDuL3vs1cL
 J2hoW6wjEBg+U6vmHMcijo1lnQwW7ueYUDWLJPNIXHA6A7sGyA6z6D7vbbvHfoG6
 L681BrYmTmKkXXquu5+r85/9WgP2cmzbRpoIxTQl3sU2Liw2k5IJ9ryLLyul+8z7
 spnDPOY7h4c0JrAvhjHkrKIbbW4FKYunxZJ8dn9eyAzOd/58iKoXzu4yAggwm+0V
 DtZiu0gSr52sKrh1vqEyfhrPFCN1Mc19DRsJBtabUfVEveQTwToCkbZ5s1sLqSId
 m30XUjjYOiRk7MZnncar0lE4//eJ6bnL3Wie3UTmO3xsMwlgKQPqjI4TprNogUCk
 R2dVeGmhm3HSriRHKJL3/D8uzw5mMBI3Kicw9tFSSyVjtJgxvpg=
 =lLBA
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2018-09-03' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Here are quite a large number of fixes, notably:
 * various A-MSDU building fixes (currently only affects mt76)
 * syzkaller & spectre fixes in hwsim
 * TXQ vs. teardown fix that was causing crashes
 * embed WMM info in reg rule, bad code here had been causing crashes
 * one compilation issue with fix from Arnd (rfkill-gpio includes)
 * fixes for a race and bad data during/after channel switch
 * nl80211: a validation fix, attribute type & unit fixes
along with other small fixes.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-03 22:12:02 -07:00
Zhenbo Gao
a484ef3442 tipc: correct spelling errors for tipc_topsrv_queue_evt() comments
tipc_conn_queue_evt -> tipc_topsrv_queue_evt
tipc_send_work -> tipc_conn_send_work
tipc_send_to_sock -> tipc_conn_send_to_sock

Signed-off-by: Zhenbo Gao <zhenbo.gao@windriver.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-03 22:03:07 -07:00
Zhenbo Gao
9cc1bf3928 tipc: correct spelling errors for struct tipc_bc_base's comment
Trivial fix for two spelling mistakes.

Signed-off-by: Zhenbo Gao <zhenbo.gao@windriver.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-03 22:03:07 -07:00
Xin Long
741880e1f2 sctp: not traverse asoc trans list if non-ipv6 trans exists for ipv6_flowlabel
When users set params.spp_address and get a trans, ipv6_flowlabel flag
should be applied into this trans. But even if this one is not an ipv6
trans, it should not go to apply it into all other transes of the asoc
but simply ignore it.

Fixes: 0b0dce7a36 ("sctp: add spp_ipv6_flowlabel and spp_dscp for sctp_paddrparams")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-03 21:57:54 -07:00
Xin Long
af8a2b8ba7 sctp: fix invalid reference to the index variable of the iterator
Now in sctp_apply_peer_addr_params(), if SPP_IPV6_FLOWLABEL flag is set
and trans is NULL, it would use trans as the index variable to traverse
transport_addr_list, then trans is set as the last transport of it.

Later, if SPP_DSCP flag is set, it would enter into the wrong branch as
trans is actually an invalid reference.

So fix it by using a new index variable to traverse transport_addr_list
for both SPP_DSCP and SPP_IPV6_FLOWLABEL flags process.

Fixes: 0b0dce7a36 ("sctp: add spp_ipv6_flowlabel and spp_dscp for sctp_paddrparams")
Reported-by: Julia Lawall <julia.lawall@lip6.fr>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-03 21:57:54 -07:00
Vlad Buslov
c10bbfae3a net: sched: null actions array pointer before releasing action
Currently, tcf_action_delete() nulls actions array pointer after putting
and deleting it. However, if tcf_idr_delete_index() returns an error,
pointer to action is not set to null. That results it being released second
time in error handling code of tca_action_gd().

Kasan error:

[  807.367755] ==================================================================
[  807.375844] BUG: KASAN: use-after-free in tc_setup_cb_call+0x14e/0x250
[  807.382763] Read of size 8 at addr ffff88033e636000 by task tc/2732

[  807.391289] CPU: 0 PID: 2732 Comm: tc Tainted: G        W         4.19.0-rc1+ #799
[  807.399542] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[  807.407948] Call Trace:
[  807.410763]  dump_stack+0x92/0xeb
[  807.414456]  print_address_description+0x70/0x360
[  807.419549]  kasan_report+0x14d/0x300
[  807.423582]  ? tc_setup_cb_call+0x14e/0x250
[  807.428150]  tc_setup_cb_call+0x14e/0x250
[  807.432539]  ? nla_put+0x65/0xe0
[  807.436146]  fl_dump+0x394/0x3f0 [cls_flower]
[  807.440890]  ? fl_tmplt_dump+0x140/0x140 [cls_flower]
[  807.446327]  ? lock_downgrade+0x320/0x320
[  807.450702]  ? lock_acquire+0xe2/0x220
[  807.454819]  ? is_bpf_text_address+0x5/0x140
[  807.459475]  ? memcpy+0x34/0x50
[  807.462980]  ? nla_put+0x65/0xe0
[  807.466582]  tcf_fill_node+0x341/0x430
[  807.470717]  ? tcf_block_put+0xe0/0xe0
[  807.474859]  tcf_node_dump+0xdb/0xf0
[  807.478821]  fl_walk+0x8e/0x170 [cls_flower]
[  807.483474]  tcf_chain_dump+0x35a/0x4d0
[  807.487703]  ? tfilter_notify+0x170/0x170
[  807.492091]  ? tcf_fill_node+0x430/0x430
[  807.496411]  tc_dump_tfilter+0x362/0x3f0
[  807.500712]  ? tc_del_tfilter+0x850/0x850
[  807.505104]  ? kasan_unpoison_shadow+0x30/0x40
[  807.509940]  ? __mutex_unlock_slowpath+0xcf/0x410
[  807.515031]  netlink_dump+0x263/0x4f0
[  807.519077]  __netlink_dump_start+0x2a0/0x300
[  807.523817]  ? tc_del_tfilter+0x850/0x850
[  807.528198]  rtnetlink_rcv_msg+0x46a/0x6d0
[  807.532671]  ? rtnl_fdb_del+0x3f0/0x3f0
[  807.536878]  ? tc_del_tfilter+0x850/0x850
[  807.541280]  netlink_rcv_skb+0x18d/0x200
[  807.545570]  ? rtnl_fdb_del+0x3f0/0x3f0
[  807.549773]  ? netlink_ack+0x500/0x500
[  807.553913]  netlink_unicast+0x2d0/0x370
[  807.558212]  ? netlink_attachskb+0x340/0x340
[  807.562855]  ? _copy_from_iter_full+0xe9/0x3e0
[  807.567677]  ? import_iovec+0x11e/0x1c0
[  807.571890]  netlink_sendmsg+0x3b9/0x6a0
[  807.576192]  ? netlink_unicast+0x370/0x370
[  807.580684]  ? netlink_unicast+0x370/0x370
[  807.585154]  sock_sendmsg+0x6b/0x80
[  807.589015]  ___sys_sendmsg+0x4a1/0x520
[  807.593230]  ? copy_msghdr_from_user+0x210/0x210
[  807.598232]  ? do_wp_page+0x174/0x880
[  807.602276]  ? __handle_mm_fault+0x749/0x1c10
[  807.607021]  ? __handle_mm_fault+0x1046/0x1c10
[  807.611849]  ? __pmd_alloc+0x320/0x320
[  807.615973]  ? check_chain_key+0x140/0x1f0
[  807.620450]  ? check_chain_key+0x140/0x1f0
[  807.624929]  ? __fget_light+0xbc/0xd0
[  807.628970]  ? __sys_sendmsg+0xd7/0x150
[  807.633172]  __sys_sendmsg+0xd7/0x150
[  807.637201]  ? __ia32_sys_shutdown+0x30/0x30
[  807.641846]  ? up_read+0x53/0x90
[  807.645442]  ? __do_page_fault+0x484/0x780
[  807.649949]  ? do_syscall_64+0x1e/0x2c0
[  807.654164]  do_syscall_64+0x72/0x2c0
[  807.658198]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  807.663625] RIP: 0033:0x7f42e9870150
[  807.667568] Code: 8b 15 3c 7d 2b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb cd 66 0f 1f 44 00 00 83 3d b9 d5 2b 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 be cd 00 00 48 89 04 24
[  807.687328] RSP: 002b:00007ffdbf595b58 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  807.695564] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f42e9870150
[  807.703083] RDX: 0000000000000000 RSI: 00007ffdbf595b80 RDI: 0000000000000003
[  807.710605] RBP: 00007ffdbf599d90 R08: 0000000000679bc0 R09: 000000000000000f
[  807.718127] R10: 00000000000005e7 R11: 0000000000000246 R12: 00007ffdbf599d88
[  807.725651] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000

[  807.735048] Allocated by task 2687:
[  807.738902]  kasan_kmalloc+0xa0/0xd0
[  807.742852]  __kmalloc+0x118/0x2d0
[  807.746615]  tcf_idr_create+0x44/0x320
[  807.750738]  tcf_nat_init+0x41e/0x530 [act_nat]
[  807.755638]  tcf_action_init_1+0x4e0/0x650
[  807.760104]  tcf_action_init+0x1ce/0x2d0
[  807.764395]  tcf_exts_validate+0x1d8/0x200
[  807.768861]  fl_change+0x55a/0x26b4 [cls_flower]
[  807.773845]  tc_new_tfilter+0x748/0xa20
[  807.778051]  rtnetlink_rcv_msg+0x56a/0x6d0
[  807.782517]  netlink_rcv_skb+0x18d/0x200
[  807.786804]  netlink_unicast+0x2d0/0x370
[  807.791095]  netlink_sendmsg+0x3b9/0x6a0
[  807.795387]  sock_sendmsg+0x6b/0x80
[  807.799240]  ___sys_sendmsg+0x4a1/0x520
[  807.803445]  __sys_sendmsg+0xd7/0x150
[  807.807473]  do_syscall_64+0x72/0x2c0
[  807.811506]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

[  807.818776] Freed by task 2728:
[  807.822283]  __kasan_slab_free+0x122/0x180
[  807.826752]  kfree+0xf4/0x2f0
[  807.830080]  __tcf_action_put+0x5a/0xb0
[  807.834281]  tcf_action_put_many+0x46/0x70
[  807.838747]  tca_action_gd+0x232/0xc40
[  807.842862]  tc_ctl_action+0x215/0x230
[  807.846977]  rtnetlink_rcv_msg+0x56a/0x6d0
[  807.851444]  netlink_rcv_skb+0x18d/0x200
[  807.855731]  netlink_unicast+0x2d0/0x370
[  807.860021]  netlink_sendmsg+0x3b9/0x6a0
[  807.864312]  sock_sendmsg+0x6b/0x80
[  807.868166]  ___sys_sendmsg+0x4a1/0x520
[  807.872372]  __sys_sendmsg+0xd7/0x150
[  807.876401]  do_syscall_64+0x72/0x2c0
[  807.880431]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

[  807.887704] The buggy address belongs to the object at ffff88033e636000
                which belongs to the cache kmalloc-256 of size 256
[  807.900909] The buggy address is located 0 bytes inside of
                256-byte region [ffff88033e636000, ffff88033e636100)
[  807.913155] The buggy address belongs to the page:
[  807.918322] page:ffffea000cf98d80 count:1 mapcount:0 mapping:ffff88036f80ee00 index:0x0 compound_mapcount: 0
[  807.928831] flags: 0x5fff8000008100(slab|head)
[  807.933647] raw: 005fff8000008100 ffffea000db44f00 0000000400000004 ffff88036f80ee00
[  807.942050] raw: 0000000000000000 0000000080190019 00000001ffffffff 0000000000000000
[  807.950456] page dumped because: kasan: bad access detected

[  807.958240] Memory state around the buggy address:
[  807.963405]  ffff88033e635f00: fc fc fc fc fb fb fb fb fb fb fb fc fc fc fc fb
[  807.971288]  ffff88033e635f80: fb fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc
[  807.979166] >ffff88033e636000: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  807.994882]                    ^
[  807.998477]  ffff88033e636080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[  808.006352]  ffff88033e636100: fc fc fc fc fc fc fc fc fb fb fb fb fb fb fb fb
[  808.014230] ==================================================================
[  808.022108] Disabling lock debugging due to kernel taint

Fixes: edfaf94fa7 ("net_sched: improve and refactor tcf_action_put_many()")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-03 21:47:33 -07:00
Zhenbo Gao
05a6843c92 tipc: correct structure parameter comments for topsrv
Remove the following obsolete parameter comments of tipc_topsrv struct:
  @rcvbuf_cache
  @tipc_conn_new
  @tipc_conn_release
  @tipc_conn_recvmsg
  @imp
  @type

Add the comments for the missing parameters below of tipc_topsrv struct:
  @awork
  @listener

Remove the unused or duplicated parameter comments of tipc_conn struct:
  @outqueue_lock
  @rx_action

Signed-off-by: Zhenbo Gao <zhenbo.gao@windriver.com>
Reviewed-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-03 21:43:42 -07:00
Hangbin Liu
36feaac354 ip6_tunnel: respect ttl inherit for ip6tnl
man ip-tunnel ttl section says:
0 is a special value meaning that packets inherit the TTL value.

IPv4 tunnel respect this in ip_tunnel_xmit(), but IPv6 tunnel has not
implement it yet. To make IPv6 behave consistently with IP tunnel,
add ipv6 tunnel inherit support.

Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-03 19:04:12 -07:00
Emmanuel Grumbach
c6e57b3896 mac80211: shorten the IBSS debug messages
When tracing is enabled, all the debug messages are recorded and must
not exceed MAX_MSG_LEN (100) columns. Longer debug messages grant the
user with:

WARNING: CPU: 3 PID: 32642 at /tmp/wifi-core-20180806094828/src/iwlwifi-stack-dev/net/mac80211/./trace_msg.h:32 trace_event_raw_event_mac80211_msg_event+0xab/0xc0 [mac80211]
Workqueue: phy1 ieee80211_iface_work [mac80211]
 RIP: 0010:trace_event_raw_event_mac80211_msg_event+0xab/0xc0 [mac80211]
 Call Trace:
  __sdata_dbg+0xbd/0x120 [mac80211]
  ieee80211_ibss_rx_queued_mgmt+0x15f/0x510 [mac80211]
  ieee80211_iface_work+0x21d/0x320 [mac80211]

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-03 10:41:27 +02:00
Emmanuel Grumbach
6c18b27d6e mac80211: don't Tx a deauth frame if the AP forbade Tx
If the driver fails to properly prepare for the channel
switch, mac80211 will disconnect. If the CSA IE had mode
set to 1, it means that the clients are not allowed to send
any Tx on the current channel, and that includes the
deauthentication frame.

Make sure that we don't send the deauthentication frame in
this case.

In iwlwifi, this caused a failure to flush queues since the
firmware already closed the queues after having parsed the
CSA IE. Then mac80211 would wait until the deauthentication
frame would go out (drv_flush(drop=false)) and that would
never happen.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-03 10:39:34 +02:00
Ilan Peer
0007e94355 mac80211: Fix station bandwidth setting after channel switch
When performing a channel switch flow for a managed interface, the
flow did not update the bandwidth of the AP station and the rate
scale algorithm. In case of a channel width downgrade, this would
result with the rate scale algorithm using a bandwidth that does not
match the interface channel configuration.

Fix this by updating the AP station bandwidth and rate scaling algorithm
before the actual channel change in case of a bandwidth downgrade, or
after the actual channel change in case of a bandwidth upgrade.

Signed-off-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-03 10:39:11 +02:00
Emmanuel Grumbach
f3ffb6c3a2 mac80211: fix a race between restart and CSA flows
We hit a problem with iwlwifi that was caused by a bug in
mac80211. A bug in iwlwifi caused the firwmare to crash in
certain cases in channel switch. Because of that bug,
drv_pre_channel_switch would fail and trigger the restart
flow.
Now we had the hw restart worker which runs on the system's
workqueue and the csa_connection_drop_work worker that runs
on mac80211's workqueue that can run together. This is
obviously problematic since the restart work wants to
reconfigure the connection, while the csa_connection_drop_work
worker does the exact opposite: it tries to disconnect.

Fix this by cancelling the csa_connection_drop_work worker
in the restart worker.

Note that this can sound racy: we could have:

driver   iface_work   CSA_work   restart_work
+++++++++++++++++++++++++++++++++++++++++++++
              |
 <--drv_cs ---|
<FW CRASH!>
-CS FAILED-->
              |                       |
              |                 cancel_work(CSA)
           schedule                   |
           CSA work                   |
                         |            |
                        Race between those 2

But this is not possible because we flush the workqueue
in the restart worker before we cancel the CSA worker.
That would be bullet proof if we could guarantee that
we schedule the CSA worker only from the iface_work
which runs on the workqueue (and not on the system's
workqueue), but unfortunately we do have an instance
in which we schedule the CSA work outside the context
of the workqueue (ieee80211_chswitch_done).

Note also that we should probably cancel other workers
like beacon_connection_loss_work and possibly others
for different types of interfaces, at the very least,
IBSS should suffer from the exact same problem, but for
now, do the minimum to fix the actual bug that was actually
experienced and reproduced.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-03 10:38:39 +02:00
Dreyfuss, Haim
abd76d255d mac80211: fix WMM TXOP calculation
In commit 9236c4523e5b ("mac80211: limit wmm params to comply
with ETSI requirements"), we have limited the WMM parameters to
comply with 802.11 and ETSI standard.  Mistakenly the TXOP value
was caluclated wrong.  Fix it by taking the minimum between
802.11 to ETSI to make sure we are not violating both.

Fixes: e552af0581 ("mac80211: limit wmm params to comply with ETSI requirements")
Signed-off-by: Haim Dreyfuss <haim.dreyfuss@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-03 10:35:03 +02:00
Dan Carpenter
8442938c3a cfg80211: fix a type issue in ieee80211_chandef_to_operating_class()
The "chandef->center_freq1" variable is a u32 but "freq" is a u16 so we
are truncating away the high bits.  I noticed this bug because in commit
9cf0a0b4b6 ("cfg80211: Add support for 60GHz band channels 5 and 6")
we made "freq <= 56160 + 2160 * 6" a valid requency when before it was
only "freq <= 56160 + 2160 * 4" that was valid.  It introduces a static
checker warning:

    net/wireless/util.c:1571 ieee80211_chandef_to_operating_class()
    warn: always true condition '(freq <= 56160 + 2160 * 6) => (0-u16max <= 69120)'

But really we probably shouldn't have been truncating the high bits
away to begin with.

Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-03 10:34:12 +02:00
Lorenzo Bianconi
66eb02d839 mac80211: fix an off-by-one issue in A-MSDU max_subframe computation
Initialize 'n' to 2 in order to take into account also the first
packet in the estimation of max_subframe limit for a given A-MSDU
since frag_tail pointer is NULL when ieee80211_amsdu_aggregate
routine analyzes the second frame.

Fixes: 6e0456b545 ("mac80211: add A-MSDU tx support")
Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-09-03 10:29:14 +02:00
Thadeu Lima de Souza Cascardo
215ab0f021 xfrm6: call kfree_skb when skb is toobig
After commit d6990976af ("vti6: fix PMTU caching
and reporting on xmit"), some too big skbs might be potentially passed down to
__xfrm6_output, causing it to fail to transmit but not free the skb, causing a
leak of skb, and consequentially a leak of dst references.

After running pmtu.sh, that shows as failure to unregister devices in a namespace:

[  311.397671] unregister_netdevice: waiting for veth_b to become free. Usage count = 1

The fix is to call kfree_skb in case of transmit failures.

Fixes: dd767856a3 ("xfrm6: Don't call icmpv6_send on local error")
Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@canonical.com>
Reviewed-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-09-03 07:37:57 +02:00
Yafang Shao
743e481580 tcp: remove useless add operation when init sysctl_max_tw_buckets
cp_hashinfo.ehash_mask is always an odd number, which is set in function
alloc_large_system_hash(). See bellow,
        if (_hash_mask)
                *_hash_mask = (1 << log2qty) - 1; <<< always odd number

Hence the local variable 'cnt' is a even number, as a result of that it is
no difference to do the incrementation here.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-02 16:12:41 -07:00
David S. Miller
a80afe89d8 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-09-02

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix one remaining buggy offset override in sockmap's bpf_msg_pull_data()
   when linearizing multiple scatterlist elements, from Tushar.

2) Fix BPF sockmap's misuse of ULP when a collision with another ULP is
   found on map update where it would release existing ULP. syzbot found and
   triggered this couple of times now, fix from John.

3) Add missing xskmap type to bpftool so it will properly show the type
   on map dump, from Prashant.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-02 15:53:38 -07:00
David Ahern
15a81b418e net/ipv6: Only update MTU metric if it set
Jan reported a regression after an update to 4.18.5. In this case ipv6
default route is setup by systemd-networkd based on data from an RA. The
RA contains an MTU of 1492 which is used when the route is first inserted
but then systemd-networkd pushes down updates to the default route
without the mtu set.

Prior to the change to fib6_info, metrics such as MTU were held in the
dst_entry and rt6i_pmtu in rt6_info contained an update to the mtu if
any. ip6_mtu would look at rt6i_pmtu first and use it if set. If not,
the value from the metrics is used if it is set and finally falling
back to the idev value.

After the fib6_info change metrics are contained in the fib6_info struct
and there is no equivalent to rt6i_pmtu. To maintain consistency with
the old behavior the new code should only reset the MTU in the metrics
if the route update has it set.

Fixes: d4ead6b34b ("net/ipv6: move metrics from dst to rt6_info")
Reported-by: Jan Janssen <medhefgo@web.de>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-02 14:03:54 -07:00
Hangbin Liu
ff06525fcb igmp: fix incorrect unsolicit report count after link down and up
After link down and up, i.e. when call ip_mc_up(), we doesn't init
im->unsolicit_count. So after igmp_timer_expire(), we will not start
timer again and only send one unsolicit report at last.

Fix it by initializing im->unsolicit_count in igmp_group_added(), so
we can respect igmp robustness value.

Fixes: 24803f38a5 ("igmp: do not remove igmp souce list info when set link down")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-02 13:39:37 -07:00
Hangbin Liu
4fb7253e4f igmp: fix incorrect unsolicit report count when join group
We should not start timer if im->unsolicit_count equal to 0 after decrease.
Or we will send one more unsolicit report message. i.e. 3 instead of 2 by
default.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-02 13:39:37 -07:00
Tushar Dave
9db39f4d4f bpf: Fix bpf_msg_pull_data()
Helper bpf_msg_pull_data() mistakenly reuses variable 'offset' while
linearizing multiple scatterlist elements. Variable 'offset' is used
to find first starting scatterlist element
    i.e. msg->data = sg_virt(&sg[first_sg]) + start - offset"

Use different variable name while linearizing multiple scatterlist
elements so that value contained in variable 'offset' won't get
overwritten.

Fixes: 015632bb30 ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Tushar Dave <tushar.n.dave@oracle.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-02 22:29:53 +02:00
Linus Torvalds
501dacbc24 Merge branch 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull core fixes from Thomas Gleixner:
 "A small set of updates for core code:

   - Prevent tracing in functions which are called from trace patching
     via stop_machine() to prevent executing half patched function trace
     entries.

   - Remove old GCC workarounds

   - Remove pointless includes of notifier.h"

* 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  objtool: Remove workaround for unreachable warnings from old GCC
  notifier: Remove notifier header file wherever not used
  watchdog: Mark watchdog touch functions as notrace
2018-09-02 09:41:45 -07:00
Arnd Bergmann
2de9d505fb rds: store socket timestamps as ktime_t
rds is the last in-kernel user of the old do_gettimeofday()
function. Convert it over to ktime_get_real() to make it
work more like the generic socket timestamps, and to let
us kill off do_gettimeofday().

A follow-up patch will have to change the user space interface
to deal better with 32-bit tasks, which may use an incompatible
layout for 'struct timespec'.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-01 19:52:50 -07:00
Vakul Garg
94524d8fc9 net/tls: Add support for async decryption of tls records
When tls records are decrypted using asynchronous acclerators such as
NXP CAAM engine, the crypto apis return -EINPROGRESS. Presently, on
getting -EINPROGRESS, the tls record processing stops till the time the
crypto accelerator finishes off and returns the result. This incurs a
context switch and is not an efficient way of accessing the crypto
accelerators. Crypto accelerators work efficient when they are queued
with multiple crypto jobs without having to wait for the previous ones
to complete.

The patch submits multiple crypto requests without having to wait for
for previous ones to complete. This has been implemented for records
which are decrypted in zero-copy mode. At the end of recvmsg(), we wait
for all the asynchronous decryption requests to complete.

The references to records which have been sent for async decryption are
dropped. For cases where record decryption is not possible in zero-copy
mode, asynchronous decryption is not used and we wait for decryption
crypto api to complete.

For crypto requests executing in async fashion, the memory for
aead_request, sglists and skb etc is freed from the decryption
completion handler. The decryption completion handler wakesup the
sleeping user context when recvmsg() flags that it has done sending
all the decryption requests and there are no more decryption requests
pending to be completed.

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Reviewed-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-01 19:52:50 -07:00
Alexey Kodanev
93bbadd6e0 ipv6: don't get lwtstate twice in ip6_rt_copy_init()
Commit 80f1a0f4e0 ("net/ipv6: Put lwtstate when destroying fib6_info")
partially fixed the kmemleak [1], lwtstate can be copied from fib6_info,
with ip6_rt_copy_init(), and it should be done only once there.

rt->dst.lwtstate is set by ip6_rt_init_dst(), at the start of the function
ip6_rt_copy_init(), so there is no need to get it again at the end.

With this patch, lwtstate also isn't copied from RTF_REJECT routes.

[1]:
unreferenced object 0xffff880b6aaa14e0 (size 64):
  comm "ip", pid 10577, jiffies 4295149341 (age 1273.903s)
  hex dump (first 32 bytes):
    01 00 04 00 04 00 00 00 10 00 00 00 00 00 00 00  ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [<0000000018664623>] lwtunnel_build_state+0x1bc/0x420
    [<00000000b73aa29a>] ip6_route_info_create+0x9f7/0x1fd0
    [<00000000ee2c5d1f>] ip6_route_add+0x14/0x70
    [<000000008537b55c>] inet6_rtm_newroute+0xd9/0xe0
    [<000000002acc50f5>] rtnetlink_rcv_msg+0x66f/0x8e0
    [<000000008d9cd381>] netlink_rcv_skb+0x268/0x3b0
    [<000000004c893c76>] netlink_unicast+0x417/0x5a0
    [<00000000f2ab1afb>] netlink_sendmsg+0x70b/0xc30
    [<00000000890ff0aa>] sock_sendmsg+0xb1/0xf0
    [<00000000a2e7b66f>] ___sys_sendmsg+0x659/0x950
    [<000000001e7426c8>] __sys_sendmsg+0xde/0x170
    [<00000000fe411443>] do_syscall_64+0x9f/0x4a0
    [<000000001be7b28b>] entry_SYSCALL_64_after_hwframe+0x49/0xbe
    [<000000006d21f353>] 0xffffffffffffffff

Fixes: 6edb3c96a5 ("net/ipv6: Defer initialization of dst to data path")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-09-01 17:42:12 -07:00
Andy Shevchenko
459479da97 bridge: Switch to bitmap_zalloc()
Switch to bitmap_zalloc() to show clearly what we are allocating.
Besides that it returns pointer of bitmap type instead of opaque void *.

Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-31 23:13:04 -07:00
Florian Westphal
63cc357f7b tcp: do not restart timewait timer on rst reception
RFC 1337 says:
 ''Ignore RST segments in TIME-WAIT state.
   If the 2 minute MSL is enforced, this fix avoids all three hazards.''

So with net.ipv4.tcp_rfc1337=1, expected behaviour is to have TIME-WAIT sk
expire rather than removing it instantly when a reset is received.

However, Linux will also re-start the TIME-WAIT timer.

This causes connect to fail when tying to re-use ports or very long
delays (until syn retry interval exceeds MSL).

packetdrill test case:
// Demonstrate bogus rearming of TIME-WAIT timer in rfc1337 mode.
`sysctl net.ipv4.tcp_rfc1337=1`

0.000 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
0.000 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
0.000 bind(3, ..., ...) = 0
0.000 listen(3, 1) = 0

0.100 < S 0:0(0) win 29200 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.100 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>
0.200 < . 1:1(0) ack 1 win 257
0.200 accept(3, ..., ...) = 4

// Receive first segment
0.310 < P. 1:1001(1000) ack 1 win 46

// Send one ACK
0.310 > . 1:1(0) ack 1001

// read 1000 byte
0.310 read(4, ..., 1000) = 1000

// Application writes 100 bytes
0.350 write(4, ..., 100) = 100
0.350 > P. 1:101(100) ack 1001

// ACK
0.500 < . 1001:1001(0) ack 101 win 257

// close the connection
0.600 close(4) = 0
0.600 > F. 101:101(0) ack 1001 win 244

// Our side is in FIN_WAIT_1 & waits for ack to fin
0.7 < . 1001:1001(0) ack 102 win 244

// Our side is in FIN_WAIT_2 with no outstanding data.
0.8 < F. 1001:1001(0) ack 102 win 244
0.8 > . 102:102(0) ack 1002 win 244

// Our side is now in TIME_WAIT state, send ack for fin.
0.9 < F. 1002:1002(0) ack 102 win 244
0.9 > . 102:102(0) ack 1002 win 244

// Peer reopens with in-window SYN:
1.000 < S 1000:1000(0) win 9200 <mss 1460,nop,nop,sackOK,nop,wscale 7>

// Therefore, reply with ACK.
1.000 > . 102:102(0) ack 1002 win 244

// Peer sends RST for this ACK.  Normally this RST results
// in tw socket removal, but rfc1337=1 setting prevents this.
1.100 < R 1002:1002(0) win 244

// second syn. Due to rfc1337=1 expect another pure ACK.
31.0 < S 1000:1000(0) win 9200 <mss 1460,nop,nop,sackOK,nop,wscale 7>
31.0 > . 102:102(0) ack 1002 win 244

// .. and another RST from peer.
31.1 < R 1002:1002(0) win 244
31.2 `echo no timer restart;ss -m -e -a -i -n -t -o state TIME-WAIT`

// third syn after one minute.  Time-Wait socket should have expired by now.
63.0 < S 1000:1000(0) win 9200 <mss 1460,nop,nop,sackOK,nop,wscale 7>

// so we expect a syn-ack & 3whs to proceed from here on.
63.0 > S. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 7>

Without this patch, 'ss' shows restarts of tw timer and last packet is
thus just another pure ack, more than one minute later.

This restores the original code from commit 283fd6cf0be690a83
("Merge in ANK networking jumbo patch") in netdev-vger-cvs.git .

For some reason the else branch was removed/lost in 1f28b683339f7
("Merge in TCP/UDP optimizations and [..]") and timer restart became
unconditional.

Reported-by: Michal Tesar <mtesar@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-31 23:10:35 -07:00
Pavel Machek
b0e0b0abbd net/rds: RDS is not Radio Data System
Getting prompt "The RDS Protocol" (RDS) is not too helpful, and it is
easily confused with Radio Data System (which we may want to support
in kernel, too).

Signed-off-by: Pavel Machek <pavel@ucw.cz>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-31 23:09:53 -07:00
David Ahern
1367bbf52a net/ipv6: Do not reset nl_net in ip6_route_info_create
nl_net is set on entry to ip6_route_info_create. Only devices
within that namespace are considered so no need to reset it
before returning.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-31 23:04:52 -07:00
David Ahern
066b103008 net/ipv4: Add extack message that dev is required for ONLINK
Make IPv4 consistent with IPv6 and return an extack message that the
ONLINK flag requires a nexthop device.

Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-31 23:04:03 -07:00
Yuchung Cheng
7788174e87 tcp: change IPv6 flow-label upon receiving spurious retransmission
Currently a Linux IPv6 TCP sender will change the flow label upon
timeouts to potentially steer away from a data path that has gone
bad. However this does not help if the problem is on the ACK path
and the data path is healthy. In this case the receiver is likely
to receive repeated spurious retransmission because the sender
couldn't get the ACKs in time and has recurring timeouts.

This patch adds another feature to mitigate this problem. It
leverages the DSACK states in the receiver to change the flow
label of the ACKs to speculatively re-route the ACK packets.
In order to allow triggering on the second consecutive spurious
RTO, the receiver changes the flow label upon sending a second
consecutive DSACK for a sequence number below RCV.NXT.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-31 23:03:00 -07:00
Eric Dumazet
3a7ad0634f Revert "packet: switch kvzalloc to allocate memory"
This reverts commit 71e4128620.

mmap()/munmap() can not be backed by kmalloced pages :

We fault in :

    VM_BUG_ON_PAGE(PageSlab(page), page);

    unmap_single_vma+0x8a/0x110
    unmap_vmas+0x4b/0x90
    unmap_region+0xc9/0x140
    do_munmap+0x274/0x360
    vm_munmap+0x81/0xc0
    SyS_munmap+0x2b/0x40
    do_syscall_64+0x13e/0x1c0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

Fixes: 71e4128620 ("packet: switch kvzalloc to allocate memory")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: John Sperbeck <jsperbeck@google.com>
Bisected-by: John Sperbeck <jsperbeck@google.com>
Cc: Zhang Yu <zhangyu31@baidu.com>
Cc: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-31 23:00:28 -07:00
Cong Wang
506a03aa04 net_sched: add missing tcf_lock for act_connmark
According to the new locking rule, we have to take tcf_lock
for both ->init() and ->dump(), as RTNL will be removed.
However, it is missing for act_connmark.

Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-31 22:57:43 -07:00
Cong Wang
f061b48c17 Revert "net: sched: act: add extack for lookup callback"
This reverts commit 331a9295de ("net: sched: act: add extack for lookup callback").

This extack is never used after 6 months... In fact, it can be just
set in the caller, right after ->lookup().

Cc: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-31 22:50:15 -07:00
David S. Miller
fd3c040b24 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-09-01

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add AF_XDP zero-copy support for i40e driver (!), from Björn and Magnus.

2) BPF verifier improvements by giving each register its own liveness
   chain which allows to simplify and getting rid of skip_callee() logic,
   from Edward.

3) Add bpf fs pretty print support for percpu arraymap, percpu hashmap
   and percpu lru hashmap. Also add generic percpu formatted print on
   bpftool so the same can be dumped there, from Yonghong.

4) Add bpf_{set,get}sockopt() helper support for TCP_SAVE_SYN and
   TCP_SAVED_SYN options to allow reflection of tos/tclass from received
   SYN packet, from Nikita.

5) Misc improvements to the BPF sockmap test cases in terms of cgroup v2
   interaction and removal of incorrect shutdown() calls, from John.

6) Few cleanups in xdp_umem_assign_dev() and xdpsock samples, from Prashant.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-31 17:41:08 -07:00
Magnus Karlsson
93ee30f3e8 xsk: i40e: get rid of useless struct xdp_umem_props
This commit gets rid of the structure xdp_umem_props. It was there to
be able to break a dependency at one point, but this is no longer
needed. The values in the struct are instead stored directly in the
xdp_umem structure. This simplifies the xsk code as well as af_xdp
zero-copy drivers and as a bonus gets rid of one internal header file.

The i40e driver is also adapted to the new interface in this commit.

Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-01 01:38:16 +02:00
Prashant Bhole
a29c8bb640 xsk: remove unnecessary assignment
Since xdp_umem_query() was added one assignment of bpf.command was
missed from cleanup. Removing the assignment statement.

Fixes: 84c6b86875 ("xsk: don't allow umem replace at stack level")
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-01 01:36:08 +02:00
Nikita V. Shirokov
1e215300f1 bpf: add TCP_SAVE_SYN/TCP_SAVED_SYN options for bpf_(set|get)sockopt
Adding support for two new bpf get/set sockopts: TCP_SAVE_SYN (set)
and TCP_SAVED_SYN (get). This would allow for bpf program to build
logic based on data from ingress SYN packet (e.g. doing tcp's tos/
tclass reflection (see sample prog)) and do it transparently from
userspace program point of view.

Signed-off-by: Nikita V. Shirokov <tehnerd@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-01 01:35:59 +02:00
Colin Ian King
7296216776 xdp: remove redundant variable 'headroom'
Variable 'headroom' is being assigned but is never used hence it is
redundant and can be removed.

Cleans up clang warning:
variable ‘headroom’ set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-09-01 01:35:53 +02:00
Taehee Yoo
7acfda539c netfilter: nf_tables: release chain in flushing set
When element of verdict map is deleted, the delete routine should
release chain. however, flush element of verdict map routine doesn't
release chain.

test commands:
   %nft add table ip filter
   %nft add chain ip filter c1
   %nft add map ip filter map1 { type ipv4_addr : verdict \; }
   %nft add element ip filter map1 { 1 : jump c1 }
   %nft flush map ip filter map1
   %nft flush ruleset

splat looks like:
[ 4895.170899] kernel BUG at net/netfilter/nf_tables_api.c:1415!
[ 4895.178114] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[ 4895.178880] CPU: 0 PID: 1670 Comm: nft Not tainted 4.18.0+ #55
[ 4895.178880] RIP: 0010:nf_tables_chain_destroy.isra.28+0x39/0x220 [nf_tables]
[ 4895.178880] Code: fc ff df 53 48 89 fb 48 83 c7 50 48 89 fa 48 c1 ea 03 0f b6 04 02 84 c0 74 09 3c 03 7f 05 e8 3e 4c 25 e1 8b 43 50 85 c0 74 02 <0f> 0b 48 89 da 48 b8 00 00 00 00 00 fc ff df 48 c1 ea 03 80 3c 02
[ 4895.228342] RSP: 0018:ffff88010b98f4c0 EFLAGS: 00010202
[ 4895.234841] RAX: 0000000000000001 RBX: ffff8801131c6968 RCX: ffff8801146585b0
[ 4895.234841] RDX: 1ffff10022638d37 RSI: ffff8801191a9348 RDI: ffff8801131c69b8
[ 4895.234841] RBP: ffff8801146585a8 R08: 1ffff1002323526a R09: 0000000000000000
[ 4895.234841] R10: 0000000000000000 R11: 0000000000000000 R12: dead000000000200
[ 4895.234841] R13: dead000000000100 R14: ffffffffa3638af8 R15: dffffc0000000000
[ 4895.234841] FS:  00007f6d188e6700(0000) GS:ffff88011b600000(0000) knlGS:0000000000000000
[ 4895.234841] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 4895.234841] CR2: 00007ffe72b8df88 CR3: 000000010e2d4000 CR4: 00000000001006f0
[ 4895.234841] Call Trace:
[ 4895.234841]  nf_tables_commit+0x2704/0x2c70 [nf_tables]
[ 4895.234841]  ? nfnetlink_rcv_batch+0xa4f/0x11b0 [nfnetlink]
[ 4895.234841]  ? nf_tables_setelem_notify.constprop.48+0x1a0/0x1a0 [nf_tables]
[ 4895.323824]  ? __lock_is_held+0x9d/0x130
[ 4895.323824]  ? kasan_unpoison_shadow+0x30/0x40
[ 4895.333299]  ? kasan_kmalloc+0xa9/0xc0
[ 4895.333299]  ? kmem_cache_alloc_trace+0x2c0/0x310
[ 4895.333299]  ? nfnetlink_rcv_batch+0xa4f/0x11b0 [nfnetlink]
[ 4895.333299]  nfnetlink_rcv_batch+0xdb9/0x11b0 [nfnetlink]
[ 4895.333299]  ? debug_show_all_locks+0x290/0x290
[ 4895.333299]  ? nfnetlink_net_init+0x150/0x150 [nfnetlink]
[ 4895.333299]  ? sched_clock_cpu+0xe5/0x170
[ 4895.333299]  ? sched_clock_local+0xff/0x130
[ 4895.333299]  ? sched_clock_cpu+0xe5/0x170
[ 4895.333299]  ? find_held_lock+0x39/0x1b0
[ 4895.333299]  ? sched_clock_local+0xff/0x130
[ 4895.333299]  ? memset+0x1f/0x40
[ 4895.333299]  ? nla_parse+0x33/0x260
[ 4895.333299]  ? ns_capable_common+0x6e/0x110
[ 4895.333299]  nfnetlink_rcv+0x2c0/0x310 [nfnetlink]
[ ... ]

Fixes: 591054469b ("netfilter: nf_tables: revisit chain/object refcounting from elements")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-31 19:07:11 +02:00
Florian Westphal
e075841220 netfilter: kconfig: nat related expression depend on nftables core
NF_TABLES_IPV4 is now boolean so it is possible to set

NF_TABLES=m
NF_TABLES_IPV4=y
NFT_CHAIN_NAT_IPV4=y

which causes:
nft_chain_nat_ipv4.c:(.text+0x6d): undefined reference to `nft_do_chain'

Wrap NFT_CHAIN_NAT_IPV4 and related nat expressions with NF_TABLES to
restore the dependency.

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Fixes: 02c7b25e5f ("netfilter: nf_tables: build-in filter chain type")
Signed-off-by: Florian Westphal <fw@strlen.de>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-31 19:07:10 +02:00
Björn Töpel
18baed2684 xsk: include XDP meta data in AF_XDP frames
Previously, the AF_XDP (XDP_DRV/XDP_SKB copy-mode) ingress logic did
not include XDP meta data in the data buffers copied out to the user
application.

In this commit, we check if meta data is available, and if so, it is
prepended to the frame.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-30 15:25:40 +02:00
Mukesh Ojha
13ba17bee1 notifier: Remove notifier header file wherever not used
The conversion of the hotplug notifiers to a state machine left the
notifier.h includes around in some places. Remove them.

Signed-off-by: Mukesh Ojha <mojha@codeaurora.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lkml.kernel.org/r/1535114033-4605-1-git-send-email-mojha@codeaurora.org
2018-08-30 12:56:40 +02:00
Johannes Berg
aa58acf325 mac80211: always account for A-MSDU header changes
In the error path of changing the SKB headroom of the second
A-MSDU subframe, we would not account for the already-changed
length of the first frame that just got converted to be in
A-MSDU format and thus is a bit longer now.

Fix this by doing the necessary accounting.

It would be possible to reorder the operations, but that would
make the code more complex (to calculate the necessary pad),
and the headroom expansion should not fail frequently enough
to make that worthwhile.

Fixes: 6e0456b545 ("mac80211: add A-MSDU tx support")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Acked-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-30 11:04:19 +02:00
Lorenzo Bianconi
1eb5079036 mac80211: do not convert to A-MSDU if frag/subframe limited
Do not start to aggregate packets in a A-MSDU frame (converting the
first subframe to A-MSDU, adding the header) if max_tx_fragments or
max_amsdu_subframes limits are already exceeded by it. In particular,
this happens when drivers set the limit to 1 to avoid A-MSDUs at all.

Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
[reword commit message to be more precise]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-30 09:59:22 +02:00
Arunk Khandavalli
4f0223bfe9 cfg80211: nl80211_update_ft_ies() to validate NL80211_ATTR_IE
nl80211_update_ft_ies() tried to validate NL80211_ATTR_IE with
is_valid_ie_attr() before dereferencing it, but that helper function
returns true in case of NULL pointer (i.e., attribute not included).
This can result to dereferencing a NULL pointer. Fix that by explicitly
checking that NL80211_ATTR_IE is included.

Fixes: 355199e02b ("cfg80211: Extend support for IEEE 802.11r Fast BSS Transition")
Signed-off-by: Arunk Khandavalli <akhandav@codeaurora.org>
Signed-off-by: Jouni Malinen <jouni@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-30 09:58:21 +02:00
David S. Miller
f0259b6ac4 Only a few changes at this point:
* new channels in 60 GHz
  * clarify (average) ACK signal reporting API
  * expose ieee80211_send_layer2_update() for all drivers
  * start/stop mac80211's TXQs properly when required
  * avoid regulatory restore with IE ignoring
  * spelling: contidion -> condition
  * fully implement WFA Multi-AP backhaul
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEH1e1rEeCd0AIMq6MB8qZga/fl8QFAluGiMwACgkQB8qZga/f
 l8SHMRAAhlYZNjZIHMqbyqRMFeGkgyfgQRYKb4xhrYr7v0542g5U99MWMNBtUJmq
 8aP4dzUFuPkR0qOi220PCs8PBBjuVcdTK1vq7AYiwiK4Um10/MtAlay6BUJFKVqU
 sJMaMbPy4mB3ocWl/q2K4nKaCZARsr854xwiIJZVwbc8n8t60Mr5ELbzELb5prGS
 jPpeRzYd7m4y4xnSYaiXWchdNOplFRN04NcuKJx10Pr3oWilGlj/ujGvwp78U6Uy
 v1H3T9S4XWMFkvl3deeOS6SVkejx76cvH8Ryoq+/qqQsAgs3c9tPQX+mwj6mNicq
 KsQcMIX6WmHNG5IcyaWat4LzdPgb5Xv31brA5tciZ3jebmIbc0P4dSYDLs7Jq1fg
 gkYuyNV3Jlwzv93RzqcrxfAIquZAvI7fy4CGiiRwtMk3wuHJlGy21PjGqZtWwqbn
 v0MbQf9riv1e653ygKSUpm1UoT5HMFbs6ZzbqpSy7Vr0e9+6B78Xkcp5u1DAvIan
 09YSqzKlypoGC+/802BL34HTpoUnf/hiBzVjYFmqvL/X2qv7oEUMOIv2x9Lg7NHh
 NZOPWcwjtPN57UP97Y6gCRAI2kJTigNdVnKISbIzZgNJ/HhB0M7ZmJ2UpB7EuJ1M
 q5aTIolqdoruwdGJ8d3gRr9xjDcuhhjr+FS8h6KfByJ0Qixroqk=
 =BW1J
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-next-for-davem-2018-08-29' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211-next

Johannes Berg says:

====================
Only a few changes at this point:
 * new channels in 60 GHz
 * clarify (average) ACK signal reporting API
 * expose ieee80211_send_layer2_update() for all drivers
 * start/stop mac80211's TXQs properly when required
 * avoid regulatory restore with IE ignoring
 * spelling: contidion -> condition
 * fully implement WFA Multi-AP backhaul
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-29 22:13:47 -07:00
Paolo Abeni
97763dc0f4 net_sched: reject unknown tcfa_action values
After the commit 802bfb1915 ("net/sched: user-space can't set
unknown tcfa_action values"), unknown tcfa_action values are
converted to TC_ACT_UNSPEC, but the common agreement is instead
rejecting such configurations.

This change also introduces a helper to simplify the destruction
of a single action, avoiding code duplication.

v1 -> v2:
 - helper is now static and renamed according to act_* convention
 - updated extack message, according to the new behavior

Fixes: 802bfb1915 ("net/sched: user-space can't set unknown tcfa_action values")
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-29 22:10:59 -07:00
Doron Roberts-Kedes
0927f71dbc net/tls: Calculate nsg for zerocopy path without skb_cow_data.
decrypt_skb fails if the number of sg elements required to map it
is greater than MAX_SKB_FRAGS. nsg must always be calculated, but
skb_cow_data adds unnecessary memcpy's for the zerocopy case.

The new function skb_nsg calculates the number of scatterlist elements
required to map the skb without the extra overhead of skb_cow_data.
This patch reduces memcpy by 50% on my encrypted NBD benchmarks.

Reported-by: Vakul Garg <Vakul.garg@nxp.com>
Reviewed-by: Vakul Garg <Vakul.garg@nxp.com>
Tested-by: Vakul Garg <Vakul.garg@nxp.com>
Signed-off-by: Doron Roberts-Kedes <doronrk@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-29 19:57:04 -07:00
Peter Oskolkov
0ff89efb52 ip: fail fast on IP defrag errors
The current behavior of IP defragmentation is inconsistent:
- some overlapping/wrong length fragments are dropped without
  affecting the queue;
- most overlapping fragments cause the whole frag queue to be dropped.

This patch brings consistency: if a bad fragment is detected,
the whole frag queue is dropped. Two major benefits:
- fail fast: corrupted frag queues are cleared immediately, instead of
  by timeout;
- testing of overlapping fragments is now much easier: any kind of
  random fragment length mutation now leads to the frag queue being
  discarded (IP packet dropped); before this patch, some overlaps were
  "corrected", with tests not seeing expected packet drops.

Note that in one case (see "if (end&7)" conditional) the current
behavior is preserved as there are concerns that this could be
legitimate padding.

Signed-off-by: Peter Oskolkov <posk@google.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-29 19:49:36 -07:00
Michal Kubecek
9b30049535 ethtool: drop get_settings and set_settings callbacks
Since [gs]et_settings ethtool_ops callbacks have been deprecated in
February 2016, all in tree NIC drivers have been converted to provide
[gs]et_link_ksettings() and out of tree drivers have had enough time to do
the same.

Drop get_settings() and set_settings() and implement both ETHTOOL_[GS]SET
and ETHTOOL_[GS]LINKSETTINGS only using [gs]et_link_ksettings().

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-29 19:46:10 -07:00
Sabrina Dubroca
f707ef61e1 net: rtnl: return early from rtnl_unregister_all when protocol isn't registered
rtnl_unregister_all(PF_INET6) gets called from inet6_init in cases when
no handler has been registered for PF_INET6 yet, for example if
ip6_mr_init() fails. Abort and avoid a NULL pointer deref in that case.

Example of panic (triggered by faking a failure of
 register_pernet_subsys):

    general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
    [...]
    RIP: 0010:rtnl_unregister_all+0x17e/0x2a0
    [...]
    Call Trace:
     ? rtnetlink_net_init+0x250/0x250
     ? sock_unregister+0x103/0x160
     ? kernel_getsockopt+0x200/0x200
     inet6_init+0x197/0x20d

Fixes: e2fddf5e96 ("[IPV6]: Make af_inet6 to check ip6_route_init return value.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-29 19:28:55 -07:00
Sabrina Dubroca
a03dc36bdc ipv6: fix cleanup ordering for pingv6 registration
Commit 6d0bfe2261 ("net: ipv6: Add IPv6 support to the ping socket.")
contains an error in the cleanup path of inet6_init(): when
proto_register(&pingv6_prot, 1) fails, we try to unregister
&pingv6_prot. When rawv6_init() fails, we skip unregistering
&pingv6_prot.

Example of panic (triggered by faking a failure of
 proto_register(&pingv6_prot, 1)):

    general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
    [...]
    RIP: 0010:__list_del_entry_valid+0x79/0x160
    [...]
    Call Trace:
     proto_unregister+0xbb/0x550
     ? trace_preempt_on+0x6f0/0x6f0
     ? sock_no_shutdown+0x10/0x10
     inet6_init+0x153/0x1b8

Fixes: 6d0bfe2261 ("net: ipv6: Add IPv6 support to the ping socket.")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-29 19:28:55 -07:00
Sabrina Dubroca
afe49de44c ipv6: fix cleanup ordering for ip6_mr failure
Commit 15e668070a ("ipv6: reorder icmpv6_init() and ip6_mr_init()")
moved the cleanup label for ipmr_fail, but should have changed the
contents of the cleanup labels as well. Now we can end up cleaning up
icmpv6 even though it hasn't been initialized (jump to icmp_fail or
ipmr_fail).

Simply undo things in the reverse order of their initialization.

Example of panic (triggered by faking a failure of icmpv6_init):

    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN PTI
    [...]
    RIP: 0010:__list_del_entry_valid+0x79/0x160
    [...]
    Call Trace:
     ? lock_release+0x8a0/0x8a0
     unregister_pernet_operations+0xd4/0x560
     ? ops_free_list+0x480/0x480
     ? down_write+0x91/0x130
     ? unregister_pernet_subsys+0x15/0x30
     ? down_read+0x1b0/0x1b0
     ? up_read+0x110/0x110
     ? kmem_cache_create_usercopy+0x1b4/0x240
     unregister_pernet_subsys+0x1d/0x30
     icmpv6_cleanup+0x1d/0x30
     inet6_init+0x1b5/0x23f

Fixes: 15e668070a ("ipv6: reorder icmpv6_init() and ip6_mr_init()")
Signed-off-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-29 19:28:55 -07:00
YueHaibing
8bad008e79 net/ncsi: remove duplicated include from ncsi-netlink.c
Remove duplicated include.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-29 19:10:40 -07:00
Davide Caratti
85eb9af182 net/sched: act_pedit: fix dump of extended layered op
in the (rare) case of failure in nla_nest_start(), missing NULL checks in
tcf_pedit_key_ex_dump() can make the following command

 # tc action add action pedit ex munge ip ttl set 64

dereference a NULL pointer:

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
 PGD 800000007d1cd067 P4D 800000007d1cd067 PUD 7acd3067 PMD 0
 Oops: 0002 [#1] SMP PTI
 CPU: 0 PID: 3336 Comm: tc Tainted: G            E     4.18.0.pedit+ #425
 Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
 RIP: 0010:tcf_pedit_dump+0x19d/0x358 [act_pedit]
 Code: be 02 00 00 00 48 89 df 66 89 44 24 20 e8 9b b1 fd e0 85 c0 75 46 8b 83 c8 00 00 00 49 83 c5 08 48 03 83 d0 00 00 00 4d 39 f5 <66> 89 04 25 00 00 00 00 0f 84 81 01 00 00 41 8b 45 00 48 8d 4c 24
 RSP: 0018:ffffb5d4004478a8 EFLAGS: 00010246
 RAX: ffff8880fcda2070 RBX: ffff8880fadd2900 RCX: 0000000000000000
 RDX: 0000000000000002 RSI: ffffb5d4004478ca RDI: ffff8880fcda206e
 RBP: ffff8880fb9cb900 R08: 0000000000000008 R09: ffff8880fcda206e
 R10: ffff8880fadd2900 R11: 0000000000000000 R12: ffff8880fd26cf40
 R13: ffff8880fc957430 R14: ffff8880fc957430 R15: ffff8880fb9cb988
 FS:  00007f75a537a740(0000) GS:ffff8880fda00000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 CR2: 0000000000000000 CR3: 000000007a2fa005 CR4: 00000000001606f0
 Call Trace:
  ? __nla_reserve+0x38/0x50
  tcf_action_dump_1+0xd2/0x130
  tcf_action_dump+0x6a/0xf0
  tca_get_fill.constprop.31+0xa3/0x120
  tcf_action_add+0xd1/0x170
  tc_ctl_action+0x137/0x150
  rtnetlink_rcv_msg+0x263/0x2d0
  ? _cond_resched+0x15/0x40
  ? rtnl_calcit.isra.30+0x110/0x110
  netlink_rcv_skb+0x4d/0x130
  netlink_unicast+0x1a3/0x250
  netlink_sendmsg+0x2ae/0x3a0
  sock_sendmsg+0x36/0x40
  ___sys_sendmsg+0x26f/0x2d0
  ? do_wp_page+0x8e/0x5f0
  ? handle_pte_fault+0x6c3/0xf50
  ? __handle_mm_fault+0x38e/0x520
  ? __sys_sendmsg+0x5e/0xa0
  __sys_sendmsg+0x5e/0xa0
  do_syscall_64+0x5b/0x180
  entry_SYSCALL_64_after_hwframe+0x44/0xa9
 RIP: 0033:0x7f75a4583ba0
 Code: c3 48 8b 05 f2 62 2c 00 f7 db 64 89 18 48 83 cb ff eb dd 0f 1f 80 00 00 00 00 83 3d fd c3 2c 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8 ae cc 00 00 48 89 04 24
 RSP: 002b:00007fff60ee7418 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
 RAX: ffffffffffffffda RBX: 00007fff60ee7540 RCX: 00007f75a4583ba0
 RDX: 0000000000000000 RSI: 00007fff60ee7490 RDI: 0000000000000003
 RBP: 000000005b842d3e R08: 0000000000000002 R09: 0000000000000000
 R10: 00007fff60ee6ea0 R11: 0000000000000246 R12: 0000000000000000
 R13: 00007fff60ee7554 R14: 0000000000000001 R15: 000000000066c100
 Modules linked in: act_pedit(E) ip6table_filter ip6_tables iptable_filter binfmt_misc crct10dif_pclmul ext4 crc32_pclmul mbcache ghash_clmulni_intel jbd2 pcbc snd_hda_codec_generic snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm aesni_intel crypto_simd snd_timer cryptd glue_helper snd joydev pcspkr soundcore virtio_balloon i2c_piix4 nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c ata_generic pata_acpi virtio_net net_failover virtio_blk virtio_console failover qxl crc32c_intel drm_kms_helper syscopyarea serio_raw sysfillrect sysimgblt fb_sys_fops ttm drm ata_piix virtio_pci libata virtio_ring i2c_core virtio floppy dm_mirror dm_region_hash dm_log dm_mod [last unloaded: act_pedit]
 CR2: 0000000000000000

Like it's done for other TC actions, give up dumping pedit rules and return
an error if nla_nest_start() returns NULL.

Fixes: 71d0ed7079 ("net/act_pedit: Support using offset relative to the conventional network headers")
Signed-off-by: Davide Caratti <dcaratti@redhat.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-29 18:11:05 -07:00
Cong Wang
9a07efa9ae tipc: switch to rhashtable iterator
syzbot reported a use-after-free in tipc_group_fill_sock_diag(),
where tipc_group_fill_sock_diag() still reads tsk->group meanwhile
tipc_group_delete() just deletes it in tipc_release().

tipc_nl_sk_walk() aims to lock this sock when walking each sock
in the hash table to close race conditions with sock changes like
this one, by acquiring tsk->sk.sk_lock.slock spinlock, unfortunately
this doesn't work at all. All non-BH call path should take
lock_sock() instead to make it work.

tipc_nl_sk_walk() brutally iterates with raw rht_for_each_entry_rcu()
where RCU read lock is required, this is the reason why lock_sock()
can't be taken on this path. This could be resolved by switching to
rhashtable iterator API's, where taking a sleepable lock is possible.
Also, the iterator API's are friendly for restartable calls like
diag dump, the last position is remembered behind the scence,
all we need to do here is saving the iterator into cb->args[].

I tested this with parallel tipc diag dump and thousands of tipc
socket creation and release, no crash or memory leak.

Reported-by: syzbot+b9c8f3ab2994b7cd1625@syzkaller.appspotmail.com
Cc: Jon Maloy <jon.maloy@ericsson.com>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-29 18:04:54 -07:00
Cong Wang
bd583fe304 tipc: fix a missing rhashtable_walk_exit()
rhashtable_walk_exit() must be paired with rhashtable_walk_enter().

Fixes: 40f9f43970 ("tipc: Fix tipc_sk_reinit race conditions")
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Ying Xue <ying.xue@windriver.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-29 17:52:58 -07:00
Alexey Kodanev
9f28954614 vti6: remove !skb->ignore_df check from vti6_xmit()
Before the commit d6990976af ("vti6: fix PMTU caching and reporting
on xmit") '!skb->ignore_df' check was always true because the function
skb_scrub_packet() was called before it, resetting ignore_df to zero.

In the commit, skb_scrub_packet() was moved below, and now this check
can be false for the packet, e.g. when sending it in the two fragments,
this prevents successful PMTU updates in such case. The next attempts
to send the packet lead to the same tx error. Moreover, vti6 initial
MTU value relies on PMTU adjustments.

This issue can be reproduced with the following LTP test script:
    udp_ipsec_vti.sh -6 -p ah -m tunnel -s 2000

Fixes: ccd740cbc6 ("vti6: Add pmtu handling to vti6_xmit.")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Acked-by: Steffen Klassert <steffen.klassert@secunet.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-29 17:51:44 -07:00
Björn Töpel
9025403420 xsk: expose xdp_umem_get_{data,dma} to drivers
Move the xdp_umem_get_{data,dma} functions to include/net/xdp_sock.h,
so that the upcoming zero-copy implementation in the Ethernet drivers
can utilize them.

Also, supply some dummy function implementations for
CONFIG_XDP_SOCKETS=n configs.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-29 12:25:53 -07:00
Björn Töpel
dce5bd6140 xdp: export xdp_rxq_info_unreg_mem_model
Export __xdp_rxq_info_unreg_mem_model as xdp_rxq_info_unreg_mem_model,
so it can be used from netdev drivers. Also, add additional checks for
the memory type.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-29 12:25:53 -07:00
Björn Töpel
b0d1beeff2 xdp: implement convert_to_xdp_frame for MEM_TYPE_ZERO_COPY
This commit adds proper MEM_TYPE_ZERO_COPY support for
convert_to_xdp_frame. Converting a MEM_TYPE_ZERO_COPY xdp_buff to an
xdp_frame is done by transforming the MEM_TYPE_ZERO_COPY buffer into a
MEM_TYPE_PAGE_ORDER0 frame. This is costly, and in the future it might
make sense to implement a more sophisticated thread-safe alloc/free
scheme for MEM_TYPE_ZERO_COPY, so that no allocation and copy is
required in the fast-path.

Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-29 12:25:53 -07:00
Daniel Borkmann
a8cf76a902 bpf: fix sg shift repair start offset in bpf_msg_pull_data
When we perform the sg shift repair for the scatterlist ring, we
currently start out at i = first_sg + 1. However, this is not
correct since the first_sg could point to the sge sitting at slot
MAX_SKB_FRAGS - 1, and a subsequent i = MAX_SKB_FRAGS will access
the scatterlist ring (sg) out of bounds. Add the sk_msg_iter_var()
helper for iterating through the ring, and apply the same rule
for advancing to the next ring element as we do elsewhere. Later
work will use this helper also in other places.

Fixes: 015632bb30 ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-29 10:47:17 -07:00
Daniel Borkmann
2e43f95dd8 bpf: fix shift upon scatterlist ring wrap-around in bpf_msg_pull_data
If first_sg and last_sg wraps around in the scatterlist ring, then we
need to account for that in the shift as well. E.g. crafting such msgs
where this is the case leads to a hang as shift becomes negative. E.g.
consider the following scenario:

  first_sg := 14     |=>    shift := -12     msg->sg_start := 10
  last_sg  :=  3     |                       msg->sg_end   :=  5

round  1:  i := 15, move_from :=   3, sg[15] := sg[  3]
round  2:  i :=  0, move_from := -12, sg[ 0] := sg[-12]
round  3:  i :=  1, move_from := -11, sg[ 1] := sg[-11]
round  4:  i :=  2, move_from := -10, sg[ 2] := sg[-10]
[...]
round 13:  i := 11, move_from :=  -1, sg[ 2] := sg[ -1]
round 14:  i := 12, move_from :=   0, sg[ 2] := sg[  0]
round 15:  i := 13, move_from :=   1, sg[ 2] := sg[  1]
round 16:  i := 14, move_from :=   2, sg[ 2] := sg[  2]
round 17:  i := 15, move_from :=   3, sg[ 2] := sg[  3]
[...]

This means we will loop forever and never hit the msg->sg_end condition
to break out of the loop. When we see that the ring wraps around, then
the shift should be MAX_SKB_FRAGS - first_sg + last_sg - 1. Meaning,
the remainder slots from the tail of the ring and the head until last_sg
combined.

Fixes: 015632bb30 ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-29 10:47:17 -07:00
Daniel Borkmann
0e06b227c5 bpf: fix msg->data/data_end after sg shift repair in bpf_msg_pull_data
In the current code, msg->data is set as sg_virt(&sg[i]) + start - offset
and msg->data_end relative to it as msg->data + bytes. Using iterator i
to point to the updated starting scatterlist element holds true for some
cases, however not for all where we'd end up pointing out of bounds. It
is /correct/ for these ones:

1) When first finding the starting scatterlist element (sge) where we
   find that the page is already privately owned by the msg and where
   the requested bytes and headroom fit into the sge's length.

However, it's /incorrect/ for the following ones:

2) After we made the requested area private and updated the newly allocated
   page into first_sg slot of the scatterlist ring; when we find that no
   shift repair of the ring is needed where we bail out updating msg->data
   and msg->data_end. At that point i will point to last_sg, which in this
   case is the next elem of first_sg in the ring. The sge at that point
   might as well be invalid (e.g. i == msg->sg_end), which we use for
   setting the range of sg_virt(&sg[i]). The correct one would have been
   first_sg.

3) Similar as in 2) but when we find that a shift repair of the ring is
   needed. In this case we fix up all sges and stop once we've reached the
   end. In this case i will point to will point to the new msg->sg_end,
   and the sge at that point will be invalid. Again here the requested
   range sits in first_sg.

Fixes: 015632bb30 ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-29 10:47:17 -07:00
Arnd Bergmann
c2e6c8567a y2038: socket: Change recvmmsg to use __kernel_timespec
This converts the recvmmsg() system call in all its variations to use
'timespec64' internally for its timeout, and have a __kernel_timespec64
argument in the native entry point. This lets us change the type to use
64-bit time_t at a later point while using the 32-bit compat system call
emulation for existing user space.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-29 15:42:24 +02:00
Florian Westphal
0434ccdcf8 netfilter: nf_tables: rework ct timeout set support
Using a private template is problematic:

1. We can't assign both a zone and a timeout policy
   (zone assigns a conntrack template, so we hit problem 1)
2. Using a template needs to take care of ct refcount, else we'll
   eventually free the private template due to ->use underflow.

This patch reworks template policy to instead work with existing conntrack.

As long as such conntrack has not yet been placed into the hash table
(unconfirmed) we can still add the timeout extension.

The only caveat is that we now need to update/correct ct->timeout to
reflect the initial/new state, otherwise the conntrack entry retains the
default 'new' timeout.

Side effect of this change is that setting the policy must
now occur from chains that are evaluated *after* the conntrack lookup
has taken place.

No released kernel contains the timeout policy feature yet, so this change
should be ok.

Changes since v2:
 - don't handle 'ct is confirmed case'
 - after previous patch, no need to special-case tcp/dccp/sctp timeout
   anymore

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-29 13:04:38 +02:00
Florian Westphal
ef39078d63 netfilter: conntrack: place 'new' timeout in first location too
tcp, sctp and dccp trackers re-use the userspace ctnetlink states
to index their timeout arrays, which means timeout[0] is never
used.  Copy the 'new' state (syn-sent, dccp-request, ..) to 0 as well
so external users can simply read it off timeouts[0] without need to
differentiate dccp/sctp/tcp and udp/icmp/gre/generic.

The alternative is to map all array accesses to 'i - 1', but that
is a much more intrusive change.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-29 13:04:38 +02:00
Sara Sharon
166ac9d55b mac80211: avoid kernel panic when building AMSDU from non-linear SKB
When building building AMSDU from non-linear SKB, we hit a
kernel panic when trying to push the padding to the tail.
Instead, put the padding at the head of the next subframe.
This also fixes the A-MSDU subframes to not have the padding
accounted in the length field and not have pad at all for
the last subframe, both required by the spec.

Fixes: 6e0456b545 ("mac80211: add A-MSDU tx support")
Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Reviewed-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-29 12:17:55 +02:00
Yuan-Chi Pang
1f631c3201 mac80211: mesh: fix HWMP sequence numbering to follow standard
IEEE 802.11-2016 14.10.8.3 HWMP sequence numbering says:
If it is a target mesh STA, it shall update its own HWMP SN to
maximum (current HWMP SN, target HWMP SN in the PREQ element) + 1
immediately before it generates a PREP element in response to a
PREQ element.

Signed-off-by: Yuan-Chi Pang <fu3mo6goo@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-29 11:15:30 +02:00
Balaji Pothunoori
9c06602b1b cfg80211: clarify frames covered by average ACK signal report
Modify the API to include all ACK frames in average ACK
signal strength reporting, not just ACKs for data frames.
Make exposing the data conditional on implementing the
extended feature flag.

This is how it was really implemented in mac80211, update
the code there to use the new defines and clean up some of
the setting code.

Keep nl80211.h source compatibility by keeping the old names.

Signed-off-by: Balaji Pothunoori <bpothuno@codeaurora.org>
[rewrite commit log, change compatibility to be old=new
 instead of the other way around, update kernel-doc,
 roll in mac80211 changes, make mac80211 depend on valid
 bit instead of HW flag]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-29 11:01:51 +02:00
Sathishkumar Muruganandam
1ecef20cf1 mac80211: add missing WFA Multi-AP backhaul STA Rx requirement
The current mac80211 WDS (4-address mode) can be used to cover most of the
Multi-AP requirements for Data frames per the WFA Multi-AP Specification v1.0.
When configuring AP/STA interfaces in 4-address mode, they are able to function
as fronthaul AP/backhaul STA of Multi-AP device complying below
Tx, Rx requirements except one missing STA Rx requirement added by this patch.

Multi-AP specification section 14.1 describes the following requirements:

Transmitter requirements
------------------------
1. Fronthaul AP
        i) When DA!=RA of backhaul STA, must use 4-address format
        ii) When DA==RA of backhaul STA, shall use either 3-address
            or 4-address format with RA updated with STA MAC

            (mac80211 support 4-address format via AP/VLAN interface)

2. Backhaul STA
        i) When SA!=TA of backhaul STA, must use 4-address format
        ii) When SA==TA of backhaul STA, shall use either 3-address
            or 4-address format with RA updated with AP MAC

            (mac80211 support 4-address format via use_4addr)

Receiver requirements
---------------------
1. Fronthaul AP
        i) When SA!=TA of backhaul STA, must support receiving 4-address
           format frames
        ii) When SA==TA of backhaul STA, must support receiving both
            3-address and 4-address format frames

            (mac80211 support both 3-addr & 4-addr via AP/VLAN interface)

2. Backhaul STA
        i) When DA!=RA of backhaul STA, must support receiving 4-address
           format frames
        ii) When DA==RA of backhaul STA,  must support receiving both
            3-address and 4-address format frames

            (mac80211 support only receiving 4-address format via
             use_4addr)

This patch addresses the above Rx requirement (ii) for backhaul STA to receive
unicast (DA==RA) 3-address frames in addition to 4-address frames.

The current design doesn't accept 3-address frames when configured in 4-address
mode (use_4addr). Hence add a check to allow 3-address frames when DA==RA of
backhaul STA (adhering to Table 9-26 of IEEE Std 802.11™-2016).

This case was tested with a bridged station interface when associated with
a non-mac80211 based vendor AP implementation using 3-address frames for WDS.

STA was able to support the Multi-AP Rx requirement when DA==RA. No issues,
no loops seen when tested with mac80211 based AP as well.

Verified and confirmed all other Tx and Rx requirements of AP and STA for
Multi-AP respectively. They all work using the current mac80211-WDS design.

Signed-off-by: Sathishkumar Muruganandam <murugana@codeaurora.org>

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-29 09:25:17 +02:00
Shannon Nelson
4a132095dd xfrm: allow driver to quietly refuse offload
If the "offload" attribute is used to create an IPsec SA
and the .xdo_dev_state_add() fails, the SA creation fails.
However, if the "offload" attribute is used on a device that
doesn't offer it, the attribute is quietly ignored and the SA
is created without an offload.

Along the same line of that second case, it would be good to
have a way for the device to refuse to offload an SA without
failing the whole SA creation.  This patch adds that feature
by allowing the driver to return -EOPNOTSUPP as a signal that
the SA may be fine, it just can't be offloaded.

This allows the user a little more flexibility in requesting
offloads and not needing to know every detail at all times about
each specific NIC when trying to create SAs.

Signed-off-by: Shannon Nelson <shannon.nelson@oracle.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-08-29 08:04:44 +02:00
Haishuang Yan
0c05f98376 esp: remove redundant define esph
The pointer 'esph' is defined but is never used hence it is redundant
and canbe removed.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-08-29 08:02:43 +02:00
Wei Yongjun
211d6f2dc8 xfrm: Make function xfrmi_get_link_net() static
Fixes the following sparse warning:

net/xfrm/xfrm_interface.c:745:12: warning:
 symbol 'xfrmi_get_link_net' was not declared. Should it be static?

Fixes: f203b76d78 ("xfrm: Add virtual xfrm interfaces")
Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-08-29 07:56:36 +02:00
Daniel Borkmann
5b24109b05 bpf: fix several offset tests in bpf_msg_pull_data
While recently going over bpf_msg_pull_data(), I noticed three
issues which are fixed in here:

1) When we attempt to find the first scatterlist element (sge)
   for the start offset, we add len to the offset before we check
   for start < offset + len, whereas it should come after when
   we iterate to the next sge to accumulate the offsets. For
   example, given a start offset of 12 with a sge length of 8
   for the first sge in the list would lead us to determine this
   sge as the first sge thinking it covers first 16 bytes where
   start is located, whereas start sits in subsequent sges so
   we would end up pulling in the wrong data.

2) After figuring out the starting sge, we have a short-cut test
   in !msg->sg_copy[i] && bytes <= len. This checks whether it's
   not needed to make the page at the sge private where we can
   just exit by updating msg->data and msg->data_end. However,
   the length test is not fully correct. bytes <= len checks
   whether the requested bytes (end - start offsets) fit into the
   sge's length. The part that is missing is that start must not
   be sge length aligned. Meaning, the start offset into the sge
   needs to be accounted as well on top of the requested bytes
   as otherwise we can access the sge out of bounds. For example
   the sge could have length of 8, our requested bytes could have
   length of 8, but at a start offset of 4, so we also would need
   to pull in 4 bytes of the next sge, when we jump to the out
   label we do set msg->data to sg_virt(&sg[i]) + start - offset
   and msg->data_end to msg->data + bytes which would be oob.

3) The subsequent bytes < copy test for finding the last sge has
   the same issue as in point 2) but also it tests for less than
   rather than less or equal to. Meaning if the sge length is of
   8 and requested bytes of 8 while having the start aligned with
   the sge, we would unnecessarily go and pull in the next sge as
   well to make it private.

Fixes: 015632bb30 ("bpf: sk_msg program helper bpf_sk_msg_pull_data")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-28 22:23:45 -07:00
Matthew Wilcox
6348b903d7 9p: Remove p9_idpool
There are no more users left of the p9_idpool; delete it.

Link: http://lkml.kernel.org/r/20180711210225.19730-7-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-29 13:39:57 +09:00
Matthew Wilcox
996d5b4db4 9p: Use a slab for allocating requests
Replace the custom batch allocation with a slab.  Use an IDR to store
pointers to the active requests instead of an array.  We don't try to
handle P9_NOTAG specially; the IDR will happily shrink all the way back
once the TVERSION call has completed.

Link: http://lkml.kernel.org/r/20180711210225.19730-6-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-29 13:39:57 +09:00
Dominique Martinet
62e3941776 9p: clear dangling pointers in p9stat_free
p9stat_free is more of a cleanup function than a 'free' function as it
only frees the content of the struct; there are chances of use-after-free
if it is improperly used (e.g. p9stat_free called twice as it used to be
possible to)

Clearing dangling pointers makes the function idempotent and safer to use.

Link: http://lkml.kernel.org/r/1535410108-20650-2-git-send-email-asmadeus@codewreck.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Reported-by: syzbot+d4252148d198410b864f@syzkaller.appspotmail.com
2018-08-29 13:39:57 +09:00
Haim Dreyfuss
b88d26d97c nl80211: Pass center frequency in kHz instead of MHz
freq_reg_info expects to get the frequency in kHz. Instead we
accidently pass it in MHz.  Thus, currently the function always
return ERR rule. Fix that.

Fixes: 50f32718e1 ("nl80211: Add wmm rule attribute to NL80211_CMD_GET_WIPHY dump command")
Signed-off-by: Haim Dreyfuss <haim.dreyfuss@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
[fix kHz/MHz in commit message]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-28 11:41:23 +02:00
Haim Dreyfuss
d3c89bbc74 nl80211: Fix nla_put_u8 to u16 for NL80211_WMMR_TXOP
TXOP (also known as Channel Occupancy Time) is u16 and should be
added using nla_put_u16 instead of u8, fix that.

Fixes: 50f32718e1 ("nl80211: Add wmm rule attribute to NL80211_CMD_GET_WIPHY dump command")
Signed-off-by: Haim Dreyfuss <haim.dreyfuss@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-28 11:37:28 +02:00
Alexei Avshalom Lazar
9cf0a0b4b6 cfg80211: Add support for 60GHz band channels 5 and 6
The current support in the 60GHz band is for channels 1-4.
Add support for channels 5 and 6.
This requires enlarging ieee80211_channel.center_freq from u16 to u32.

Signed-off-by: Alexei Avshalom Lazar <ailizaro@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-28 11:23:08 +02:00
Manikanta Pubbisetty
21a5d4c3a4 mac80211: add stop/start logic for software TXQs
Sometimes, it is required to stop the transmissions momentarily and
resume it later; stopping the txqs becomes very critical in scenarios where
the packet transmission has to be ceased completely. For example, during
the hardware restart, during off channel operations,
when initiating CSA(upon detecting a radar on the DFS channel), etc.

The TX queue stop/start logic in mac80211 works well in stopping the TX
when drivers make use of netdev queues, i.e, when Qdiscs in network layer
take care of traffic scheduling. Since the devices implementing
wake_tx_queue can run without Qdiscs, packets will be handed to mac80211
directly without queueing them in the netdev queues.

Also, mac80211 does not invoke any of the
netif_stop_*/netif_wake_* APIs if wake_tx_queue is implemented.
Since the queues are not stopped in this case, transmissions can continue
and this will impact negatively on the operation of the wireless device.

For example,
During hardware restart, we stop the netdev queues so that packets are
not sent to the driver. Since ath10k implements wake_tx_queue,
TX queues will not be stopped and packets might reach the hardware while
it is restarting; this can make hardware unresponsive and the only
possible option for recovery is to reboot the entire system.

There is another problem to this, it is observed that the packets
were sent on the DFS channel for a prolonged duration after radar
detection impacting the channel closing time.

We can still invoke netif stop/wake APIs when wake_tx_queue is implemented
but this could lead to packet drops in network layer; adding stop/start
logic for software TXQs in mac80211 instead makes more sense; the change
proposed adds the same in mac80211.

Signed-off-by: Manikanta Pubbisetty <mpubbise@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-28 11:16:35 +02:00
Rajeev Kumar Sirasanagandla
7417844b63 cfg80211: Avoid regulatory restore when COUNTRY_IE_IGNORE is set
When REGULATORY_COUNTRY_IE_IGNORE is set,  __reg_process_hint_country_ie()
ignores the country code change request from __cfg80211_connect_result()
via regulatory_hint_country_ie().

After Disconnect, similar to above, country code should not be reset to
world when country IE ignore is set. But this is violated and restore of
regulatory settings is invoked by cfg80211_disconnect_work via
regulatory_hint_disconnect().

To address this, avoid regulatory restore from regulatory_hint_disconnect()
when COUNTRY_IE_IGNORE is set.

Note: Currently, restore_regulatory_settings() takes care of clearing
beacon hints. But in the proposed change, regulatory restore is avoided.
Therefore, explicitly clear beacon hints when DISABLE_BEACON_HINTS
is not set.

Signed-off-by: Rajeev Kumar Sirasanagandla <rsirasan@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-28 11:15:51 +02:00
Dedy Lansky
30ca1aa536 cfg80211/mac80211: make ieee80211_send_layer2_update a public function
Make ieee80211_send_layer2_update() a common function so other drivers
can re-use it.

Signed-off-by: Dedy Lansky <dlansky@codeaurora.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-28 11:15:27 +02:00
Richard Guy Briggs
f404c3ecc4 rfkill: fix spelling mistake contidion to condition
This came about while trying to determine if there would be any pattern
match on contid, a new audit container identifier internal variable.
This was the only one.

Signed-off-by: Richard Guy Briggs <rgb@redhat.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-28 11:15:27 +02:00
Emmanuel Grumbach
20932750d9 mac80211: don't update the PM state of a peer upon a multicast frame
I changed the way mac80211 updates the PM state of the peer.
I forgot that we could also have multicast frames from the
peer and that those frame should of course not change the
PM state of the peer: A peer goes to power save when it
needs to scan, but it won't send the broadcast Probe Request
with the PM bit set.

This made us mark the peer as awake when it wasn't and then
Intel's firmware would fail to transmit because the peer is
asleep according to its database. The driver warned about
this and it looked like this:

 WARNING: CPU: 0 PID: 184 at /usr/src/linux-4.16.14/drivers/net/wireless/intel/iwlwifi/mvm/tx.c:1369 iwl_mvm_rx_tx_cmd+0x53b/0x860
 CPU: 0 PID: 184 Comm: irq/124-iwlwifi Not tainted 4.16.14 #1
 RIP: 0010:iwl_mvm_rx_tx_cmd+0x53b/0x860
 Call Trace:
  iwl_pcie_rx_handle+0x220/0x880
  iwl_pcie_irq_handler+0x6c9/0xa20
  ? irq_forced_thread_fn+0x60/0x60
  ? irq_thread_dtor+0x90/0x90

The relevant code that spits the WARNING is:

        case TX_STATUS_FAIL_DEST_PS:
                /* the FW should have stopped the queue and not
                 * return this status
                 */
                WARN_ON(1);
                info->flags |= IEEE80211_TX_STAT_TX_FILTERED;

This fixes https://bugzilla.kernel.org/show_bug.cgi?id=199967.

Fixes: 9fef654433 ("mac80211: always update the PM state of a peer on MGMT / DATA frames")
Cc: <stable@vger.kernel.org>   #4.16+
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-28 11:12:42 +02:00
Stanislaw Gruszka
38cb87ee47 cfg80211: make wmm_rule part of the reg_rule structure
Make wmm_rule be part of the reg_rule structure. This simplifies the
code a lot at the cost of having bigger memory usage. However in most
cases we have only few reg_rule's and when we do have many like in
iwlwifi we do not save memory as it allocates a separate wmm_rule for
each channel anyway.

This also fixes a bug reported in various places where somewhere the
pointers were corrupted and we ended up doing a null-dereference.

Fixes: 230ebaa189 ("cfg80211: read wmm rules from regulatory database")
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
[rephrase commit message slightly]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-28 11:11:47 +02:00
Danek Duvall
67d1ba8a6d mac80211: correct use of IEEE80211_VHT_CAP_RXSTBC_X
The mod mask for VHT capabilities intends to say that you can override
the number of STBC receive streams, and it does, but only by accident.
The IEEE80211_VHT_CAP_RXSTBC_X aren't bits to be set, but values (albeit
left-shifted).  ORing the bits together gets the right answer, but we
should use the _MASK macro here instead.

Signed-off-by: Danek Duvall <duvall@comfychair.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-28 11:08:59 +02:00
Stefan Agner
3f6e138d41 bpf: fix build error with clang
Building the newly introduced BPF_PROG_TYPE_SK_REUSEPORT leads to
a compile time error when building with clang:
net/core/filter.o: In function `sk_reuseport_convert_ctx_access':
  ../net/core/filter.c:7284: undefined reference to `__compiletime_assert_7284'

It seems that clang has issues resolving hweight_long at compile
time. Since SK_FL_PROTO_MASK is a constant, we can use the interface
for known constant arguments which works fine with clang.

Fixes: 2dbb9b9e6d ("bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT")
Signed-off-by: Stefan Agner <stefan@agner.ch>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-27 20:37:05 -07:00
Zhu Yanjun
53ae914d89 net/rds: Use rdma_read_gids to get connection SGID/DGID in IPv6
In IPv4, the newly introduced rdma_read_gids is used to read the SGID/DGID
for the connection which returns GID correctly for RoCE transport as well.

In IPv6, rdma_read_gids is also used. The following are why rdma_read_gids
is introduced.

rdma_addr_get_dgid() for RoCE for client side connections returns MAC
address, instead of DGID.
rdma_addr_get_sgid() for RoCE doesn't return correct SGID for IPv6 and
when more than one IP address is assigned to the netdevice.

So the transport agnostic rdma_read_gids() API is provided by rdma_cm
module.

Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-27 15:26:01 -07:00
Linus Walleij
ad8619864f net: dsa: Drop GPIO includes
Commit 52638f71fc ("dsa: Move gpio reset into switch driver")
moved the GPIO handling into the switch drivers but forgot
to remove the GPIO header includes.

Signed-off-by: Linus Walleij <linus.walleij@linaro.org>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-27 15:24:33 -07:00
Haiqing Bai
30935198b7 tipc: fix the big/little endian issue in tipc_dest
In function tipc_dest_push, the 32bit variables 'node' and 'port'
are stored separately in uppper and lower part of 64bit 'value'.
Then this value is assigned to dst->value which is a union like:
union
{
  struct {
    u32 port;
    u32 node;
  };
  u64 value;
}
This works on little-endian machines like x86 but fails on big-endian
machines.

The fix remove the 'value' stack parameter and even the 'value'
member of the union in tipc_dest, assign the 'node' and 'port' member
directly with the input parameter to avoid the endian issue.

Fixes: a80ae5306a ("tipc: improve destination linked list")
Signed-off-by: Zhenbo Gao <zhenbo.gao@windriver.com>
Acked-by: Jon Maloy <jon.maloy@ericsson.com>
Signed-off-by: Haiqing Bai <Haiqing.Bai@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-27 15:23:31 -07:00
Jiri Pirko
b7b4247d55 net: sched: return -ENOENT when trying to remove filter from non-existent chain
When chain 0 was implicitly created, removal of non-existent filter from
chain 0 gave -ENOENT. Once chain 0 became non-implicit, the same call is
giving -EINVAL. Fix this by returning -ENOENT in that case.

Reported-by: Roman Mashak <mrv@mojatatu.com>
Fixes: f71e0ca4db ("net: sched: Avoid implicit chain 0 creation")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-27 15:16:16 -07:00
Jiri Pirko
d5ed72a55b net: sched: fix extack error message when chain is failed to be created
Instead "Cannot find" say "Cannot create".

Fixes: c35a4acc29 ("net: sched: cls_api: handle generic cls errors")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-27 15:16:16 -07:00
Xin Long
84581bdae9 erspan: set erspan_ver to 1 by default when adding an erspan dev
After erspan_ver is introudced, if erspan_ver is not set in iproute, its
value will be left 0 by default. Since Commit 02f99df187 ("erspan: fix
invalid erspan version."), it has broken the traffic due to the version
check in erspan_xmit if users are not aware of 'erspan_ver' param, like
using an old version of iproute.

To fix this compatibility problem, it sets erspan_ver to 1 by default
when adding an erspan dev in erspan_setup. Note that we can't do it in
ipgre_netlink_parms, as this function is also used by ipgre_changelink.

Fixes: 02f99df187 ("erspan: fix invalid erspan version.")
Reported-by: Jianlin Shi <jishi@redhat.com>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-27 15:13:17 -07:00
Xin Long
834539e69a sctp: remove useless start_fail from sctp_ht_iter in proc
After changing rhashtable_walk_start to return void, start_fail would
never be set other value than 0, and the checking for start_fail is
pointless, so remove it.

Fixes: 97a6ec4ac0 ("rhashtable: Change rhashtable_walk_start to return void")
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-27 15:13:17 -07:00
Xin Long
bab1be79a5 sctp: hold transport before accessing its asoc in sctp_transport_get_next
As Marcelo noticed, in sctp_transport_get_next, it is iterating over
transports but then also accessing the association directly, without
checking any refcnts before that, which can cause an use-after-free
Read.

So fix it by holding transport before accessing the association. With
that, sctp_transport_hold calls can be removed in the later places.

Fixes: 626d16f50f ("sctp: export some apis or variables for sctp_diag and reuse some for proc")
Reported-by: syzbot+fe62a0c9aa6a85c6de16@syzkaller.appspotmail.com
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-27 15:13:17 -07:00
Linus Torvalds
050cdc6c95 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) ICE, E1000, IGB, IXGBE, and I40E bug fixes from the Intel folks.

 2) Better fix for AB-BA deadlock in packet scheduler code, from Cong
    Wang.

 3) bpf sockmap fixes (zero sized key handling, etc.) from Daniel
    Borkmann.

 4) Send zero IPID in TCP resets and SYN-RECV state ACKs, to prevent
    attackers using it as a side-channel. From Eric Dumazet.

 5) Memory leak in mediatek bluetooth driver, from Gustavo A. R. Silva.

 6) Hook up rt->dst.input of ipv6 anycast routes properly, from Hangbin
    Liu.

 7) hns and hns3 bug fixes from Huazhong Tan.

 8) Fix RIF leak in mlxsw driver, from Ido Schimmel.

 9) iova range check fix in vhost, from Jason Wang.

10) Fix hang in do_tcp_sendpages() with tls, from John Fastabend.

11) More r8152 chips need to disable RX aggregation, from Kai-Heng Feng.

12) Memory exposure in TCA_U32_SEL handling, from Kees Cook.

13) TCP BBR congestion control fixes from Kevin Yang.

14) hv_netvsc, ignore non-PCI devices, from Stephen Hemminger.

15) qed driver fixes from Tomer Tayar.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (77 commits)
  net: sched: Fix memory exposure from short TCA_U32_SEL
  qed: fix spelling mistake "comparsion" -> "comparison"
  vhost: correctly check the iova range when waking virtqueue
  qlge: Fix netdev features configuration.
  net: macb: do not disable MDIO bus at open/close time
  Revert "net: stmmac: fix build failure due to missing COMMON_CLK dependency"
  net: macb: Fix regression breaking non-MDIO fixed-link PHYs
  mlxsw: spectrum_switchdev: Do not leak RIFs when removing bridge
  i40e: fix condition of WARN_ONCE for stat strings
  i40e: Fix for Tx timeouts when interface is brought up if DCB is enabled
  ixgbe: fix driver behaviour after issuing VFLR
  ixgbe: Prevent unsupported configurations with XDP
  ixgbe: Replace GFP_ATOMIC with GFP_KERNEL
  igb: Replace mdelay() with msleep() in igb_integrated_phy_loopback()
  igb: Replace GFP_ATOMIC with GFP_KERNEL in igb_sw_init()
  igb: Use an advanced ctx descriptor for launchtime
  e1000: ensure to free old tx/rx rings in set_ringparam()
  e1000: check on netif_running() before calling e1000_up()
  ixgb: use dma_zalloc_coherent instead of allocator/memset
  ice: Trivial formatting fixes
  ...
2018-08-27 11:59:39 -07:00
Arnd Bergmann
9afc5eee65 y2038: globally rename compat_time to old_time32
Christoph Hellwig suggested a slightly different path for handling
backwards compatibility with the 32-bit time_t based system calls:

Rather than simply reusing the compat_sys_* entry points on 32-bit
architectures unchanged, we get rid of those entry points and the
compat_time types by renaming them to something that makes more sense
on 32-bit architectures (which don't have a compat mode otherwise),
and then share the entry points under the new name with the 64-bit
architectures that use them for implementing the compatibility.

The following types and interfaces are renamed here, and moved
from linux/compat_time.h to linux/time32.h:

old				new
---				---
compat_time_t			old_time32_t
struct compat_timeval		struct old_timeval32
struct compat_timespec		struct old_timespec32
struct compat_itimerspec	struct old_itimerspec32
ns_to_compat_timeval()		ns_to_old_timeval32()
get_compat_itimerspec64()	get_old_itimerspec32()
put_compat_itimerspec64()	put_old_itimerspec32()
compat_get_timespec64()		get_old_timespec32()
compat_put_timespec64()		put_old_timespec32()

As we already have aliases in place, this patch addresses only the
instances that are relevant to the system call interface in particular,
not those that occur in device drivers and other modules. Those
will get handled separately, while providing the 64-bit version
of the respective interfaces.

I'm not renaming the timex, rusage and itimerval structures, as we are
still debating what the new interface will look like, and whether we
will need a replacement at all.

This also doesn't change the names of the syscall entry points, which can
be done more easily when we actually switch over the 32-bit architectures
to use them, at that point we need to change COMPAT_SYSCALL_DEFINEx to
SYSCALL_DEFINEx with a new name, e.g. with a _time32 suffix.

Suggested-by: Christoph Hellwig <hch@infradead.org>
Link: https://lore.kernel.org/lkml/20180705222110.GA5698@infradead.org/
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
2018-08-27 14:48:48 +02:00
Dominique Martinet
2f9ad0ac94 9p/xen: fix check for xenbus_read error in front_probe
If the xen bus exists but does not expose the proper interface, it is
possible to get a non-zero length but still some error, leading to
strcmp failing trying to load invalid memory addresses e.g.
fffffffffffffffe.

There is then no need to check length when there is no error, as the
xenbus driver guarantees that the string is nul-terminated.

Link: http://lkml.kernel.org/r/1534236007-10170-1-git-send-email-asmadeus@codewreck.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
Reviewed-by: Stefano Stabellini <sstabellini@kernel.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Latchesar Ionkov <lucho@ionkov.net>
2018-08-27 14:59:29 +09:00
Kees Cook
98c8f125fd net: sched: Fix memory exposure from short TCA_U32_SEL
Via u32_change(), TCA_U32_SEL has an unspecified type in the netlink
policy, so max length isn't enforced, only minimum. This means nkeys
(from userspace) was being trusted without checking the actual size of
nla_len(), which could lead to a memory over-read, and ultimately an
exposure via a call to u32_dump(). Reachability is CAP_NET_ADMIN within
a namespace.

Reported-by: Al Viro <viro@zeniv.linux.org.uk>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Cc: Jiri Pirko <jiri@resnulli.us>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-26 14:21:50 -07:00
Linus Torvalds
aba16dc5cf Merge branch 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax
Pull IDA updates from Matthew Wilcox:
 "A better IDA API:

      id = ida_alloc(ida, GFP_xxx);
      ida_free(ida, id);

  rather than the cumbersome ida_simple_get(), ida_simple_remove().

  The new IDA API is similar to ida_simple_get() but better named.  The
  internal restructuring of the IDA code removes the bitmap
  preallocation nonsense.

  I hope the net -200 lines of code is convincing"

* 'ida-4.19' of git://git.infradead.org/users/willy/linux-dax: (29 commits)
  ida: Change ida_get_new_above to return the id
  ida: Remove old API
  test_ida: check_ida_destroy and check_ida_alloc
  test_ida: Convert check_ida_conv to new API
  test_ida: Move ida_check_max
  test_ida: Move ida_check_leaf
  idr-test: Convert ida_check_nomem to new API
  ida: Start new test_ida module
  target/iscsi: Allocate session IDs from an IDA
  iscsi target: fix session creation failure handling
  drm/vmwgfx: Convert to new IDA API
  dmaengine: Convert to new IDA API
  ppc: Convert vas ID allocation to new IDA API
  media: Convert entity ID allocation to new IDA API
  ppc: Convert mmu context allocation to new IDA API
  Convert net_namespace to new IDA API
  cb710: Convert to new IDA API
  rsxx: Convert to new IDA API
  osd: Convert to new IDA API
  sd: Convert to new IDA API
  ...
2018-08-26 11:48:42 -07:00
Florian Westphal
10568f6c57 netfilter: xt_checksum: ignore gso skbs
Satish Patel reports a skb_warn_bad_offload() splat caused
by -j CHECKSUM rules:

-A POSTROUTING -p tcp -m tcp --sport 80 -j CHECKSUM

The CHECKSUM target has never worked with GSO skbs, and the above rule
makes no sense as kernel will handle checksum updates on transmit.

Unfortunately, there are 3rd party tools that install such rules, so we
cannot reject this from the config plane without potential breakage.

Amend Kconfig text to clarify that the CHECKSUM target is only useful
in virtualized environments, where old dhcp clients that use AF_PACKET
used to discard UDP packets with a 'bad' header checksum and add a
one-time warning in case such rule isn't restricted to UDP.

v2: check IP6T_F_PROTO flag before cmp (Michal Kubecek)

Reported-by: Satish Patel <satish.txt@gmail.com>
Reported-by: Markos Chandras <markos.chandras@suse.com>
Reported-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Florian Westphal <fw@strlen.de>
Reviewed-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-24 09:58:16 +02:00
David S. Miller
ff0fadfffe Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-08-24

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix BPF sockmap and tls where we get a hang in do_tcp_sendpages()
   when sndbuf is full due to missing calls into underlying socket's
   sk_write_space(), from John.

2) Two BPF sockmap fixes to reject invalid parameters on map creation
   and to fix a map element miscount on allocation failure. Another fix
   for BPF hash tables to use per hash table salt for jhash(), from Daniel.

3) Fix for bpftool's command line parsing in order to terminate on bad
   arguments instead of keeping looping in some border cases, from Quentin.

4) Fix error value of xdp_umem_assign_dev() in order to comply with
   expected bind ops error codes, from Prashant.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-23 22:41:55 -07:00
Linus Torvalds
33e17876ea Merge branch 'akpm' (patches from Andrew)
Merge yet more updates from Andrew Morton:

 - the rest of MM

 - various misc fixes and tweaks

* emailed patches from Andrew Morton <akpm@linux-foundation.org>: (22 commits)
  mm: Change return type int to vm_fault_t for fault handlers
  lib/fonts: convert comments to utf-8
  s390: ebcdic: convert comments to UTF-8
  treewide: convert ISO_8859-1 text comments to utf-8
  drivers/gpu/drm/gma500/: change return type to vm_fault_t
  docs/core-api: mm-api: add section about GFP flags
  docs/mm: make GFP flags descriptions usable as kernel-doc
  docs/core-api: split memory management API to a separate file
  docs/core-api: move *{str,mem}dup* to "String Manipulation"
  docs/core-api: kill trailing whitespace in kernel-api.rst
  mm/util: add kernel-doc for kvfree
  mm/util: make strndup_user description a kernel-doc comment
  fs/proc/vmcore.c: hide vmcoredd_mmap_dumps() for nommu builds
  treewide: correct "differenciate" and "instanciate" typos
  fs/afs: use new return type vm_fault_t
  drivers/hwtracing/intel_th/msu.c: change return type to vm_fault_t
  mm: soft-offline: close the race against page allocation
  mm: fix race on soft-offlining free huge pages
  namei: allow restricted O_CREAT of FIFOs and regular files
  hfs: prevent crash on exit from failed search
  ...
2018-08-23 19:20:12 -07:00
Arnd Bergmann
3723c63247 treewide: convert ISO_8859-1 text comments to utf-8
Almost all files in the kernel are either plain text or UTF-8 encoded.  A
couple however are ISO_8859-1, usually just a few characters in a C
comments, for historic reasons.

This converts them all to UTF-8 for consistency.

Link: http://lkml.kernel.org/r/20180724111600.4158975-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Simon Horman <horms@verge.net.au>			[IPVS portion]
Acked-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>	[IIO]
Acked-by: Michael Ellerman <mpe@ellerman.id.au>			[powerpc]
Acked-by: Rob Herring <robh@kernel.org>
Cc: Joe Perches <joe@perches.com>
Cc: Arnd Bergmann <arnd@arndb.de>
Cc: Samuel Ortiz <sameo@linux.intel.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Rob Herring <robh+dt@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2018-08-23 18:48:43 -07:00
Linus Torvalds
53a01c9a5f NFS client updates for Linux 4.19
Stable bufixes:
 - v3.17+: Fix an off-by-one in bl_map_stripe()
 - v4.9+: NFSv4 client live hangs after live data migration recovery
 - v4.18+: xprtrdma: Fix disconnect regression
 - v4.14+: Fix locking in pnfs_generic_recover_commit_reqs
 - v4.9+: Fix a sleep in atomic context in nfs4_callback_sequence()
 
 Features:
 - Add support for asynchronous server-side COPY operations
 
 Other bugfixes and cleanups:
 - Optitmizations and fixes involving NFS v4.1 / pNFS layout handling
 - Optimize lseek(fd, SEEK_CUR, 0) on directories to avoid locking
 - Immediately reschedule writeback when the server replies with an error
 - Fix excessive attribute revalidation in nfs_execute_ok()
 - Add error checking to nfs_idmap_prepare_message()
 - Use new vm_fault_t return type
 - Return a delegation when reclaiming one that the server has recalled
 - Referrals should inherit proto setting from parents
 - Make rpc_auth_create_args a const
 - Improvements to rpc_iostats tracking
 - Fix a potential reference leak when there is an error processing a callback
 - Fix rmdir / mkdir / rename nlink accounting
 - Fix updating inode change attribute
 - Fix error handling in nfsn4_sp4_select_mode()
 - Use an appropriate work queue for direct-write completion
 - Don't busy wait if NFSv4 session draining is interrupted
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCAAdFiEEnZ5MQTpR7cLU7KEp18tUv7ClQOsFAlt/CYIACgkQ18tUv7Cl
 QOu8gBAA0xQWmgRoG6oIdYUxvgYqhuJmMqC4SU1E6mCJ93xEuUSvEFw51X+84KCt
 r6UPkp/bKiVe3EIinKTplIzuxgggXNG0EQmO46FYNTl7nqpN85ffLsQoWsiD23fp
 j8afqKPFR2zfhHXLKQC7k1oiOpwGqJ+EJWgIW4llE80pSNaErEoEaDqSPds5thMN
 dHEjjLr8ef6cbBux6sSPjwWGNbE82uoSu3MDuV2+e62hpGkgvuEYo1vyE6ujeZW5
 MUsmw+AHZkwro0msTtNBOHcPZAS0q/2UMPzl1tsDeCWNl2mugqZ6szQLSS2AThKq
 Zr6iK9Q5dWjJfrQHcjRMnYJB+SCX1SfPA7ASuU34opwcWPjecbS9Q92BNTByQYwN
 o9ngs2K0mZfqpYESMAmf7Il134cCBrtEp3skGko2KopJcYcE5YUFhdKihi1yQQjU
 UbOOubMpQk8vY9DpDCAwGbICKwUZwGvq27uuUWL20kFVDb1+jvfHwcV4KjRAJo/E
 J9aFtU+qOh4rMPMnYlEVZcAZBGfenlv/DmBl1upRpjzBkteUpUJsAbCmGyAk4616
 3RECasehgsjNCQpFIhv3FpUkWzP5jt0T3gRr1NeY6WKJZwYnHEJr9PtapS+EIsCT
 tB5DvvaJqFtuHFOxzn+KlGaxdSodHF7klOq7NM3AC0cX8AkWqaU=
 =8+9t
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs

Pull NFS client updates from Anna Schumaker:
 "These patches include adding async support for the v4.2 COPY
  operation. I think Bruce is planning to send the server patches for
  the next release, but I figured we could get the client side out of
  the way now since it's been in my tree for a while. This shouldn't
  cause any problems, since the server will still respond with
  synchronous copies even if the client requests async.

  Features:
   - Add support for asynchronous server-side COPY operations

  Stable bufixes:
   - Fix an off-by-one in bl_map_stripe() (v3.17+)
   - NFSv4 client live hangs after live data migration recovery (v4.9+)
   - xprtrdma: Fix disconnect regression (v4.18+)
   - Fix locking in pnfs_generic_recover_commit_reqs (v4.14+)
   - Fix a sleep in atomic context in nfs4_callback_sequence() (v4.9+)

  Other bugfixes and cleanups:
   - Optimizations and fixes involving NFS v4.1 / pNFS layout handling
   - Optimize lseek(fd, SEEK_CUR, 0) on directories to avoid locking
   - Immediately reschedule writeback when the server replies with an
     error
   - Fix excessive attribute revalidation in nfs_execute_ok()
   - Add error checking to nfs_idmap_prepare_message()
   - Use new vm_fault_t return type
   - Return a delegation when reclaiming one that the server has
     recalled
   - Referrals should inherit proto setting from parents
   - Make rpc_auth_create_args a const
   - Improvements to rpc_iostats tracking
   - Fix a potential reference leak when there is an error processing a
     callback
   - Fix rmdir / mkdir / rename nlink accounting
   - Fix updating inode change attribute
   - Fix error handling in nfsn4_sp4_select_mode()
   - Use an appropriate work queue for direct-write completion
   - Don't busy wait if NFSv4 session draining is interrupted"

* tag 'nfs-for-4.19-1' of git://git.linux-nfs.org/projects/anna/linux-nfs: (54 commits)
  pNFS: Remove unwanted optimisation of layoutget
  pNFS/flexfiles: ff_layout_pg_init_read should exit on error
  pNFS: Treat RECALLCONFLICT like DELAY...
  pNFS: When updating the stateid in layoutreturn, also update the recall range
  NFSv4: Fix a sleep in atomic context in nfs4_callback_sequence()
  NFSv4: Fix locking in pnfs_generic_recover_commit_reqs
  NFSv4: Fix a typo in nfs4_init_channel_attrs()
  NFSv4: Don't busy wait if NFSv4 session draining is interrupted
  NFS recover from destination server reboot for copies
  NFS add a simple sync nfs4_proc_commit after async COPY
  NFS handle COPY ERR_OFFLOAD_NO_REQS
  NFS send OFFLOAD_CANCEL when COPY killed
  NFS export nfs4_async_handle_error
  NFS handle COPY reply CB_OFFLOAD call race
  NFS add support for asynchronous COPY
  NFS COPY xdr handle async reply
  NFS OFFLOAD_CANCEL xdr
  NFS CB_OFFLOAD xdr
  NFS: Use an appropriate work queue for direct-write completion
  NFSv4: Fix error handling in nfs4_sp4_select_mode()
  ...
2018-08-23 16:03:58 -07:00
Linus Torvalds
9157141c95 A mistake on my part caused me to tag my branch 6 commits too early,
missing Chuck's fixes for the problem with callbacks over GSS from
 multi-homed servers, and a smaller fix from Laura Abbott.
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABAgAGBQJbftA8AAoJECebzXlCjuG+QPMQALieEKkX0YoqRhPz5G+RrWFy
 KgOBFAoiRcjFQD6wMt9FzD6qYEZqSJ+I2b+K5N3BkdyDDQu845iD0wK0zBGhMgLm
 7ith85nphIMbe18+5jPorqAsI9RlfBQjiSGw1MEx5dicLQQzTObHL5q+l5jcWna4
 jWS3yUKv1URpOsR1hIryw74ktSnhuH8n//zmntw8aWrCkq3hnXOZK/agtYxZ7Viv
 V3kiQsiNpL2FPRcHN7ejhLUTnRkkuD2iYKrzP/SpTT/JfdNEUXlMhKkAySogNpus
 nvR9X7hwta8Lgrt7PSB9ibFTXtCupmuICg5mbDWy6nXea2NvpB01QhnTzrlX17Eh
 Yfk/18z95b6Qs1v4m3SI8ESmyc6l5dMZozLudtHzifyCqooWZriEhCR1PlQfQ/FJ
 4cYQ8U/qiMiZIJXL7N2wpSoSaWR5bqU1rXen29Np1WEDkiv4Nf5u2fsCXzv0ZH2C
 ReWpNkbnNxsNiKpp4geBZtlcSEU1pk+1PqE0MagTdBV3iptiUHRSP4jR7qLnc0zT
 J1lCvU7Fodnt9vNSxMpt2Jd6XxQ6xtx7n6aMQAiYFnXDs+hP2hPnJVCScnYW3L6R
 2r1sHRKKeoOzCJ2thw+zu4lOwMm7WPkJPWAYfv90reWkiKoy2vG0S9P7wsNGoJuW
 fuEjB2b9pow1Ffynat6q
 =JnLK
 -----END PGP SIGNATURE-----

Merge tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux

Pull nfsd updates from Bruce Fields:
 "Chuck Lever fixed a problem with NFSv4.0 callbacks over GSS from
  multi-homed servers.

  The only new feature is a minor bit of protocol (change_attr_type)
  which the client doesn't even use yet.

  Other than that, various bugfixes and cleanup"

* tag 'nfsd-4.19-1' of git://linux-nfs.org/~bfields/linux: (27 commits)
  sunrpc: Add comment defining gssd upcall API keywords
  nfsd: Remove callback_cred
  nfsd: Use correct credential for NFSv4.0 callback with GSS
  sunrpc: Extract target name into svc_cred
  sunrpc: Enable the kernel to specify the hostname part of service principals
  sunrpc: Don't use stack buffer with scatterlist
  rpc: remove unneeded variable 'ret' in rdma_listen_handler
  nfsd: use true and false for boolean values
  nfsd: constify write_op[]
  fs/nfsd: Delete invalid assignment statements in nfsd4_decode_exchange_id
  NFSD: Handle full-length symlinks
  NFSD: Refactor the generic write vector fill helper
  svcrdma: Clean up Read chunk path
  svcrdma: Avoid releasing a page in svc_xprt_release()
  nfsd: Mark expected switch fall-through
  sunrpc: remove redundant variables 'checksumlen','blocksize' and 'data'
  nfsd: fix leaked file lock with nfs exported overlayfs
  nfsd: don't advertise a SCSI layout for an unsupported request_queue
  nfsd: fix corrupted reply to badly ordered compound
  nfsd: clarify check_op_ordering
  ...
2018-08-23 16:00:10 -07:00
Linus Torvalds
1290290c92 Second merge window update
- Switch SMC over to rdma_get_gid_attr and remove the compat
 
 - Fix a crash in HFI1 with some BIOS's
 
 - Fix a randconfig failure
 -----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEEfB7FMLh+8QxL+6i3OG33FX4gmxoFAlt9z6kACgkQOG33FX4g
 mxqLag//Zr3RrNSmswNjl+6PrWeg88GQstAm4gQps8mR03eI8Ha5fAS6MWuq1U8+
 LGP8bbBtE6fHSCkPSnN/owReZSTcTIbgWB021vBIUKUpytmxie+ugMVXIHwm6cIM
 E3pgPsRngTrqtuRxyOhyoaMKSjRCbSdMe43SWf42/S+9SxDkmXtBzymYT119bu/A
 eqU7z/HuRqcVxrgOinhPjcQkEFdRYUaRk+g6jzaQmZl8IjKHp/d3BSzmzbE6txMt
 txP5hu/msz4xkyusX/I6ZS3do+4WMIAMNhq9bVMo5pNu2RG1eo0LLmHgjsDsFI3u
 ZTaZQj6eYWlbwWRiOU+2Frf4kH4CGAyxlFVCr4yEvfe2VvEbZsLZbYuF1iasnqrP
 3Mrbh7Esj4MfDT/QbunnoX+49X/Rzma/a+tu6hqAhnGjvYsaDyy8WSNwHMs65f7m
 sii58arik9exJHVKMXZ5FJJydBYBcPt5IlkUdoTTZ1eLkoc2B9lG6lKZbWxUYNPx
 uS6XOyR4fcsmLfKfEvQuLZznWi3Jo4R17QKlL5ulqHrZUJV4PZ4Eiu5izScZ4ci/
 xcPGz1a4MT7Zg3V611qGSGt7zYDdD5+R+k/sDIOleCFIpueFzzh/I1oLkSIoXFIx
 DmwqCdRGeB5F7ZhfArFTezgiaMe55SPJKY5PAiThxzoostUGOfw=
 =6z1N
 -----END PGP SIGNATURE-----

Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma

Pull more rdma updates from Jason Gunthorpe:
 "This is the SMC cleanup promised, a randconfig regression fix, and
  kernel oops fix.

  Summary:

   - Switch SMC over to rdma_get_gid_attr and remove the compat

   - Fix a crash in HFI1 with some BIOS's

   - Fix a randconfig failure"

* tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
  IB/ucm: fix UCM link error
  IB/hfi1: Invalid NUMA node information can cause a divide by zero
  RDMA/smc: Replace ib_query_gid with rdma_get_gid_attr
2018-08-23 15:34:48 -07:00
Martin Willi
c1dc291205 netfilter: xt_cluster: add dependency on conntrack module
The cluster match requires conntrack for matching packets. If the
netns does not have conntrack hooks registered, the match does not
work at all.

Implicitly load the conntrack hook for the family, exactly as many
other extensions do. This ensures that the match works even if the
hooks have not been registered by other means.

Signed-off-by: Martin Willi <martin@strongswan.org>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-23 20:26:53 +02:00
Yue Haibing
444c826315 netfilter: conntrack: remove duplicated include from nf_conntrack_proto_udp.c
Remove duplicated include.

Fixes: c779e84960 ("netfilter: conntrack: remove get_timeout() indirection")
Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-23 20:24:26 +02:00
Hangbin Liu
d23c4b6336 net/ipv6: init ip6 anycast rt->dst.input as ip6_input
Commit 6edb3c96a5 ("net/ipv6: Defer initialization of dst to data path")
forgot to handle anycast route and init anycast rt->dst.input to ip6_forward.
Fix it by setting anycast rt->dst.input back to ip6_input.

Fixes: 6edb3c96a5 ("net/ipv6: Defer initialization of dst to data path")
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-22 21:48:37 -07:00
Kevin Yang
8e995bf14f tcp_bbr: apply PROBE_RTT cwnd cap even if acked==0
This commit fixes a corner case where TCP BBR would enter PROBE_RTT
mode but not reduce its cwnd. If a TCP receiver ACKed less than one
full segment, the number of delivered/acked packets was 0, so that
bbr_set_cwnd() would short-circuit and exit early, without cutting
cwnd to the value we want for PROBE_RTT.

The fix is to instead make sure that even when 0 full packets are
ACKed, we do apply all the appropriate caps, including the cap that
applies in PROBE_RTT mode.

Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-22 21:45:32 -07:00
Kevin Yang
5490b32dce tcp_bbr: in restart from idle, see if we should exit PROBE_RTT
This patch fix the case where BBR does not exit PROBE_RTT mode when
it restarts from idle. When BBR restarts from idle and if BBR is in
PROBE_RTT mode, BBR should check if it's time to exit PROBE_RTT. If
yes, then BBR should exit PROBE_RTT mode and restore the cwnd to its
full value.

Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Reviewed-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-22 21:45:32 -07:00
Kevin Yang
fb99886224 tcp_bbr: add bbr_check_probe_rtt_done() helper
This patch add a helper function bbr_check_probe_rtt_done() to
  1. check the condition to see if bbr should exit probe_rtt mode;
  2. process the logic of exiting probe_rtt mode.

Fixes: 0f8782ea14 ("tcp_bbr: add BBR congestion control")
Signed-off-by: Kevin Yang <yyd@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-22 21:45:32 -07:00
Eric Dumazet
431280eebe ipv4: tcp: send zero IPID for RST and ACK sent in SYN-RECV and TIME-WAIT state
tcp uses per-cpu (and per namespace) sockets (net->ipv4.tcp_sk) internally
to send some control packets.

1) RST packets, through tcp_v4_send_reset()
2) ACK packets in SYN-RECV and TIME-WAIT state, through tcp_v4_send_ack()

These packets assert IP_DF, and also use the hashed IP ident generator
to provide an IPv4 ID number.

Geoff Alexander reported this could be used to build off-path attacks.

These packets should not be fragmented, since their size is smaller than
IPV4_MIN_MTU. Only some tunneled paths could eventually have to fragment,
regardless of inner IPID.

We really can use zero IPID, to address the flaw, and as a bonus,
avoid a couple of atomic operations in ip_idents_reserve()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Geoff Alexander <alexandg@cs.unm.edu>
Tested-by: Geoff Alexander <alexandg@cs.unm.edu>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-22 21:42:58 -07:00
Cong Wang
e500c6d349 addrconf: reduce unnecessary atomic allocations
All the 3 callers of addrconf_add_mroute() assert RTNL
lock, they don't take any additional lock either, so
it is safe to convert it to GFP_KERNEL.

Same for sit_add_v4_addrs().

Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Reviewed-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-22 21:42:07 -07:00
Toke Høiland-Jørgensen
93cfb6c176 sch_cake: Fix TC filter flow override and expand it to hosts as well
The TC filter flow mapping override completely skipped the call to
cake_hash(); however that meant that the internal state was not being
updated, which ultimately leads to deadlocks in some configurations. Fix
that by passing the overridden flow ID into cake_hash() instead so it can
react appropriately.

In addition, the major number of the class ID can now be set to override
the host mapping in host isolation mode. If both host and flow are
overridden (or if the respective modes are disabled), flow dissection and
hashing will be skipped entirely; otherwise, the hashing will be kept for
the portions that are not set by the filter.

Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-22 21:39:45 -07:00
Samuel Mendoza-Jonas
3d0371b313 net/ncsi: Fixup .dumpit message flags and ID check in Netlink handler
The ncsi_pkg_info_all_nl() .dumpit handler is missing the NLM_F_MULTI
flag, causing additional package information after the first to be lost.
Also fixup a sanity check in ncsi_write_package_info() to reject out of
range package IDs.

Signed-off-by: Samuel Mendoza-Jonas <sam@mendozajonas.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-22 21:39:08 -07:00
Chuck Lever
108b833cde sunrpc: Add comment defining gssd upcall API keywords
During review, it was found that the target, service, and srchost
keywords are easily conflated. Add an explainer.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-22 18:32:07 -04:00
Chuck Lever
9abdda5dda sunrpc: Extract target name into svc_cred
NFSv4.0 callback needs to know the GSS target name the client used
when it established its lease. That information is available from
the GSS context created by gssproxy. Make it available in each
svc_cred.

Note this will also give us access to the real target service
principal name (which is typically "nfs", but spec does not require
that).

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-22 18:32:07 -04:00
Chuck Lever
a1a237775e sunrpc: Enable the kernel to specify the hostname part of service principals
A multi-homed NFS server may have more than one "nfs" key in its
keytab. Enable the kernel to pick the key it wants as a machine
credential when establishing a GSS context.

This is useful for GSS-protected NFSv4.0 callbacks, which are
required by RFC 7530 S3.3.3 to use the same principal as the service
principal the client used when establishing its lease.

A complementary modification to rpc.gssd is required to fully enable
this feature.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-22 18:32:07 -04:00
Laura Abbott
44090cc876 sunrpc: Don't use stack buffer with scatterlist
Fedora got a bug report from NFS:

kernel BUG at include/linux/scatterlist.h:143!
...
RIP: 0010:sg_init_one+0x7d/0x90
..
  make_checksum+0x4e7/0x760 [rpcsec_gss_krb5]
  gss_get_mic_kerberos+0x26e/0x310 [rpcsec_gss_krb5]
  gss_marshal+0x126/0x1a0 [auth_rpcgss]
  ? __local_bh_enable_ip+0x80/0xe0
  ? call_transmit_status+0x1d0/0x1d0 [sunrpc]
  call_transmit+0x137/0x230 [sunrpc]
  __rpc_execute+0x9b/0x490 [sunrpc]
  rpc_run_task+0x119/0x150 [sunrpc]
  nfs4_run_exchange_id+0x1bd/0x250 [nfsv4]
  _nfs4_proc_exchange_id+0x2d/0x490 [nfsv4]
  nfs41_discover_server_trunking+0x1c/0xa0 [nfsv4]
  nfs4_discover_server_trunking+0x80/0x270 [nfsv4]
  nfs4_init_client+0x16e/0x240 [nfsv4]
  ? nfs_get_client+0x4c9/0x5d0 [nfs]
  ? _raw_spin_unlock+0x24/0x30
  ? nfs_get_client+0x4c9/0x5d0 [nfs]
  nfs4_set_client+0xb2/0x100 [nfsv4]
  nfs4_create_server+0xff/0x290 [nfsv4]
  nfs4_remote_mount+0x28/0x50 [nfsv4]
  mount_fs+0x3b/0x16a
  vfs_kern_mount.part.35+0x54/0x160
  nfs_do_root_mount+0x7f/0xc0 [nfsv4]
  nfs4_try_mount+0x43/0x70 [nfsv4]
  ? get_nfs_version+0x21/0x80 [nfs]
  nfs_fs_mount+0x789/0xbf0 [nfs]
  ? pcpu_alloc+0x6ca/0x7e0
  ? nfs_clone_super+0x70/0x70 [nfs]
  ? nfs_parse_mount_options+0xb40/0xb40 [nfs]
  mount_fs+0x3b/0x16a
  vfs_kern_mount.part.35+0x54/0x160
  do_mount+0x1fd/0xd50
  ksys_mount+0xba/0xd0
  __x64_sys_mount+0x21/0x30
  do_syscall_64+0x60/0x1f0
  entry_SYSCALL_64_after_hwframe+0x49/0xbe

This is BUG_ON(!virt_addr_valid(buf)) triggered by using a stack
allocated buffer with a scatterlist. Convert the buffer for
rc4salt to be dynamically allocated instead.

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=1615258
Signed-off-by: Laura Abbott <labbott@redhat.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-22 18:32:07 -04:00
John Fastabend
67db7cd249 tls: possible hang when do_tcp_sendpages hits sndbuf is full case
Currently, the lower protocols sk_write_space handler is not called if
TLS is sending a scatterlist via  tls_push_sg. However, normally
tls_push_sg calls do_tcp_sendpage, which may be under memory pressure,
that in turn may trigger a wait via sk_wait_event. Typically, this
happens when the in-flight bytes exceed the sdnbuf size. In the normal
case when enough ACKs are received sk_write_space() will be called and
the sk_wait_event will be woken up allowing it to send more data
and/or return to the user.

But, in the TLS case because the sk_write_space() handler does not
wake up the events the above send will wait until the sndtimeo is
exceeded. By default this is MAX_SCHEDULE_TIMEOUT so it look like a
hang to the user (especially this impatient user). To fix this pass
the sk_write_space event to the lower layers sk_write_space event
which in the TCP case will wake any pending events.

I observed the above while integrating sockmap and ktls. It
initially appeared as test_sockmap (modified to use ktls) occasionally
hanging. To reliably reproduce this reduce the sndbuf size and stress
the tls layer by sending many 1B sends. This results in every byte
needing a header and each byte individually being sent to the crypto
layer.

Signed-off-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-22 21:57:14 +02:00
Matthew Wilcox
6e77cc4710 Convert net_namespace to new IDA API
Signed-off-by: Matthew Wilcox <willy@infradead.org>
2018-08-21 23:54:18 -04:00
Prashant Bhole
96c26e0458 xsk: fix return value of xdp_umem_assign_dev()
s/ENOTSUPP/EOPNOTSUPP/ in function umem_assign_dev().
This function's return value is directly returned by xsk_bind().
EOPNOTSUPP is bind()'s possible return value.

Fixes: f734607e81 ("xsk: refactor xdp_umem_assign_dev()")
Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
Acked-by: Song Liu <songliubraving@fb.com>
Acked-by: Björn Töpel <bjorn.topel@intel.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-21 22:06:53 +02:00
Cong Wang
5ffe57da29 act_ife: fix a potential deadlock
use_all_metadata() acquires read_lock(&ife_mod_lock), then calls
add_metainfo() which calls find_ife_oplist() which acquires the same
lock again. Deadlock!

Introduce __add_metainfo() which accepts struct tcf_meta_ops *ops
as an additional parameter and let its callers to decide how
to find it. For use_all_metadata(), it already has ops, no
need to find it again, just call __add_metainfo() directly.

And, as ife_mod_lock is only needed for find_ife_oplist(),
this means we can make non-atomic allocation for populate_metalist()
now.

Fixes: 817e9f2c5c ("act_ife: acquire ife_mod_lock before reading ifeoplist")
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 12:45:45 -07:00
Cong Wang
4e407ff5cd act_ife: move tcfa_lock down to where necessary
The only time we need to take tcfa_lock is when adding
a new metainfo to an existing ife->metalist. We don't need
to take tcfa_lock so early and so broadly in tcf_ife_init().

This means we can always take ife_mod_lock first, avoid the
reverse locking ordering warning as reported by Vlad.

Reported-by: Vlad Buslov <vladbu@mellanox.com>
Tested-by: Vlad Buslov <vladbu@mellanox.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 12:45:45 -07:00
Cong Wang
8ce5be1c89 Revert "net: sched: act_ife: disable bh when taking ife_mod_lock"
This reverts commit 42c625a486 ("net: sched: act_ife: disable bh
when taking ife_mod_lock"), because what ife_mod_lock protects
is absolutely not touched in rate est timer BH context, they have
no race.

A better fix is following up.

Cc: Vlad Buslov <vladbu@mellanox.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 12:45:45 -07:00
Cong Wang
244cd96adb net_sched: remove list_head from tc_action
After commit 90b73b77d0, list_head is no longer needed.
Now we just need to convert the list iteration to array
iteration for drivers.

Fixes: 90b73b77d0 ("net: sched: change action API to use array of pointers to actions")
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 12:45:44 -07:00
Cong Wang
7d485c451f net_sched: remove unused tcf_idr_check()
tcf_idr_check() is replaced by tcf_idr_check_alloc(),
and __tcf_idr_check() now can be folded into tcf_idr_search().

Fixes: 0190c1d452 ("net: sched: atomically check-allocate action")
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 12:45:44 -07:00
Cong Wang
b144e7ec51 net_sched: remove unused parameter for tcf_action_delete()
Fixes: 16af606739 ("net: sched: implement reference counted action release")
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 12:45:44 -07:00
Cong Wang
97a3f84f2c net_sched: remove unnecessary ops->delete()
All ops->delete() wants is getting the tn->idrinfo, but we already
have tc_action before calling ops->delete(), and tc_action has
a pointer ->idrinfo.

More importantly, each type of action does the same thing, that is,
just calling tcf_idr_delete_index().

So it can be just removed.

Fixes: b409074e66 ("net: sched: add 'delete' function to action ops")
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 12:45:44 -07:00
Cong Wang
edfaf94fa7 net_sched: improve and refactor tcf_action_put_many()
tcf_action_put_many() is mostly called to clean up actions on
failure path, but tcf_action_put_many(&actions[acts_deleted]) is
used in the ugliest way: it passes a slice of the array and
uses an additional NULL at the end to avoid out-of-bound
access.

acts_deleted is completely unnecessary since we can teach
tcf_action_put_many() scan the whole array and checks against
NULL pointer. Which also means tcf_action_delete() should
set deleted action pointers to NULL to avoid double free.

Fixes: 90b73b77d0 ("net: sched: change action API to use array of pointers to actions")
Cc: Jiri Pirko <jiri@mellanox.com>
Cc: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 12:45:44 -07:00
Yue Haibing
c27f1e2e9f rds: tcp: remove duplicated include from tcp.c
Remove duplicated include.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 09:50:53 -07:00
Yue Haibing
093dee661d sch_cake: Remove unused including <linux/version.h>
Remove including <linux/version.h> that don't need it.

Signed-off-by: Yue Haibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-21 09:50:53 -07:00
Linus Torvalds
0a78ac4b9b The main things are support for cephx v2 authentication protocol and
basic support for rbd images within namespaces (myself).  Also included
 y2038 conversion patches from Arnd, a pile of miscellaneous fixes from
 Chengguang and Zheng's feature bit infrastructure for the filesystem.
 -----BEGIN PGP SIGNATURE-----
 
 iQFHBAABCAAxFiEEydHwtzie9C7TfviiSn/eOAIR84sFAlt62CkTHGlkcnlvbW92
 QGdtYWlsLmNvbQAKCRBKf944AhHzizfhB/0c/rz6frunc6EcZMWuBNzlOIOktJ/m
 MEbPGjCxMAsmidO1rqHHYF4iEN5hr+3AWTbtIL2m6wkqYVdg3FjmNaAYB27AdQMG
 kH9bLfrKIew72/NZqXfm25yjY/86kIt8t91kay4Lchc97tSYhnFSnku7iAX2HTND
 TMhq/1O/GvEyw/RmqnenJEQqFJvKnfgPPQm6W8sM2bH0T5j+EXmDT/Rv+90LogFR
 J4+pZkHqDfvyMb1WJ5MkumohytbRVzRNKcMpOvjquJSqUgtgZa2JdrIsypDqSNKY
 nUT6jGGlxoSbHCqRwDJoFEJOlh5A9RwKqYxNuM2a/vs9u7HpvdCK/Iah
 =AtgY
 -----END PGP SIGNATURE-----

Merge tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-client

Pull ceph updates from Ilya Dryomov:
 "The main things are support for cephx v2 authentication protocol and
  basic support for rbd images within namespaces (myself).

  Also included are y2038 conversion patches from Arnd, a pile of
  miscellaneous fixes from Chengguang and Zheng's feature bit
  infrastructure for the filesystem"

* tag 'ceph-for-4.19-rc1' of git://github.com/ceph/ceph-client: (40 commits)
  ceph: don't drop message if it contains more data than expected
  ceph: support cephfs' own feature bits
  crush: fix using plain integer as NULL warning
  libceph: remove unnecessary non NULL check for request_key
  ceph: refactor error handling code in ceph_reserve_caps()
  ceph: refactor ceph_unreserve_caps()
  ceph: change to void return type for __do_request()
  ceph: compare fsc->max_file_size and inode->i_size for max file size limit
  ceph: add additional size check in ceph_setattr()
  ceph: add additional offset check in ceph_write_iter()
  ceph: add additional range check in ceph_fallocate()
  ceph: add new field max_file_size in ceph_fs_client
  libceph: weaken sizeof check in ceph_x_verify_authorizer_reply()
  libceph: check authorizer reply/challenge length before reading
  libceph: implement CEPHX_V2 calculation mode
  libceph: add authorizer challenge
  libceph: factor out encrypt_authorizer()
  libceph: factor out __ceph_x_decrypt()
  libceph: factor out __prepare_write_connect()
  libceph: store ceph_auth_handshake pointer in ceph_connection
  ...
2018-08-20 18:26:55 -07:00
David Ahern
80f1a0f4e0 net/ipv6: Put lwtstate when destroying fib6_info
Prior to the introduction of fib6_info lwtstate was managed by the dst
code. With fib6_info releasing lwtstate needs to be done when the struct
is freed.

Fixes: 93531c6743 ("net/ipv6: separate handling of FIB entries from dst based routes")
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-20 18:18:40 -07:00
Johannes Berg
8a54d8fc16 cfg80211: remove division by size of sizeof(struct ieee80211_wmm_rule)
Pointer arithmetic already adjusts by the size of the struct,
so the sizeof() calculation is wrong. This is basically the
same as Colin King's patch for similar code in the iwlwifi
driver.

Fixes: 230ebaa189 ("cfg80211: read wmm rules from regulatory database")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-20 09:36:33 +02:00
Haishuang Yan
9c86336c15 ip6_vti: fix a null pointer deference when destroy vti6 tunnel
If load ip6_vti module and create a network namespace when set
fb_tunnels_only_for_init_net to 1, then exit the namespace will
cause following crash:

[ 6601.677036] BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
[ 6601.679057] PGD 8000000425eca067 P4D 8000000425eca067 PUD 424292067 PMD 0
[ 6601.680483] Oops: 0000 [#1] SMP PTI
[ 6601.681223] CPU: 7 PID: 93 Comm: kworker/u16:1 Kdump: loaded Tainted: G            E     4.18.0+ #3
[ 6601.683153] Hardware name: Fedora Project OpenStack Nova, BIOS seabios-1.7.5-11.el7 04/01/2014
[ 6601.684919] Workqueue: netns cleanup_net
[ 6601.685742] RIP: 0010:vti6_exit_batch_net+0x87/0xd0 [ip6_vti]
[ 6601.686932] Code: 7b 08 48 89 e6 e8 b9 ea d3 dd 48 8b 1b 48 85 db 75 ec 48 83 c5 08 48 81 fd 00 01 00 00 75 d5 49 8b 84 24 08 01 00 00 48 89 e6 <48> 8b 78 08 e8 90 ea d3 dd 49 8b 45 28 49 39 c6 4c 8d 68 d8 75 a1
[ 6601.690735] RSP: 0018:ffffa897c2737de0 EFLAGS: 00010246
[ 6601.691846] RAX: 0000000000000000 RBX: 0000000000000000 RCX: dead000000000200
[ 6601.693324] RDX: 0000000000000015 RSI: ffffa897c2737de0 RDI: ffffffff9f2ea9e0
[ 6601.694824] RBP: 0000000000000100 R08: 0000000000000000 R09: 0000000000000000
[ 6601.696314] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8dc323c07e00
[ 6601.697812] R13: ffff8dc324a63100 R14: ffffa897c2737e30 R15: ffffa897c2737e30
[ 6601.699345] FS:  0000000000000000(0000) GS:ffff8dc33fdc0000(0000) knlGS:0000000000000000
[ 6601.701068] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 6601.702282] CR2: 0000000000000008 CR3: 0000000424966002 CR4: 00000000001606e0
[ 6601.703791] Call Trace:
[ 6601.704329]  cleanup_net+0x1b4/0x2c0
[ 6601.705268]  process_one_work+0x16c/0x370
[ 6601.706145]  worker_thread+0x49/0x3e0
[ 6601.706942]  kthread+0xf8/0x130
[ 6601.707626]  ? rescuer_thread+0x340/0x340
[ 6601.708476]  ? kthread_bind+0x10/0x10
[ 6601.709266]  ret_from_fork+0x35/0x40

Reproduce:
modprobe ip6_vti
echo 1 > /proc/sys/net/core/fb_tunnels_only_for_init_net
unshare -n
exit

This because ip6n->tnls_wc[0] point to fallback device in default, but
in non-default namespace, ip6n->tnls_wc[0] will be NULL, so add the NULL
check comparatively.

Fixes: e2948e5af8 ("ip6_vti: fix creating fallback tunnel device for vti6")
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19 19:55:58 -07:00
Linus Torvalds
2ad0d52699 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) Fix races in IPVS, from Tan Hu.

 2) Missing unbind in matchall classifier, from Hangbin Liu.

 3) Missing act_ife action release, from Vlad Buslov.

 4) Cure lockdep splats in ila, from Cong Wang.

 5) veth queue leak on link delete, from Toshiaki Makita.

 6) Disable isdn's IIOCDBGVAR ioctl, it exposes kernel addresses. From
    Kees Cook.

 7) RCU usage fixup in XDP, from Tariq Toukan.

 8) Two TCP ULP fixes from Daniel Borkmann.

 9) r8169 needs REALTEK_PHY as a Kconfig dependency, from Heiner
    Kallweit.

10) Always take tcf_lock with BH disabled, otherwise we can deadlock
    with rate estimator code paths. From Vlad Buslov.

11) Don't use MSI-X on RTL8106e r8169 chips, they don't resume properly.
    From Jian-Hong Pan.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (41 commits)
  ip6_vti: fix creating fallback tunnel device for vti6
  ip_vti: fix a null pointer deferrence when create vti fallback tunnel
  r8169: don't use MSI-X on RTL8106e
  net: lan743x_ptp: convert to ktime_get_clocktai_ts64
  net: sched: always disable bh when taking tcf_lock
  ip6_vti: simplify stats handling in vti6_xmit
  bpf: fix redirect to map under tail calls
  r8169: add missing Kconfig dependency
  tools/bpf: fix bpf selftest test_cgroup_storage failure
  bpf, sockmap: fix sock_map_ctx_update_elem race with exist/noexist
  bpf, sockmap: fix map elem deletion race with smap_stop_sock
  bpf, sockmap: fix leakage of smap_psock_map_entry
  tcp, ulp: fix leftover icsk_ulp_ops preventing sock from reattach
  tcp, ulp: add alias for all ulp modules
  bpf: fix a rcu usage warning in bpf_prog_array_copy_core()
  samples/bpf: all XDP samples should unload xdp/bpf prog on SIGTERM
  net/xdp: Fix suspicious RCU usage warning
  net/mlx5e: Delete unneeded function argument
  Documentation: networking: ti-cpsw: correct cbs parameters for Eth1 100Mb
  isdn: Disable IIOCDBGVAR
  ...
2018-08-19 11:51:45 -07:00
Haishuang Yan
e2948e5af8 ip6_vti: fix creating fallback tunnel device for vti6
When set fb_tunnels_only_for_init_net to 1, don't create fallback tunnel
device for vti6 when a new namespace is created.

Tested:
[root@builder2 ~]# modprobe ip6_tunnel
[root@builder2 ~]# modprobe ip6_vti
[root@builder2 ~]# echo 1 > /proc/sys/net/core/fb_tunnels_only_for_init_net
[root@builder2 ~]# unshare -n
[root@builder2 ~]# ip link
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN mode DEFAULT group
default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19 11:26:39 -07:00
Haishuang Yan
cd1aa9c2c6 ip_vti: fix a null pointer deferrence when create vti fallback tunnel
After set fb_tunnels_only_for_init_net to 1, the itn->fb_tunnel_dev will
be NULL and will cause following crash:

[ 2742.849298] BUG: unable to handle kernel NULL pointer dereference at 0000000000000941
[ 2742.851380] PGD 800000042c21a067 P4D 800000042c21a067 PUD 42aaed067 PMD 0
[ 2742.852818] Oops: 0002 [#1] SMP PTI
[ 2742.853570] CPU: 7 PID: 2484 Comm: unshare Kdump: loaded Not tainted 4.18.0-rc8+ #2
[ 2742.855163] Hardware name: Fedora Project OpenStack Nova, BIOS seabios-1.7.5-11.el7 04/01/2014
[ 2742.856970] RIP: 0010:vti_init_net+0x3a/0x50 [ip_vti]
[ 2742.858034] Code: 90 83 c0 48 c7 c2 20 a1 83 c0 48 89 fb e8 6e 3b f6 ff 85 c0 75 22 8b 0d f4 19 00 00 48 8b 93 00 14 00 00 48 8b 14 ca 48 8b 12 <c6> 82 41 09 00 00 04 c6 82 38 09 00 00 45 5b c3 66 0f 1f 44 00 00
[ 2742.861940] RSP: 0018:ffff9be28207fde0 EFLAGS: 00010246
[ 2742.863044] RAX: 0000000000000000 RBX: ffff8a71ebed4980 RCX: 0000000000000013
[ 2742.864540] RDX: 0000000000000000 RSI: 0000000000000013 RDI: ffff8a71ebed4980
[ 2742.866020] RBP: ffff8a71ea717000 R08: ffffffffc083903c R09: ffff8a71ea717000
[ 2742.867505] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a71ebed4980
[ 2742.868987] R13: 0000000000000013 R14: ffff8a71ea5b49c0 R15: 0000000000000000
[ 2742.870473] FS:  00007f02266c9740(0000) GS:ffff8a71ffdc0000(0000) knlGS:0000000000000000
[ 2742.872143] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 2742.873340] CR2: 0000000000000941 CR3: 000000042bc20006 CR4: 00000000001606e0
[ 2742.874821] Call Trace:
[ 2742.875358]  ops_init+0x38/0xf0
[ 2742.876078]  setup_net+0xd9/0x1f0
[ 2742.876789]  copy_net_ns+0xb7/0x130
[ 2742.877538]  create_new_namespaces+0x11a/0x1d0
[ 2742.878525]  unshare_nsproxy_namespaces+0x55/0xa0
[ 2742.879526]  ksys_unshare+0x1a7/0x330
[ 2742.880313]  __x64_sys_unshare+0xe/0x20
[ 2742.881131]  do_syscall_64+0x5b/0x180
[ 2742.881933]  entry_SYSCALL_64_after_hwframe+0x44/0xa9

Reproduce:
echo 1 > /proc/sys/net/core/fb_tunnels_only_for_init_net
modprobe ip_vti
unshare -n

Fixes: 79134e6ce2 ("net: do not create fallback tunnels for non-default namespaces")
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19 11:26:39 -07:00
Vlad Buslov
653cd284a8 net: sched: always disable bh when taking tcf_lock
Recently, ops->init() and ops->dump() of all actions were modified to
always obtain tcf_lock when accessing private action state. Actions that
don't depend on tcf_lock for synchronization with their data path use
non-bh locking API. However, tcf_lock is also used to protect rate
estimator stats in softirq context by timer callback.

Change ops->init() and ops->dump() of all actions to disable bh when using
tcf_lock to prevent deadlock reported by following lockdep warning:

[  105.470398] ================================
[  105.475014] WARNING: inconsistent lock state
[  105.479628] 4.18.0-rc8+ #664 Not tainted
[  105.483897] --------------------------------
[  105.488511] inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
[  105.494871] swapper/16/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
[  105.500449] 00000000f86c012e (&(&p->tcfa_lock)->rlock){+.?.}, at: est_fetch_counters+0x3c/0xa0
[  105.509696] {SOFTIRQ-ON-W} state was registered at:
[  105.514925]   _raw_spin_lock+0x2c/0x40
[  105.519022]   tcf_bpf_init+0x579/0x820 [act_bpf]
[  105.523990]   tcf_action_init_1+0x4e4/0x660
[  105.528518]   tcf_action_init+0x1ce/0x2d0
[  105.532880]   tcf_exts_validate+0x1d8/0x200
[  105.537416]   fl_change+0x55a/0x268b [cls_flower]
[  105.542469]   tc_new_tfilter+0x748/0xa20
[  105.546738]   rtnetlink_rcv_msg+0x56a/0x6d0
[  105.551268]   netlink_rcv_skb+0x18d/0x200
[  105.555628]   netlink_unicast+0x2d0/0x370
[  105.559990]   netlink_sendmsg+0x3b9/0x6a0
[  105.564349]   sock_sendmsg+0x6b/0x80
[  105.568271]   ___sys_sendmsg+0x4a1/0x520
[  105.572547]   __sys_sendmsg+0xd7/0x150
[  105.576655]   do_syscall_64+0x72/0x2c0
[  105.580757]   entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  105.586243] irq event stamp: 489296
[  105.590084] hardirqs last  enabled at (489296): [<ffffffffb507e639>] _raw_spin_unlock_irq+0x29/0x40
[  105.599765] hardirqs last disabled at (489295): [<ffffffffb507e745>] _raw_spin_lock_irq+0x15/0x50
[  105.609277] softirqs last  enabled at (489292): [<ffffffffb413a6a3>] irq_enter+0x83/0xa0
[  105.618001] softirqs last disabled at (489293): [<ffffffffb413a800>] irq_exit+0x140/0x190
[  105.626813]
               other info that might help us debug this:
[  105.633976]  Possible unsafe locking scenario:

[  105.640526]        CPU0
[  105.643325]        ----
[  105.646125]   lock(&(&p->tcfa_lock)->rlock);
[  105.650747]   <Interrupt>
[  105.653717]     lock(&(&p->tcfa_lock)->rlock);
[  105.658514]
                *** DEADLOCK ***

[  105.665349] 1 lock held by swapper/16/0:
[  105.669629]  #0: 00000000a640ad99 ((&est->timer)){+.-.}, at: call_timer_fn+0x10b/0x550
[  105.678200]
               stack backtrace:
[  105.683194] CPU: 16 PID: 0 Comm: swapper/16 Not tainted 4.18.0-rc8+ #664
[  105.690249] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017
[  105.698626] Call Trace:
[  105.701421]  <IRQ>
[  105.703791]  dump_stack+0x92/0xeb
[  105.707461]  print_usage_bug+0x336/0x34c
[  105.711744]  mark_lock+0x7c9/0x980
[  105.715500]  ? print_shortest_lock_dependencies+0x2e0/0x2e0
[  105.721424]  ? check_usage_forwards+0x230/0x230
[  105.726315]  __lock_acquire+0x923/0x26f0
[  105.730597]  ? debug_show_all_locks+0x240/0x240
[  105.735478]  ? mark_lock+0x493/0x980
[  105.739412]  ? check_chain_key+0x140/0x1f0
[  105.743861]  ? __lock_acquire+0x836/0x26f0
[  105.748323]  ? lock_acquire+0x12e/0x290
[  105.752516]  lock_acquire+0x12e/0x290
[  105.756539]  ? est_fetch_counters+0x3c/0xa0
[  105.761084]  _raw_spin_lock+0x2c/0x40
[  105.765099]  ? est_fetch_counters+0x3c/0xa0
[  105.769633]  est_fetch_counters+0x3c/0xa0
[  105.773995]  est_timer+0x87/0x390
[  105.777670]  ? est_fetch_counters+0xa0/0xa0
[  105.782210]  ? lock_acquire+0x12e/0x290
[  105.786410]  call_timer_fn+0x161/0x550
[  105.790512]  ? est_fetch_counters+0xa0/0xa0
[  105.795055]  ? del_timer_sync+0xd0/0xd0
[  105.799249]  ? __lock_is_held+0x93/0x110
[  105.803531]  ? mark_held_locks+0x20/0xe0
[  105.807813]  ? _raw_spin_unlock_irq+0x29/0x40
[  105.812525]  ? est_fetch_counters+0xa0/0xa0
[  105.817069]  ? est_fetch_counters+0xa0/0xa0
[  105.821610]  run_timer_softirq+0x3c4/0x9f0
[  105.826064]  ? lock_acquire+0x12e/0x290
[  105.830257]  ? __bpf_trace_timer_class+0x10/0x10
[  105.835237]  ? __lock_is_held+0x25/0x110
[  105.839517]  __do_softirq+0x11d/0x7bf
[  105.843542]  irq_exit+0x140/0x190
[  105.847208]  smp_apic_timer_interrupt+0xac/0x3b0
[  105.852182]  apic_timer_interrupt+0xf/0x20
[  105.856628]  </IRQ>
[  105.859081] RIP: 0010:cpuidle_enter_state+0xd8/0x4d0
[  105.864395] Code: 46 ff 48 89 44 24 08 0f 1f 44 00 00 31 ff e8 cf ec 46 ff 80 7c 24 07 00 0f 85 1d 02 00 00 e8 9f 90 4b ff fb 66 0f 1f 44 00 00 <4c> 8b 6c 24 08 4d 29 fd 0f 80 36 03 00 00 4c 89 e8 48 ba cf f7 53
[  105.884288] RSP: 0018:ffff8803ad94fd20 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff13
[  105.892494] RAX: 0000000000000000 RBX: ffffe8fb300829c0 RCX: ffffffffb41e19e1
[  105.899988] RDX: 0000000000000007 RSI: dffffc0000000000 RDI: ffff8803ad9358ac
[  105.907503] RBP: ffffffffb6636300 R08: 0000000000000004 R09: 0000000000000000
[  105.914997] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000004
[  105.922487] R13: ffffffffb6636140 R14: ffffffffb66362d8 R15: 000000188d36091b
[  105.929988]  ? trace_hardirqs_on_caller+0x141/0x2d0
[  105.935232]  do_idle+0x28e/0x320
[  105.938817]  ? arch_cpu_idle_exit+0x40/0x40
[  105.943361]  ? mark_lock+0x8c1/0x980
[  105.947295]  ? _raw_spin_unlock_irqrestore+0x32/0x60
[  105.952619]  cpu_startup_entry+0xc2/0xd0
[  105.956900]  ? cpu_in_idle+0x20/0x20
[  105.960830]  ? _raw_spin_unlock_irqrestore+0x32/0x60
[  105.966146]  ? trace_hardirqs_on_caller+0x141/0x2d0
[  105.971391]  start_secondary+0x2b5/0x360
[  105.975669]  ? set_cpu_sibling_map+0x1330/0x1330
[  105.980654]  secondary_startup_64+0xa5/0xb0

Taking tcf_lock in sample action with bh disabled causes lockdep to issue a
warning regarding possible irq lock inversion dependency between tcf_lock,
and psample_groups_lock that is taken when holding tcf_lock in sample init:

[  162.108959]  Possible interrupt unsafe locking scenario:

[  162.116386]        CPU0                    CPU1
[  162.121277]        ----                    ----
[  162.126162]   lock(psample_groups_lock);
[  162.130447]                                local_irq_disable();
[  162.136772]                                lock(&(&p->tcfa_lock)->rlock);
[  162.143957]                                lock(psample_groups_lock);
[  162.150813]   <Interrupt>
[  162.153808]     lock(&(&p->tcfa_lock)->rlock);
[  162.158608]
                *** DEADLOCK ***

In order to prevent potential lock inversion dependency between tcf_lock
and psample_groups_lock, extract call to psample_group_get() from tcf_lock
protected section in sample action init function.

Fixes: 4e232818bd ("net: sched: act_mirred: remove dependency on rtnl lock")
Fixes: 764e9a2448 ("net: sched: act_vlan: remove dependency on rtnl lock")
Fixes: 729e012609 ("net: sched: act_tunnel_key: remove dependency on rtnl lock")
Fixes: d772849566 ("net: sched: act_sample: remove dependency on rtnl lock")
Fixes: e8917f4370 ("net: sched: act_gact: remove dependency on rtnl lock")
Fixes: b6a2b971c0 ("net: sched: act_csum: remove dependency on rtnl lock")
Fixes: 2142236b45 ("net: sched: act_bpf: remove dependency on rtnl lock")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-19 10:46:21 -07:00
Haishuang Yan
bb107456ba ip6_vti: simplify stats handling in vti6_xmit
Same as ip_vti, use iptunnel_xmit_stats to updates stats in tunnel xmit
code path.

Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-18 13:46:54 -07:00
David S. Miller
6e3bf9b04f Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-08-18

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix a BPF selftest failure in test_cgroup_storage due to rlimit
   restrictions, from Yonghong.

2) Fix a suspicious RCU rcu_dereference_check() warning triggered
   from removing a device's XDP memory allocator by using the correct
   rhashtable lookup function, from Tariq.

3) A batch of BPF sockmap and ULP fixes mainly fixing leaks and races
   as well as enforcing module aliases for ULPs. Another fix for BPF
   map redirect to make them work again with tail calls, from Daniel.

4) Fix XDP BPF samples to unload their programs upon SIGTERM, from Jesper.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-18 10:02:49 -07:00
David S. Miller
3fe49d699a Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter/IPVS fixes for net

The following patchset contains Netfilter/IPVS fixes for your net tree:

1) Infinite loop in IPVS when net namespace is released, from
   Tan Hu.

2) Do not show negative timeouts in ip_vs_conn by using the new
   jiffies_delta_to_msecs(), patches from Matteo Croce.

3) Set F_IFACE flag for linklocal addresses in ip6t_rpfilter,
   from Florian Westphal.

4) Fix overflow in set size allocation, from Taehee Yoo.

5) Use netlink_dump_start() from ctnetlink to fix memleak from
   the error path, again from Florian.

6) Register nfnetlink_subsys in last place, otherwise netns
   init path may lose race and see net->nft uninitialized data.
   This also reverts previous attempt to fix this by increase
   netns refcount, patches from Florian.

7) Remove conntrack entries on layer 4 protocol tracker module
   removal, from Florian.

8) Use GFP_KERNEL_ACCOUNT for xtables blob allocation, from
   Michal Hocko.

9) Get tproxy documentation in sync with existing codebase,
   from Mate Eckl.

10) Honor preset layer 3 protocol via ctx->family in the new nft_ct
    timeout infrastructure, from Harsha Sharma.

11) Let uapi nfnetlink_osf.h compile standalone with no errors,
    from Dmitry V. Levin.

12) Missing braces compilation warning in nft_tproxy, patch from
    Mate Eclk.

13) Disregard bogus check to bail out on non-anonymous sets from
    the dynamic set update extension.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-18 09:59:19 -07:00
Linus Torvalds
1f7a4c73a7 Pull request for inclusion in 4.19, take two
This tag is the same as 9p-for-4.19 without the two MAINTAINERS patches
 
 Contains mostly fixes (6 to be backported to stable) and a few changes,
 here is the breakdown:
  * Rework how fids are attributed by replacing some custom tracking in a
 list by an idr (f28cdf0430)
  * For packet-based transports (virtio/rdma) validate that the packet
 length matches what the header says (f984579a01)
  * A few race condition fixes found by syzkaller (9f476d7c54,
 430ac66eb4)
  * Missing argument check when NULL device is passed in sys_mount
 (10aa14527f)
  * A few virtio fixes (23cba9cbde, 31934da810, d28c756cae)
  * Some spelling and style fixes
 
 ----------------------------------------------------------------
 Chirantan Ekbote (1):
       9p/net: Fix zero-copy path in the 9p virtio transport
 
 Colin Ian King (1):
       fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"
 
 Jean-Philippe Brucker (1):
       net/9p: fix error path of p9_virtio_probe
 
 Matthew Wilcox (4):
       9p: Fix comment on smp_wmb
       9p: Change p9_fid_create calling convention
       9p: Replace the fidlist with an IDR
       9p: Embed wait_queue_head into p9_req_t
 
 Souptick Joarder (1):
       fs/9p/vfs_file.c: use new return type vm_fault_t
 
 Stephen Hemminger (1):
       9p: fix whitespace issues
 
 Tomas Bortoli (5):
       net/9p/client.c: version pointer uninitialized
       net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
       net/9p/trans_fd.c: fix race by holding the lock
       9p: validate PDU length
       9p: fix multiple NULL-pointer-dereferences
 
 jiangyiwen (2):
       net/9p/virtio: Fix hard lockup in req_done
       9p/virtio: fix off-by-one error in sg list bounds check
 
 piaojun (5):
       net/9p/client.c: add missing '\n' at the end of p9_debug()
       9p/net/protocol.c: return -ENOMEM when kmalloc() failed
       net/9p/trans_virtio.c: fix some spell mistakes in comments
       fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
       net/9p/trans_virtio.c: add null terminal for mount tag
 
  fs/9p/v9fs.c            |   2 +-
  fs/9p/vfs_file.c        |   2 +-
  fs/9p/xattr.c           |   6 ++++--
  include/net/9p/client.h |  11 ++++-------
  net/9p/client.c         | 119 +++++++++++++++++++++++++++++++++++++++++++++++++-------------------------------------------------------------------
  net/9p/protocol.c       |   2 +-
  net/9p/trans_fd.c       |  22 +++++++++++++++-------
  net/9p/trans_rdma.c     |   4 ++++
  net/9p/trans_virtio.c   |  66 +++++++++++++++++++++++++++++++++++++---------------------------
  net/9p/trans_xen.c      |   3 +++
  net/9p/util.c           |   1 -
  12 files changed, 122 insertions(+), 116 deletions(-)
 -----BEGIN PGP SIGNATURE-----
 
 iF0EABECAB0WIQQ8idm2ZSicIMLgzKqoqIItDqvwPAUCW3ElNwAKCRCoqIItDqvw
 PMzfAKCkCYFyNC89vcpxcCNsK7rFQ1qKlwCgoaBpZDdegOu0jMB7cyKwAWrB0LM=
 =h3T0
 -----END PGP SIGNATURE-----

Merge tag '9p-for-4.19-2' of git://github.com/martinetd/linux

Pull 9p updates from Dominique Martinet:
 "This contains mostly fixes (6 to be backported to stable) and a few
  changes, here is the breakdown:

   - rework how fids are attributed by replacing some custom tracking in
     a list by an idr

   - for packet-based transports (virtio/rdma) validate that the packet
     length matches what the header says

   - a few race condition fixes found by syzkaller

   - missing argument check when NULL device is passed in sys_mount

   - a few virtio fixes

   - some spelling and style fixes"

* tag '9p-for-4.19-2' of git://github.com/martinetd/linux: (21 commits)
  net/9p/trans_virtio.c: add null terminal for mount tag
  9p/virtio: fix off-by-one error in sg list bounds check
  9p: fix whitespace issues
  9p: fix multiple NULL-pointer-dereferences
  fs/9p/xattr.c: catch the error of p9_client_clunk when setting xattr failed
  9p: validate PDU length
  net/9p/trans_fd.c: fix race by holding the lock
  net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
  net/9p/virtio: Fix hard lockup in req_done
  net/9p/trans_virtio.c: fix some spell mistakes in comments
  9p/net: Fix zero-copy path in the 9p virtio transport
  9p: Embed wait_queue_head into p9_req_t
  9p: Replace the fidlist with an IDR
  9p: Change p9_fid_create calling convention
  9p: Fix comment on smp_wmb
  net/9p/client.c: version pointer uninitialized
  fs/9p/v9fs.c: fix spelling mistake "Uknown" -> "Unknown"
  net/9p: fix error path of p9_virtio_probe
  9p/net/protocol.c: return -ENOMEM when kmalloc() failed
  net/9p/client.c: add missing '\n' at the end of p9_debug()
  ...
2018-08-17 17:27:58 -07:00
Daniel Borkmann
f6069b9aa9 bpf: fix redirect to map under tail calls
Commits 109980b894 ("bpf: don't select potentially stale ri->map
from buggy xdp progs") and 7c30013133 ("bpf: fix ri->map_owner
pointer on bpf_prog_realloc") tried to mitigate that buggy programs
using bpf_redirect_map() helper call do not leave stale maps behind.
Idea was to add a map_owner cookie into the per CPU struct redirect_info
which was set to prog->aux by the prog making the helper call as a
proof that the map is not stale since the prog is implicitly holding
a reference to it. This owner cookie could later on get compared with
the program calling into BPF whether they match and therefore the
redirect could proceed with processing the map safely.

In (obvious) hindsight, this approach breaks down when tail calls are
involved since the original caller's prog->aux pointer does not have
to match the one from one of the progs out of the tail call chain,
and therefore the xdp buffer will be dropped instead of redirected.
A way around that would be to fix the issue differently (which also
allows to remove related work in fast path at the same time): once
the life-time of a redirect map has come to its end we use it's map
free callback where we need to wait on synchronize_rcu() for current
outstanding xdp buffers and remove such a map pointer from the
redirect info if found to be present. At that time no program is
using this map anymore so we simply invalidate the map pointers to
NULL iff they previously pointed to that instance while making sure
that the redirect path only reads out the map once.

Fixes: 97f91a7cf0 ("bpf: add bpf_redirect_map helper routine")
Fixes: 109980b894 ("bpf: don't select potentially stale ri->map from buggy xdp progs")
Reported-by: Sebastiano Miano <sebastiano.miano@polito.it>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-17 15:56:23 -07:00
Jason Gunthorpe
b4c296f9c9 RDMA/smc: Replace ib_query_gid with rdma_get_gid_attr
All RDMA ULPs should be using rdma_get_gid_attr instead of
ib_query_gid. Convert SMC to use the new API.

In the process correct some confusion with gid_type - if attr->ndev is
!NULL then gid_type can never be IB_GID_TYPE_IB by
definition. IB_GID_TYPE_ROCE shares the same enum value and is probably
what was intended here.

Reviewed-by: Parav Pandit <parav@mellanox.com>
Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-17 16:45:51 -06:00
Daniel Borkmann
90545cdc3f tcp, ulp: fix leftover icsk_ulp_ops preventing sock from reattach
I found that in BPF sockmap programs once we either delete a socket
from the map or we updated a map slot and the old socket was purged
from the map that these socket can never get reattached into a map
even though their related psock has been dropped entirely at that
point.

Reason is that tcp_cleanup_ulp() leaves the old icsk->icsk_ulp_ops
intact, so that on the next tcp_set_ulp_id() the kernel returns an
-EEXIST thinking there is still some active ULP attached.

BPF sockmap is the only one that has this issue as the other user,
kTLS, only calls tcp_cleanup_ulp() from tcp_v4_destroy_sock() whereas
sockmap semantics allow dropping the socket from the map with all
related psock state being cleaned up.

Fixes: 1aa12bdf1b ("bpf: sockmap, add sock close() hook to remove socks")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-16 14:58:08 -07:00
Daniel Borkmann
037b0b86ec tcp, ulp: add alias for all ulp modules
Lets not turn the TCP ULP lookup into an arbitrary module loader as
we only intend to load ULP modules through this mechanism, not other
unrelated kernel modules:

  [root@bar]# cat foo.c
  #include <sys/types.h>
  #include <sys/socket.h>
  #include <linux/tcp.h>
  #include <linux/in.h>

  int main(void)
  {
      int sock = socket(PF_INET, SOCK_STREAM, 0);
      setsockopt(sock, IPPROTO_TCP, TCP_ULP, "sctp", sizeof("sctp"));
      return 0;
  }

  [root@bar]# gcc foo.c -O2 -Wall
  [root@bar]# lsmod | grep sctp
  [root@bar]# ./a.out
  [root@bar]# lsmod | grep sctp
  sctp                 1077248  4
  libcrc32c              16384  3 nf_conntrack,nf_nat,sctp
  [root@bar]#

Fix it by adding module alias to TCP ULP modules, so probing module
via request_module() will be limited to tcp-ulp-[name]. The existing
modules like kTLS will load fine given tcp-ulp-tls alias, but others
will fail to load:

  [root@bar]# lsmod | grep sctp
  [root@bar]# ./a.out
  [root@bar]# lsmod | grep sctp
  [root@bar]#

Sockmap is not affected from this since it's either built-in or not.

Fixes: 734942cc4e ("tcp: ULP infrastructure")
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: John Fastabend <john.fastabend@gmail.com>
Acked-by: Song Liu <songliubraving@fb.com>
Signed-off-by: Alexei Starovoitov <ast@kernel.org>
2018-08-16 14:58:07 -07:00
Jason Gunthorpe
0a3173a5f0 Merge branch 'linus/master' into rdma.git for-next
rdma.git merge resolution for the 4.19 merge window

Conflicts:
 drivers/infiniband/core/rdma_core.c
   - Use the rdma code and revise with the new spelling for
     atomic_fetch_add_unless
 drivers/nvme/host/rdma.c
   - Replace max_sge with max_send_sge in new blk code
 drivers/nvme/target/rdma.c
   - Use the blk code and revise to use NULL for ib_post_recv when
     appropriate
   - Replace max_sge with max_recv_sge in new blk code
 net/rds/ib_send.c
   - Use the net code and revise to use NULL for ib_post_recv when
     appropriate

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-16 14:21:29 -06:00
Jason Gunthorpe
92f4e77c85 Revert "net/smc: Replace ib_query_gid with rdma_get_gid_attr"
This reverts commit ddb457c699.

The include rdma/ib_cache.h is kept, and we have to add a memset
to the compat wrapper to avoid compiler warnings in gcc-7

This revert is done to avoid extensive merge conflicts with SMC
changes in netdev during the 4.19 merge window.

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-16 14:03:34 -06:00
Tariq Toukan
21b172ee11 net/xdp: Fix suspicious RCU usage warning
Fix the warning below by calling rhashtable_lookup_fast.
Also, make some code movements for better quality and human
readability.

[  342.450870] WARNING: suspicious RCU usage
[  342.455856] 4.18.0-rc2+ #17 Tainted: G           O
[  342.462210] -----------------------------
[  342.467202] ./include/linux/rhashtable.h:481 suspicious rcu_dereference_check() usage!
[  342.476568]
[  342.476568] other info that might help us debug this:
[  342.476568]
[  342.486978]
[  342.486978] rcu_scheduler_active = 2, debug_locks = 1
[  342.495211] 4 locks held by modprobe/3934:
[  342.500265]  #0: 00000000e23116b2 (mlx5_intf_mutex){+.+.}, at:
mlx5_unregister_interface+0x18/0x90 [mlx5_core]
[  342.511953]  #1: 00000000ca16db96 (rtnl_mutex){+.+.}, at: unregister_netdev+0xe/0x20
[  342.521109]  #2: 00000000a46e2c4b (&priv->state_lock){+.+.}, at: mlx5e_close+0x29/0x60
[mlx5_core]
[  342.531642]  #3: 0000000060c5bde3 (mem_id_lock){+.+.}, at: xdp_rxq_info_unreg+0x93/0x6b0
[  342.541206]
[  342.541206] stack backtrace:
[  342.547075] CPU: 12 PID: 3934 Comm: modprobe Tainted: G           O      4.18.0-rc2+ #17
[  342.556621] Hardware name: Dell Inc. PowerEdge R730/0H21J3, BIOS 1.5.4 10/002/2015
[  342.565606] Call Trace:
[  342.568861]  dump_stack+0x78/0xb3
[  342.573086]  xdp_rxq_info_unreg+0x3f5/0x6b0
[  342.578285]  ? __call_rcu+0x220/0x300
[  342.582911]  mlx5e_free_rq+0x38/0xc0 [mlx5_core]
[  342.588602]  mlx5e_close_channel+0x20/0x120 [mlx5_core]
[  342.594976]  mlx5e_close_channels+0x26/0x40 [mlx5_core]
[  342.601345]  mlx5e_close_locked+0x44/0x50 [mlx5_core]
[  342.607519]  mlx5e_close+0x42/0x60 [mlx5_core]
[  342.613005]  __dev_close_many+0xb1/0x120
[  342.617911]  dev_close_many+0xa2/0x170
[  342.622622]  rollback_registered_many+0x148/0x460
[  342.628401]  ? __lock_acquire+0x48d/0x11b0
[  342.633498]  ? unregister_netdev+0xe/0x20
[  342.638495]  rollback_registered+0x56/0x90
[  342.643588]  unregister_netdevice_queue+0x7e/0x100
[  342.649461]  unregister_netdev+0x18/0x20
[  342.654362]  mlx5e_remove+0x2a/0x50 [mlx5_core]
[  342.659944]  mlx5_remove_device+0xe5/0x110 [mlx5_core]
[  342.666208]  mlx5_unregister_interface+0x39/0x90 [mlx5_core]
[  342.673038]  cleanup+0x5/0xbfc [mlx5_core]
[  342.678094]  __x64_sys_delete_module+0x16b/0x240
[  342.683725]  ? do_syscall_64+0x1c/0x210
[  342.688476]  do_syscall_64+0x5a/0x210
[  342.693025]  entry_SYSCALL_64_after_hwframe+0x49/0xbe

Fixes: 8d5d885275 ("xdp: rhashtable with allocator ID to pointer mapping")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Suggested-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: Jesper Dangaard Brouer <brouer@redhat.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-16 21:55:21 +02:00
Vlad Buslov
32039eac4c net: sched: act_ife: always release ife action on init error
Action init API was changed to always take reference to action, even when
overwriting existing action. Substitute conditional action release, which
was executed only if action is newly created, with unconditional release in
tcf_ife_init() error handling code to prevent double free or memory leak in
case of overwrite.

Fixes: 4e8ddd7f17 ("net: sched: don't release reference on action overwrite")
Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-16 12:12:12 -07:00
Jason Gunthorpe
89982f7cce Linux 4.18
-----BEGIN PGP SIGNATURE-----
 
 iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAltwm2geHHRvcnZhbGRz
 QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGITkH/iSzkVhT2OxHoir0
 mLVzTi7/Z17L0e/ELl7TvAC0iLFlWZKdlGR0g3b4/QpXLPmNK4HxiDRTQuWn8ke0
 qDZyDq89HqLt+mpeFZ43PCd9oqV8CH2xxK3iCWReqv6bNnowGnRpSStlks4rDqWn
 zURC/5sUh7TzEG4s997RrrpnyPeQWUlf/Mhtzg2/WvK2btoLWgu5qzjX1uFh3s7u
 vaF2NXVJ3X03gPktyxZzwtO1SwLFS1jhwUXWBZ5AnoJ99ywkghQnkqS/2YpekNTm
 wFk80/78sU+d91aAqO8kkhHj8VRrd+9SGnZ4mB2aZHwjZjGcics4RRtxukSfOQ+6
 L47IdXo=
 =sJkt
 -----END PGP SIGNATURE-----

Merge tag 'v4.18' into rdma.git for-next

Resolve merge conflicts from the -rc cycle against the rdma.git tree:

Conflicts:
 drivers/infiniband/core/uverbs_cmd.c
  - New ifs added to ib_uverbs_ex_create_flow in -rc and for-next
  - Merge removal of file->ucontext in for-next with new code in -rc
 drivers/infiniband/core/uverbs_main.c
  - for-next removed code from ib_uverbs_write() that was modified
    in for-rc

Signed-off-by: Jason Gunthorpe <jgg@mellanox.com>
2018-08-16 13:12:00 -06:00
Hangbin Liu
a51c76b4df cls_matchall: fix tcf_unbind_filter missing
Fix tcf_unbind_filter missing in cls_matchall as this will trigger
WARN_ON() in cbq_destroy_class().

Fixes: fd62d9f5c5 ("net/sched: matchall: Fix configuration race")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-16 12:08:26 -07:00
Pablo Neira Ayuso
feb9f55c33 netfilter: nft_dynset: allow dynamic updates of non-anonymous set
This check is superfluous since it breaks valid configurations, remove it.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-16 19:37:11 +02:00
Máté Eckl
90d827f06b netfilter: nft_tproxy: Fix missing-braces warning
This patch fixes a warning reported by the kbuild test robot (from linux-next
tree):
   net/netfilter/nft_tproxy.c: In function 'nft_tproxy_eval_v6':
>> net/netfilter/nft_tproxy.c:85:9: warning: missing braces around initializer [-Wmissing-braces]
     struct in6_addr taddr = {0};
            ^
   net/netfilter/nft_tproxy.c:85:9: warning: (near initialization for 'taddr.in6_u') [-Wmissing-braces]

This warning is actually caused by a gcc bug already resolved in newer
versions (kbuild used 4.9) so this kind of initialization is omitted and
memset is used instead.

Fixes: 4ed8eb6570 ("netfilter: nf_tables: Add native tproxy support")
Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-16 19:37:10 +02:00
Harsha Sharma
3206c516ce netfilter: nft_ct: make l3 protocol field optional for timeout object
If l3 protocol value is not specified for ct timeout object then use the
value from nft_ctx protocol family.

Signed-off-by: Harsha Sharma <harshasharmaiitr@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-16 19:37:08 +02:00
Michal Hocko
a148ce1537 netfilter: x_tables: do not fail xt_alloc_table_info too easilly
eacd86ca3b ("net/netfilter/x_tables.c: use kvmalloc()
in xt_alloc_table_info()") has unintentionally fortified
xt_alloc_table_info allocation when __GFP_RETRY has been dropped from
the vmalloc fallback. Later on there was a syzbot report that this
can lead to OOM killer invocations when tables are too large and
0537250fdc ("netfilter: x_tables: make allocation less aggressive")
has been merged to restore the original behavior. Georgi Nikolov however
noticed that he is not able to install his iptables anymore so this can
be seen as a regression.

The primary argument for 0537250fdc was that this allocation path
shouldn't really trigger the OOM killer and kill innocent tasks. On the
other hand the interface requires root and as such should allow what the
admin asks for. Root inside a namespaces makes this more complicated
because those might be not trusted in general. If they are not then such
namespaces should be restricted anyway. Therefore drop the __GFP_NORETRY
and replace it by __GFP_ACCOUNT to enfore memcg constrains on it.

Fixes: 0537250fdc ("netfilter: x_tables: make allocation less aggressive")
Reported-by: Georgi Nikolov <gnikolov@icdsoft.com>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Acked-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Michal Hocko <mhocko@suse.com>
Acked-by: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-16 19:37:05 +02:00
Florian Westphal
1c117d3b72 netfilter: conntrack: fix removal of conntrack entries when l4tracker is removed
nf_ct_l4proto_unregister_one() leaves conntracks added by
to-be-removed tracker behind, nf_ct_l4proto_unregister has to iterate
for each protocol to be removed.

v2: call nf_ct_iterate_destroy without holding nf_ct_proto_mutex.

Fixes: 2c41f33c1b ("netfilter: move table iteration out of netns exit paths")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-16 19:37:04 +02:00
Florian Westphal
6a48de0144 netfilter: nf_tables: don't prevent event handler from device cleanup on netns exit
When a netnsamespace exits, the nf_tables pernet_ops will remove all rules.
However, there is one caveat:

Base chains that register ingress hooks will cause use-after-free:
device is already gone at that point.

The device event handlers prevent this from happening:
netns exit synthesizes unregister events for all devices.

However, an improper fix for a race condition made the notifiers a no-op
in case they get called from netns exit path, so revert that part.

This is safe now as the previous patch fixed nf_tables pernet ops
and device notifier initialisation ordering.

Fixes: 0a2cf5ee43 ("netfilter: nf_tables: close race between netns exit and rmmod")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-16 19:37:03 +02:00
Florian Westphal
d209df3e7f netfilter: nf_tables: fix register ordering
We must register nfnetlink ops last, as that exposes nf_tables to
userspace.  Without this, we could theoretically get nfnetlink request
before net->nft state has been initialized.

Fixes: 99633ab29b ("netfilter: nf_tables: complete net namespace support")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-16 19:37:02 +02:00
Florian Westphal
3e673b23b5 netfilter: fix memory leaks on netlink_dump_start error
Shaochun Chen points out we leak dumper filter state allocations
stored in dump_control->data in case there is an error before netlink sets
cb_running (after which ->done will be called at some point).

In order to fix this, add .start functions and move allocations there.

Same pattern as used in commit 90fd131afc
("netfilter: nf_tables: move dumper state allocation into ->start").

Reported-by: shaochun chen <cscnull@gmail.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-16 19:37:00 +02:00
Taehee Yoo
4ef360dd6a netfilter: nft_set: fix allocation size overflow in privsize callback.
In order to determine allocation size of set, ->privsize is invoked.
At this point, both desc->size and size of each data structure of set
are used. desc->size means number of element that is given by user.
desc->size is u32 type. so that upperlimit of set element is 4294967295.
but return type of ->privsize is also u32. hence overflow can occurred.

test commands:
   %nft add table ip filter
   %nft add set ip filter hash1 { type ipv4_addr \; size 4294967295 \; }
   %nft list ruleset

splat looks like:
[ 1239.202910] kasan: CONFIG_KASAN_INLINE enabled
[ 1239.208788] kasan: GPF could be caused by NULL-ptr deref or user memory access
[ 1239.217625] general protection fault: 0000 [#1] SMP DEBUG_PAGEALLOC KASAN PTI
[ 1239.219329] CPU: 0 PID: 1603 Comm: nft Not tainted 4.18.0-rc5+ #7
[ 1239.229091] RIP: 0010:nft_hash_walk+0x1d2/0x310 [nf_tables_set]
[ 1239.229091] Code: 84 d2 7f 10 4c 89 e7 89 44 24 38 e8 d8 5a 17 e0 8b 44 24 38 48 8d 7b 10 41 0f b6 0c 24 48 89 fa 48 89 fe 48 c1 ea 03 83 e6 07 <42> 0f b6 14 3a 40 38 f2 7f 1a 84 d2 74 16
[ 1239.229091] RSP: 0018:ffff8801118cf358 EFLAGS: 00010246
[ 1239.229091] RAX: 0000000000000000 RBX: 0000000000020400 RCX: 0000000000000001
[ 1239.229091] RDX: 0000000000004082 RSI: 0000000000000000 RDI: 0000000000020410
[ 1239.229091] RBP: ffff880114d5a988 R08: 0000000000007e94 R09: ffff880114dd8030
[ 1239.229091] R10: ffff880114d5a988 R11: ffffed00229bb006 R12: ffff8801118cf4d0
[ 1239.229091] R13: ffff8801118cf4d8 R14: 0000000000000000 R15: dffffc0000000000
[ 1239.229091] FS:  00007f5a8fe0b700(0000) GS:ffff88011b600000(0000) knlGS:0000000000000000
[ 1239.229091] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1239.229091] CR2: 00007f5a8ecc27b0 CR3: 000000010608e000 CR4: 00000000001006f0
[ 1239.229091] Call Trace:
[ 1239.229091]  ? nft_hash_remove+0xf0/0xf0 [nf_tables_set]
[ 1239.229091]  ? memset+0x1f/0x40
[ 1239.229091]  ? __nla_reserve+0x9f/0xb0
[ 1239.229091]  ? memcpy+0x34/0x50
[ 1239.229091]  nf_tables_dump_set+0x9a1/0xda0 [nf_tables]
[ 1239.229091]  ? __kmalloc_reserve.isra.29+0x2e/0xa0
[ 1239.229091]  ? nft_chain_hash_obj+0x630/0x630 [nf_tables]
[ 1239.229091]  ? nf_tables_commit+0x2c60/0x2c60 [nf_tables]
[ 1239.229091]  netlink_dump+0x470/0xa20
[ 1239.229091]  __netlink_dump_start+0x5ae/0x690
[ 1239.229091]  nft_netlink_dump_start_rcu+0xd1/0x160 [nf_tables]
[ 1239.229091]  nf_tables_getsetelem+0x2e5/0x4b0 [nf_tables]
[ 1239.229091]  ? nft_get_set_elem+0x440/0x440 [nf_tables]
[ 1239.229091]  ? nft_chain_hash_obj+0x630/0x630 [nf_tables]
[ 1239.229091]  ? nf_tables_dump_obj_done+0x70/0x70 [nf_tables]
[ 1239.229091]  ? nla_parse+0xab/0x230
[ 1239.229091]  ? nft_get_set_elem+0x440/0x440 [nf_tables]
[ 1239.229091]  nfnetlink_rcv_msg+0x7f0/0xab0 [nfnetlink]
[ 1239.229091]  ? nfnetlink_bind+0x1d0/0x1d0 [nfnetlink]
[ 1239.229091]  ? debug_show_all_locks+0x290/0x290
[ 1239.229091]  ? sched_clock_cpu+0x132/0x170
[ 1239.229091]  ? find_held_lock+0x39/0x1b0
[ 1239.229091]  ? sched_clock_local+0x10d/0x130
[ 1239.229091]  netlink_rcv_skb+0x211/0x320
[ 1239.229091]  ? nfnetlink_bind+0x1d0/0x1d0 [nfnetlink]
[ 1239.229091]  ? netlink_ack+0x7b0/0x7b0
[ 1239.229091]  ? ns_capable_common+0x6e/0x110
[ 1239.229091]  nfnetlink_rcv+0x2d1/0x310 [nfnetlink]
[ 1239.229091]  ? nfnetlink_rcv_batch+0x10f0/0x10f0 [nfnetlink]
[ 1239.229091]  ? netlink_deliver_tap+0x829/0x930
[ 1239.229091]  ? lock_acquire+0x265/0x2e0
[ 1239.229091]  netlink_unicast+0x406/0x520
[ 1239.509725]  ? netlink_attachskb+0x5b0/0x5b0
[ 1239.509725]  ? find_held_lock+0x39/0x1b0
[ 1239.509725]  netlink_sendmsg+0x987/0xa20
[ 1239.509725]  ? netlink_unicast+0x520/0x520
[ 1239.509725]  ? _copy_from_user+0xa9/0xc0
[ 1239.509725]  __sys_sendto+0x21a/0x2c0
[ 1239.509725]  ? __ia32_sys_getpeername+0xa0/0xa0
[ 1239.509725]  ? retint_kernel+0x10/0x10
[ 1239.509725]  ? sched_clock_cpu+0x132/0x170
[ 1239.509725]  ? find_held_lock+0x39/0x1b0
[ 1239.509725]  ? lock_downgrade+0x540/0x540
[ 1239.509725]  ? up_read+0x1c/0x100
[ 1239.509725]  ? __do_page_fault+0x763/0x970
[ 1239.509725]  ? retint_user+0x18/0x18
[ 1239.509725]  __x64_sys_sendto+0x177/0x180
[ 1239.509725]  do_syscall_64+0xaa/0x360
[ 1239.509725]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 1239.509725] RIP: 0033:0x7f5a8f468e03
[ 1239.509725] Code: 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb d0 0f 1f 84 00 00 00 00 00 83 3d 49 c9 2b 00 00 75 13 49 89 ca b8 2c 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8
[ 1239.509725] RSP: 002b:00007ffd78d0b778 EFLAGS: 00000246 ORIG_RAX: 000000000000002c
[ 1239.509725] RAX: ffffffffffffffda RBX: 00007ffd78d0c890 RCX: 00007f5a8f468e03
[ 1239.509725] RDX: 0000000000000034 RSI: 00007ffd78d0b7e0 RDI: 0000000000000003
[ 1239.509725] RBP: 00007ffd78d0b7d0 R08: 00007f5a8f15c160 R09: 000000000000000c
[ 1239.509725] R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffd78d0b7e0
[ 1239.509725] R13: 0000000000000034 R14: 00007f5a8f9aff60 R15: 00005648040094b0
[ 1239.509725] Modules linked in: nf_tables_set nf_tables nfnetlink ip_tables x_tables
[ 1239.670713] ---[ end trace 39375adcda140f11 ]---
[ 1239.676016] RIP: 0010:nft_hash_walk+0x1d2/0x310 [nf_tables_set]
[ 1239.682834] Code: 84 d2 7f 10 4c 89 e7 89 44 24 38 e8 d8 5a 17 e0 8b 44 24 38 48 8d 7b 10 41 0f b6 0c 24 48 89 fa 48 89 fe 48 c1 ea 03 83 e6 07 <42> 0f b6 14 3a 40 38 f2 7f 1a 84 d2 74 16
[ 1239.705108] RSP: 0018:ffff8801118cf358 EFLAGS: 00010246
[ 1239.711115] RAX: 0000000000000000 RBX: 0000000000020400 RCX: 0000000000000001
[ 1239.719269] RDX: 0000000000004082 RSI: 0000000000000000 RDI: 0000000000020410
[ 1239.727401] RBP: ffff880114d5a988 R08: 0000000000007e94 R09: ffff880114dd8030
[ 1239.735530] R10: ffff880114d5a988 R11: ffffed00229bb006 R12: ffff8801118cf4d0
[ 1239.743658] R13: ffff8801118cf4d8 R14: 0000000000000000 R15: dffffc0000000000
[ 1239.751785] FS:  00007f5a8fe0b700(0000) GS:ffff88011b600000(0000) knlGS:0000000000000000
[ 1239.760993] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1239.767560] CR2: 00007f5a8ecc27b0 CR3: 000000010608e000 CR4: 00000000001006f0
[ 1239.775679] Kernel panic - not syncing: Fatal exception
[ 1239.776630] Kernel Offset: 0x1f000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 1239.776630] Rebooting in 5 seconds..

Fixes: 20a69341f2 ("netfilter: nf_tables: add netlink set API")
Signed-off-by: Taehee Yoo <ap420073@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-16 19:36:59 +02:00
Florian Westphal
da786717e0 netfilter: ip6t_rpfilter: set F_IFACE for linklocal addresses
Roman reports that DHCPv6 client no longer sees replies from server
due to

ip6tables -t raw -A PREROUTING -m rpfilter --invert -j DROP

rule.  We need to set the F_IFACE flag for linklocal addresses, they
are scoped per-device.

Fixes: 47b7e7f828 ("netfilter: don't set F_IFACE on ipv6 fib lookups")
Reported-by: Roman Mamedov <rm@romanrm.net>
Tested-by: Roman Mamedov <rm@romanrm.net>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-16 19:36:58 +02:00
Matteo Croce
b71ed54dc2 ipvs: don't show negative times in ip_vs_conn
Since commit 500462a9de ("timers: Switch to a non-cascading wheel"),
timers duration can last even 12.5% more than the scheduled interval.

IPVS has two handlers, /proc/net/ip_vs_conn and /proc/net/ip_vs_conn_sync,
which shows the remaining time before that a connection expires.
The default expire time for a connection is 60 seconds, and the
expiration timer can fire even 4 seconds later than the scheduled time.
The expiration time is calculated subtracting jiffies to the scheduled
expiration time, and it's shown as a huge number when the timer fires late,
since both values are unsigned.

This can confuse script and tools which relies on it, like ipvsadm:

    root@mcroce-redhat:~# while ipvsadm -lc |grep SYN_RECV; do sleep 1 ; done
    TCP 00:05  SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
    TCP 00:04  SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
    TCP 00:03  SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
    TCP 00:02  SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
    TCP 00:01  SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
    TCP 00:00  SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
    TCP 68719476:44 SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
    TCP 68719476:43 SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
    TCP 68719476:42 SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
    TCP 68719476:41 SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
    TCP 68719476:40 SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000
    TCP 68719476:39 SYN_RECV    [fc00:1::1]:55732  [fc00:1::2]:8000   [fc00:2000::1]:8000

Signed-off-by: Matteo Croce <mcroce@redhat.com>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-16 19:36:57 +02:00
Tan Hu
a53b42c118 ipvs: fix race between ip_vs_conn_new() and ip_vs_del_dest()
We came across infinite loop in ipvs when using ipvs in docker
env.

When ipvs receives new packets and cannot find an ipvs connection,
it will create a new connection, then if the dest is unavailable
(i.e. IP_VS_DEST_F_AVAILABLE), the packet will be dropped sliently.

But if the dropped packet is the first packet of this connection,
the connection control timer never has a chance to start and the
ipvs connection cannot be released. This will lead to memory leak, or
infinite loop in cleanup_net() when net namespace is released like
this:

    ip_vs_conn_net_cleanup at ffffffffa0a9f31a [ip_vs]
    __ip_vs_cleanup at ffffffffa0a9f60a [ip_vs]
    ops_exit_list at ffffffff81567a49
    cleanup_net at ffffffff81568b40
    process_one_work at ffffffff810a851b
    worker_thread at ffffffff810a9356
    kthread at ffffffff810b0b6f
    ret_from_fork at ffffffff81697a18

race condition:
    CPU1                           CPU2
    ip_vs_in()
      ip_vs_conn_new()
                                   ip_vs_del_dest()
                                     __ip_vs_unlink_dest()
                                       ~IP_VS_DEST_F_AVAILABLE
      cp->dest && !IP_VS_DEST_F_AVAILABLE
      __ip_vs_conn_put
    ...
    cleanup_net  ---> infinite looping

Fix this by checking whether the timer already started.

Signed-off-by: Tan Hu <tan.hu@zte.com.cn>
Reviewed-by: Jiang Biao <jiang.biao2@zte.com.cn>
Acked-by: Julian Anastasov <ja@ssi.bg>
Acked-by: Simon Horman <horms@verge.net.au>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-16 19:36:51 +02:00
Linus Torvalds
dafa5f6577 Merge branch 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
Pull crypto updates from Herbert Xu:
 "API:
   - Fix dcache flushing crash in skcipher.
   - Add hash finup self-tests.
   - Reschedule during speed tests.

  Algorithms:
   - Remove insecure vmac and replace it with vmac64.
   - Add public key verification for DH/ECDH.

  Drivers:
   - Decrease priority of sha-mb on x86.
   - Improve NEON latency/throughput on ARM64.
   - Add md5/sha384/sha512/des/3des to inside-secure.
   - Support eip197d in inside-secure.
   - Only register algorithms supported by the host in virtio.
   - Add cts and remove incompatible cts1 from ccree.
   - Add hisilicon SEC security accelerator driver.
   - Replace msm hwrng driver with qcom pseudo rng driver.

  Misc:
   - Centralize CRC polynomials"

* 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6: (121 commits)
  crypto: arm64/ghash-ce - implement 4-way aggregation
  crypto: arm64/ghash-ce - replace NEON yield check with block limit
  crypto: hisilicon - sec_send_request() can be static
  lib/mpi: remove redundant variable esign
  crypto: arm64/aes-ce-gcm - don't reload key schedule if avoidable
  crypto: arm64/aes-ce-gcm - implement 2-way aggregation
  crypto: arm64/aes-ce-gcm - operate on two input blocks at a time
  crypto: dh - make crypto_dh_encode_key() make robust
  crypto: dh - fix calculating encoded key size
  crypto: ccp - Check for NULL PSP pointer at module unload
  crypto: arm/chacha20 - always use vrev for 16-bit rotates
  crypto: ccree - allow bigger than sector XTS op
  crypto: ccree - zero all of request ctx before use
  crypto: ccree - remove cipher ivgen left overs
  crypto: ccree - drop useless type flag during reg
  crypto: ablkcipher - fix crash flushing dcache in error path
  crypto: blkcipher - fix crash flushing dcache in error path
  crypto: skcipher - fix crash flushing dcache in error path
  crypto: skcipher - remove unnecessary setting of walk->nbytes
  crypto: scatterwalk - remove scatterwalk_samebuf()
  ...
2018-08-15 16:01:47 -07:00
Linus Torvalds
9a76aba02a Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next
Pull networking updates from David Miller:
 "Highlights:

   - Gustavo A. R. Silva keeps working on the implicit switch fallthru
     changes.

   - Support 802.11ax High-Efficiency wireless in cfg80211 et al, From
     Luca Coelho.

   - Re-enable ASPM in r8169, from Kai-Heng Feng.

   - Add virtual XFRM interfaces, which avoids all of the limitations of
     existing IPSEC tunnels. From Steffen Klassert.

   - Convert GRO over to use a hash table, so that when we have many
     flows active we don't traverse a long list during accumluation.

   - Many new self tests for routing, TC, tunnels, etc. Too many
     contributors to mention them all, but I'm really happy to keep
     seeing this stuff.

   - Hardware timestamping support for dpaa_eth/fsl-fman from Yangbo Lu.

   - Lots of cleanups and fixes in L2TP code from Guillaume Nault.

   - Add IPSEC offload support to netdevsim, from Shannon Nelson.

   - Add support for slotting with non-uniform distribution to netem
     packet scheduler, from Yousuk Seung.

   - Add UDP GSO support to mlx5e, from Boris Pismenny.

   - Support offloading of Team LAG in NFP, from John Hurley.

   - Allow to configure TX queue selection based upon RX queue, from
     Amritha Nambiar.

   - Support ethtool ring size configuration in aquantia, from Anton
     Mikaev.

   - Support DSCP and flowlabel per-transport in SCTP, from Xin Long.

   - Support list based batching and stack traversal of SKBs, this is
     very exciting work. From Edward Cree.

   - Busyloop optimizations in vhost_net, from Toshiaki Makita.

   - Introduce the ETF qdisc, which allows time based transmissions. IGB
     can offload this in hardware. From Vinicius Costa Gomes.

   - Add parameter support to devlink, from Moshe Shemesh.

   - Several multiplication and division optimizations for BPF JIT in
     nfp driver, from Jiong Wang.

   - Lots of prepatory work to make more of the packet scheduler layer
     lockless, when possible, from Vlad Buslov.

   - Add ACK filter and NAT awareness to sch_cake packet scheduler, from
     Toke Høiland-Jørgensen.

   - Support regions and region snapshots in devlink, from Alex Vesker.

   - Allow to attach XDP programs to both HW and SW at the same time on
     a given device, with initial support in nfp. From Jakub Kicinski.

   - Add TLS RX offload and support in mlx5, from Ilya Lesokhin.

   - Use PHYLIB in r8169 driver, from Heiner Kallweit.

   - All sorts of changes to support Spectrum 2 in mlxsw driver, from
     Ido Schimmel.

   - PTP support in mv88e6xxx DSA driver, from Andrew Lunn.

   - Make TCP_USER_TIMEOUT socket option more accurate, from Jon
     Maxwell.

   - Support for templates in packet scheduler classifier, from Jiri
     Pirko.

   - IPV6 support in RDS, from Ka-Cheong Poon.

   - Native tproxy support in nf_tables, from Máté Eckl.

   - Maintain IP fragment queue in an rbtree, but optimize properly for
     in-order frags. From Peter Oskolkov.

   - Improvde handling of ACKs on hole repairs, from Yuchung Cheng"

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1996 commits)
  bpf: test: fix spelling mistake "REUSEEPORT" -> "REUSEPORT"
  hv/netvsc: Fix NULL dereference at single queue mode fallback
  net: filter: mark expected switch fall-through
  xen-netfront: fix warn message as irq device name has '/'
  cxgb4: Add new T5 PCI device ids 0x50af and 0x50b0
  net: dsa: mv88e6xxx: missing unlock on error path
  rds: fix building with IPV6=m
  inet/connection_sock: prefer _THIS_IP_ to current_text_addr
  net: dsa: mv88e6xxx: bitwise vs logical bug
  net: sock_diag: Fix spectre v1 gadget in __sock_diag_cmd()
  ieee802154: hwsim: using right kind of iteration
  net: hns3: Add vlan filter setting by ethtool command -K
  net: hns3: Set tx ring' tc info when netdev is up
  net: hns3: Remove tx ring BD len register in hns3_enet
  net: hns3: Fix desc num set to default when setting channel
  net: hns3: Fix for phy link issue when using marvell phy driver
  net: hns3: Fix for information of phydev lost problem when down/up
  net: hns3: Fix for command format parsing error in hclge_is_all_function_id_zero
  net: hns3: Add support for serdes loopback selftest
  bnxt_en: take coredump_record structure off stack
  ...
2018-08-15 15:04:25 -07:00
Linus Torvalds
e026bcc561 Kbuild updates for v4.19
- verify depmod is installed before modules_install
 
 - support build salt in case build ids must be unique between builds
 
 - allow users to specify additional host compiler flags via HOST*FLAGS,
   and rename internal variables to KBUILD_HOST*FLAGS
 
 - update buildtar script to drop vax support, add arm64 support
 
 - update builddeb script for better debarch support
 
 - document the pit-fall of if_changed usage
 
 - fix parallel build of UML with O= option
 
 - make 'samples' target depend on headers_install to fix build errors
 
 - remove deprecated host-progs variable
 
 - add a new coccinelle script for refcount_t vs atomic_t check
 
 - improve double-test coccinelle script
 
 - misc cleanups and fixes
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJbdFZ0AAoJED2LAQed4NsGcHYP/23txxk3GRP7O4UkfPw9Rtky
 MHiXTgcoy2vbG+l12BgzWX+qFii8XTUe3dQtK4HnGQFUIBtEBV/hpZPJtxfgGSev
 Zou5cv1kr5rNzTkCn//TG3O6/WIkTBCe2hahDCtmGDI3kd/cPK4dHbU/q6KpaqIJ
 qzZYBXIvCeu2GM8idQoCRrwdMpgu1pBz1gz2sDje1yHH2toI7T6cXHRLQDgx+HPq
 LIP7W9GUsoDdXjecvPD51LiW89E6BUxETBh5Ft9r9uzwB5ylQQMcw6Qyu2DiYDUX
 PPsHCMiolYV+Ttcy+vj/67KOvKmEaFotssck+RD/xDCF17zKhRkup+YM8kPLHTVZ
 TcAUZadbnT6U/s2W6GFwvVbN/P7cc3aif+aNCC/Pl23yagp3pydlSCocYxQgiVR7
 /rx48haYDEgu/MJ1X0dOpSO0ErY7zu2OoAlNerW+D9QizwbP+WtZO/CJH8SxQRuN
 dQ1xmyNrie+ODgi9tbc4eBrsb+1rioX927TP5MbJcfXt5CTsxDmIqop5XwyYIoQN
 ZWWlzC8Ii3P2trAVpBgM2IEbngSxwr6T9Wbf1ScJnPKr/o1rq+pBk49cYstTz3kQ
 OwJ8gPwUrkW4R+hlD7L6mL/WcrKzZBQS0Ij1QW2kVSEhRrsKo99psE1/rGehnHu9
 KGB0LYYCqGSOHR4zOjg0
 =VjfG
 -----END PGP SIGNATURE-----

Merge tag 'kbuild-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild

Pull Kbuild updates from Masahiro Yamada:

 - verify depmod is installed before modules_install

 - support build salt in case build ids must be unique between builds

 - allow users to specify additional host compiler flags via HOST*FLAGS,
   and rename internal variables to KBUILD_HOST*FLAGS

 - update buildtar script to drop vax support, add arm64 support

 - update builddeb script for better debarch support

 - document the pit-fall of if_changed usage

 - fix parallel build of UML with O= option

 - make 'samples' target depend on headers_install to fix build errors

 - remove deprecated host-progs variable

 - add a new coccinelle script for refcount_t vs atomic_t check

 - improve double-test coccinelle script

 - misc cleanups and fixes

* tag 'kbuild-v4.19' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (41 commits)
  coccicheck: return proper error code on fail
  Coccinelle: doubletest: reduce side effect false positives
  kbuild: remove deprecated host-progs variable
  kbuild: make samples really depend on headers_install
  um: clean up archheaders recipe
  kbuild: add %asm-generic to no-dot-config-targets
  um: fix parallel building with O= option
  scripts: Add Python 3 support to tracing/draw_functrace.py
  builddeb: Add automatic support for sh{3,4}{,eb} architectures
  builddeb: Add automatic support for riscv* architectures
  builddeb: Add automatic support for m68k architecture
  builddeb: Add automatic support for or1k architecture
  builddeb: Add automatic support for sparc64 architecture
  builddeb: Add automatic support for mips{,64}r6{,el} architectures
  builddeb: Add automatic support for mips64el architecture
  builddeb: Add automatic support for ppc64 and powerpcspe architectures
  builddeb: Introduce functions to simplify kconfig tests in set_debarch
  builddeb: Drop check for 32-bit s390
  builddeb: Change architecture detection fallback to use dpkg-architecture
  builddeb: Skip architecture detection when KBUILD_DEBARCH is set
  ...
2018-08-15 12:09:03 -07:00
Linus Torvalds
8c32685030 audit/stable-4.18 PR 20180814
-----BEGIN PGP SIGNATURE-----
 
 iQJIBAABCAAyFiEEcQCq365ubpQNLgrWVeRaWujKfIoFAltzOI8UHHBhdWxAcGF1
 bC1tb29yZS5jb20ACgkQVeRaWujKfIqvOw/9F48G0G87ETN2M1O3fFLpANd2GtPH
 cYtg6SG8b3nZzkllUBOVZjoj7y0OyTgi+ksm5XSmgcR4npELDHWRhXaCma1s8Hxw
 Zpk2SWrEmwdudec9zZ8sBtvd6t63P9TkVSk3S4z+1HgH6uzPg6lMes4w/cQIdn0e
 uEIiuNkOOHXxWLxgphxPJ/XffQCfV6ltm9j41Z2S6RZHe/0UsJSKCwD+LwW+jLe3
 0e9oUPWVnxZlhpJoctWpgtaYmG8rtqxfXruFwg2mR5d2VBs966h3FCz9b9qhtA+R
 HPy/KagJuUFRL+6ZlpX1dtkkNi07LKjsx3OJYBNHNKmYDA5yJF3X/mMV4uMkroFz
 q8k88Wi1dYWKJvLhAGALRGSbiYljLgDFJH31tN2FWHb93+l3CXLhjmwIyUxrtiW1
 DfUJYM8/ZzMjghpb6YoHi1Dq+Z1cG07S1Eo0pTmtxVT2g43eXSL4OvxFhJKkCFv9
 n1J/fo6JVw9i+4DqWkk+8NFGXYYZFlfrGl14szTp+8Q9Gw6OlWWcnF/o3+KCtPez
 5z5+E64pAHhLa9/gaHu62eK4lIpiXv3IE8fA7OvFxrUpSJbtYc0RnGv8eKhO8J6v
 Q9TQKJGbbOLwP+mF/8sPexsREDaU3fB68dTj/FF5AHLH8KFUgAQYcm0bIzpNDBSd
 w+REW+TTiht8W6E=
 =tnPz
 -----END PGP SIGNATURE-----

Merge tag 'audit-pr-20180814' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit

Pull audit patches from Paul Moore:
 "Twelve audit patches for v4.19 and they run the full gamut from fixes
  to features.

  Notable changes include the ability to use the "exe" audit filter
  field in a wider variety of filter types, a fix for our comparison of
  GID/EGID in audit filter rules, better association of related audit
  records (connecting related audit records together into one audit
  event), and a fix for a potential use-after-free in audit_add_watch().

  All the patches pass the audit-testsuite and merge cleanly on your
  current master branch"

* tag 'audit-pr-20180814' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
  audit: fix use-after-free in audit_add_watch
  audit: use ktime_get_coarse_real_ts64() for timestamps
  audit: use ktime_get_coarse_ts64() for time access
  audit: simplify audit_enabled check in audit_watch_log_rule_change()
  audit: check audit_enabled in audit_tree_log_remove_rule()
  cred: conditionally declare groups-related functions
  audit: eliminate audit_enabled magic number comparison
  audit: rename FILTER_TYPE to FILTER_EXCLUDE
  audit: Fix extended comparison of GID/EGID
  audit: tie ANOM_ABEND records to syscall
  audit: tie SECCOMP records to syscall
  audit: allow other filter list types for AUDIT_EXE
2018-08-15 10:46:54 -07:00
Linus Torvalds
c07b3682cd LED updates for 4.19-rc1
-----BEGIN PGP SIGNATURE-----
 
 iJEEABYIADkWIQQUwxxKyE5l/npt8ARiEGxRG/Sl2wUCW3HajRscamFjZWsuYW5h
 c3pld3NraUBnbWFpbC5jb20ACgkQYhBsURv0pdvxAwEA+qS5O9ByxlhT+BUC4ck6
 nIy0ITOCXP8ySoo8VVhzjikBAPrb9lFYGvHqzKN4dYtnSILPmlTSf1t1flng2Zev
 NfoE
 =lNwq
 -----END PGP SIGNATURE-----

Merge tag 'leds-for-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds

Pull LED updates from Jacek Anaszewski:
 "LED triggers improvements make the biggest part of this pull request.
  The most striking ones, that allowed for nice cleanups in the triggers
  are:

   - centralized handling of creation and removal of trigger sysfs
     attributes via attribute group

   - addition of module_led_trigger() helper

  The other things that need to be mentioned:

  New features and improvements to existing LED class drivers:

   - lt3593: add DT support, switch to gpiod interface

   - lm3692x: support LED sync configuration, change OF calls to fwnode
     calls

   - apu: modify PC Engines apu/apu2 driver to support apu3

  Change in the drivers/net/can/led.c:

   - mark led trigger as broken since it's in the way for the further
     cleanups. It implements a subset of the netdev trigger and an Ack
     is needed from someone who can actually test and confirm that the
     netdev trigger works for can devices"

* tag 'leds-for-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/j.anaszewski/linux-leds: (32 commits)
  leds: ns2: Change unsigned to unsigned int
  usb: simplify usbport trigger
  leds: gpio trigger: simplifications from core changes
  leds: backlight trigger: simplifications from core changes
  leds: activity trigger: simplifications from core changes
  leds: default-on trigger: make use of module_led_trigger()
  leds: heartbeat trigger: simplifications from core changes
  leds: oneshot trigger: simplifications from core changes
  leds: transient trigger: simplifications from core changes
  leds: timer trigger: simplifications from core changes
  leds: netdev trigger: simplifications from core changes
  leds: triggers: new function led_set_trigger_data()
  leds: triggers: define module_led_trigger helper
  leds: triggers: handle .trigger_data and .activated() in the core
  leds: triggers: add device attribute support
  leds: triggers: let struct led_trigger::activate() return an error code
  leds: triggers: make the MODULE_LICENSE string match the actual license
  leds: lm3692x: Support LED sync configuration
  dt: bindings: lm3692x: Update binding for LED sync control
  leds: lm3692x: Change DT calls to fwnode calls
  ...
2018-08-14 13:07:22 -07:00
Gustavo A. R. Silva
4597b62f7a net: filter: mark expected switch fall-through
In preparation to enabling -Wimplicit-fallthrough, mark switch cases
where we are expecting to fall through.

Addresses-Coverity-ID: 1472592 ("Missing break in switch")
Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-14 10:11:26 -07:00
Arnd Bergmann
f6f7a26abd rds: fix building with IPV6=m
When CONFIG_RDS_TCP is built-in and CONFIG_IPV6 is a loadable
module, we get a link error agains the modular ipv6_chk_addr()
function:

net/rds/tcp.o: In function `rds_tcp_laddr_check':
tcp.c:(.text+0x3b2): undefined reference to `ipv6_chk_addr'

This adds back a dependency that forces RDS_TCP to also be
a loadable module when IPV6 is one.

Fixes: e65d4d9633 ("rds: Remove IPv6 dependency")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-14 10:06:37 -07:00
Jeremy Cline
66b51b0a03 net: sock_diag: Fix spectre v1 gadget in __sock_diag_cmd()
req->sdiag_family is a user-controlled value that's used as an array
index. Sanitize it after the bounds check to avoid speculative
out-of-bounds array access.

This also protects the sock_is_registered() call, so this removes the
sanitize call there.

Fixes: e978de7a6d ("net: socket: Fix potential spectre v1 gadget in sock_is_registered")
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: konrad.wilk@oracle.com
Cc: jamie.iles@oracle.com
Cc: liran.alon@oracle.com
Cc: stable@vger.kernel.org
Signed-off-by: Jeremy Cline <jcline@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-14 10:01:24 -07:00
Toke Høiland-Jørgensen
77cfaf52ec mac80211: Run TXQ teardown code before de-registering interfaces
The TXQ teardown code can reference the vif data structures that are
stored in the netdev private memory area if there are still packets on
the queue when it is being freed. Since the TXQ teardown code is run
after the netdevs are freed, this can lead to a use-after-free. Fix this
by moving the TXQ teardown code to earlier in ieee80211_unregister_hw().

Reported-by: Ben Greear <greearb@candelatech.com>
Tested-by: Ben Greear <greearb@candelatech.com>
Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-14 14:00:07 +02:00
Arnd Bergmann
f623f75ae4 rfkill-gpio: include linux/mod_devicetable.h
One more driver is apparently broken by the recent change
to linux/platform_device.h:

net/rfkill/rfkill-gpio.c: In function 'rfkill_gpio_acpi_probe':
net/rfkill/rfkill-gpio.c:82:29: error: dereferencing pointer to incomplete type 'const struct acpi_device_id'

Include linux/mod_devicetable.h to get the definition of the
acpi_device_id structure.

Fixes: ac3167257b ("headers: separate linux/mod_devicetable.h from linux/platform_device.h")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2018-08-14 13:59:28 +02:00
Arnd Bergmann
c2ebc25674 l2tp: fix unused function warning
Removing one of the callers of pppol2tp_session_get_sock caused a harmless
warning in some configurations:

net/l2tp/l2tp_ppp.c:142:21: 'pppol2tp_session_get_sock' defined but not used [-Wunused-function]

Rather than adding another #ifdef here, using a proper IS_ENABLED()
check makes the code more readable and avoids those warnings while
letting the compiler figure out for itself which code is needed.

This adds one pointer for the unused show() callback in struct
l2tp_session, but that seems harmless.

Fixes: b0e29063dc ("l2tp: remove pppol2tp_session_ioctl()")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 20:45:49 -07:00
Linus Torvalds
a66b4cd1e7 Merge branch 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs open-related updates from Al Viro:

 - "do we need fput() or put_filp()" rules are gone - it's always fput()
   now. We keep track of that state where it belongs - in ->f_mode.

 - int *opened mess killed - in finish_open(), in ->atomic_open()
   instances and in fs/namei.c code around do_last()/lookup_open()/atomic_open().

 - alloc_file() wrappers with saner calling conventions are introduced
   (alloc_file_clone() and alloc_file_pseudo()); callers converted, with
   much simplification.

 - while we are at it, saner calling conventions for path_init() and
   link_path_walk(), simplifying things inside fs/namei.c (both on
   open-related paths and elsewhere).

* 'work.open3' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (40 commits)
  few more cleanups of link_path_walk() callers
  allow link_path_walk() to take ERR_PTR()
  make path_init() unconditionally paired with terminate_walk()
  document alloc_file() changes
  make alloc_file() static
  do_shmat(): grab shp->shm_file earlier, switch to alloc_file_clone()
  new helper: alloc_file_clone()
  create_pipe_files(): switch the first allocation to alloc_file_pseudo()
  anon_inode_getfile(): switch to alloc_file_pseudo()
  hugetlb_file_setup(): switch to alloc_file_pseudo()
  ocxlflash_getfile(): switch to alloc_file_pseudo()
  cxl_getfile(): switch to alloc_file_pseudo()
  ... and switch shmem_file_setup() to alloc_file_pseudo()
  __shmem_file_setup(): reorder allocations
  new wrapper: alloc_file_pseudo()
  kill FILE_{CREATED,OPENED}
  switch atomic_open() and lookup_open() to returning 0 in all success cases
  document ->atomic_open() changes
  ->atomic_open(): return 0 in all success cases
  get rid of 'opened' in path_openat() and the helpers downstream
  ...
2018-08-13 19:58:36 -07:00
Hangbin Liu
008369dcc5 net_sched: Fix missing res info when create new tc_index filter
Li Shuang reported the following warn:

[  733.484610] WARNING: CPU: 6 PID: 21123 at net/sched/sch_cbq.c:1418 cbq_destroy_class+0x5d/0x70 [sch_cbq]
[  733.495190] Modules linked in: sch_cbq cls_tcindex sch_dsmark rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat l
[  733.574155]  syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm igb ixgbe ahci libahci i2c_algo_bit libata i40e i2c_core dca mdio megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[  733.592500] CPU: 6 PID: 21123 Comm: tc Not tainted 4.18.0-rc8.latest+ #131
[  733.600169] Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.1.5 04/11/2016
[  733.608518] RIP: 0010:cbq_destroy_class+0x5d/0x70 [sch_cbq]
[  733.614734] Code: e7 d9 d2 48 8b 7b 48 e8 61 05 da d2 48 8d bb f8 00 00 00 e8 75 ae d5 d2 48 39 eb 74 0a 48 89 df 5b 5d e9 16 6c 94 d2 5b 5d c3 <0f> 0b eb b6 0f 1f 44 00 00 66 2e 0f 1f 84
[  733.635798] RSP: 0018:ffffbfbb066bb9d8 EFLAGS: 00010202
[  733.641627] RAX: 0000000000000001 RBX: ffff9cdd17392800 RCX: 000000008010000f
[  733.649588] RDX: ffff9cdd1df547e0 RSI: ffff9cdd17392800 RDI: ffff9cdd0f84c800
[  733.657547] RBP: ffff9cdd0f84c800 R08: 0000000000000001 R09: 0000000000000000
[  733.665508] R10: ffff9cdd0f84d000 R11: 0000000000000001 R12: 0000000000000001
[  733.673469] R13: 0000000000000000 R14: 0000000000000001 R15: ffff9cdd17392200
[  733.681430] FS:  00007f911890a740(0000) GS:ffff9cdd1f8c0000(0000) knlGS:0000000000000000
[  733.690456] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  733.696864] CR2: 0000000000b5544c CR3: 0000000859374002 CR4: 00000000001606e0
[  733.704826] Call Trace:
[  733.707554]  cbq_destroy+0xa1/0xd0 [sch_cbq]
[  733.712318]  qdisc_destroy+0x62/0x130
[  733.716401]  dsmark_destroy+0x2a/0x70 [sch_dsmark]
[  733.721745]  qdisc_destroy+0x62/0x130
[  733.725829]  qdisc_graft+0x3ba/0x470
[  733.729817]  tc_get_qdisc+0x2a6/0x2c0
[  733.733901]  ? cred_has_capability+0x7d/0x130
[  733.738761]  rtnetlink_rcv_msg+0x263/0x2d0
[  733.743330]  ? rtnl_calcit.isra.30+0x110/0x110
[  733.748287]  netlink_rcv_skb+0x4d/0x130
[  733.752576]  netlink_unicast+0x1a3/0x250
[  733.756949]  netlink_sendmsg+0x2ae/0x3a0
[  733.761324]  sock_sendmsg+0x36/0x40
[  733.765213]  ___sys_sendmsg+0x26f/0x2d0
[  733.769493]  ? handle_pte_fault+0x586/0xdf0
[  733.774158]  ? __handle_mm_fault+0x389/0x500
[  733.778919]  ? __sys_sendmsg+0x5e/0xa0
[  733.783099]  __sys_sendmsg+0x5e/0xa0
[  733.787087]  do_syscall_64+0x5b/0x180
[  733.791171]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  733.796805] RIP: 0033:0x7f9117f23f10
[  733.800791] Code: c3 48 8b 05 82 6f 2c 00 f7 db 64 89 18 48 83 cb ff eb dd 0f 1f 80 00 00 00 00 83 3d 8d d0 2c 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8
[  733.821873] RSP: 002b:00007ffe96818398 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[  733.830319] RAX: ffffffffffffffda RBX: 000000005b71244c RCX: 00007f9117f23f10
[  733.838280] RDX: 0000000000000000 RSI: 00007ffe968183e0 RDI: 0000000000000003
[  733.846241] RBP: 00007ffe968183e0 R08: 000000000000ffff R09: 0000000000000003
[  733.854202] R10: 00007ffe96817e20 R11: 0000000000000246 R12: 0000000000000000
[  733.862161] R13: 0000000000662ee0 R14: 0000000000000000 R15: 0000000000000000
[  733.870121] ---[ end trace 28edd4aad712ddca ]---

This is because we didn't update f->result.res when create new filter. Then in
tcindex_delete() -> tcf_unbind_filter(), we will failed to find out the res
and unbind filter, which will trigger the WARN_ON() in cbq_destroy_class().

Fix it by updating f->result.res when create new filter.

Fixes: 6e0565697a ("net_sched: fix another crash in cls_tcindex")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 19:37:42 -07:00
Hangbin Liu
2df8bee565 net_sched: fix NULL pointer dereference when delete tcindex filter
Li Shuang reported the following crash:

[   71.267724] BUG: unable to handle kernel NULL pointer dereference at 0000000000000004
[   71.276456] PGD 800000085d9bd067 P4D 800000085d9bd067 PUD 859a0b067 PMD 0
[   71.284127] Oops: 0000 [#1] SMP PTI
[   71.288015] CPU: 12 PID: 2386 Comm: tc Not tainted 4.18.0-rc8.latest+ #131
[   71.295686] Hardware name: Dell Inc. PowerEdge R730/0WCJNT, BIOS 2.1.5 04/11/2016
[   71.304037] RIP: 0010:tcindex_delete+0x72/0x280 [cls_tcindex]
[   71.310446] Code: 00 31 f6 48 87 75 20 48 85 f6 74 11 48 8b 47 18 48 8b 40 08 48 8b 40 50 e8 fb a6 f8 fc 48 85 db 0f 84 dc 00 00 00 48 8b 73 18 <8b> 56 04 48 8d 7e 04 85 d2 0f 84 7b 01 00
[   71.331517] RSP: 0018:ffffb45207b3f898 EFLAGS: 00010282
[   71.337345] RAX: ffff8ad3d72d6360 RBX: ffff8acc84393680 RCX: 000000000000002e
[   71.345306] RDX: ffff8ad3d72c8570 RSI: 0000000000000000 RDI: ffff8ad847a45800
[   71.353277] RBP: ffff8acc84393688 R08: ffff8ad3d72c8400 R09: 0000000000000000
[   71.361238] R10: ffff8ad3de786e00 R11: 0000000000000000 R12: ffffb45207b3f8c7
[   71.369199] R13: ffff8ad3d93bd2a0 R14: 000000000000002e R15: ffff8ad3d72c9600
[   71.377161] FS:  00007f9d3ec3e740(0000) GS:ffff8ad3df980000(0000) knlGS:0000000000000000
[   71.386188] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   71.392597] CR2: 0000000000000004 CR3: 0000000852f06003 CR4: 00000000001606e0
[   71.400558] Call Trace:
[   71.403299]  tcindex_destroy_element+0x25/0x40 [cls_tcindex]
[   71.409611]  tcindex_walk+0xbb/0x110 [cls_tcindex]
[   71.414953]  tcindex_destroy+0x44/0x90 [cls_tcindex]
[   71.420492]  ? tcindex_delete+0x280/0x280 [cls_tcindex]
[   71.426323]  tcf_proto_destroy+0x16/0x40
[   71.430696]  tcf_chain_flush+0x51/0x70
[   71.434876]  tcf_block_put_ext.part.30+0x8f/0x1b0
[   71.440122]  tcf_block_put+0x4d/0x70
[   71.444108]  cbq_destroy+0x4d/0xd0 [sch_cbq]
[   71.448869]  qdisc_destroy+0x62/0x130
[   71.452951]  dsmark_destroy+0x2a/0x70 [sch_dsmark]
[   71.458300]  qdisc_destroy+0x62/0x130
[   71.462373]  qdisc_graft+0x3ba/0x470
[   71.466359]  tc_get_qdisc+0x2a6/0x2c0
[   71.470443]  ? cred_has_capability+0x7d/0x130
[   71.475307]  rtnetlink_rcv_msg+0x263/0x2d0
[   71.479875]  ? rtnl_calcit.isra.30+0x110/0x110
[   71.484832]  netlink_rcv_skb+0x4d/0x130
[   71.489109]  netlink_unicast+0x1a3/0x250
[   71.493482]  netlink_sendmsg+0x2ae/0x3a0
[   71.497859]  sock_sendmsg+0x36/0x40
[   71.501748]  ___sys_sendmsg+0x26f/0x2d0
[   71.506029]  ? handle_pte_fault+0x586/0xdf0
[   71.510694]  ? __handle_mm_fault+0x389/0x500
[   71.515457]  ? __sys_sendmsg+0x5e/0xa0
[   71.519636]  __sys_sendmsg+0x5e/0xa0
[   71.523626]  do_syscall_64+0x5b/0x180
[   71.527711]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   71.533345] RIP: 0033:0x7f9d3e257f10
[   71.537331] Code: c3 48 8b 05 82 6f 2c 00 f7 db 64 89 18 48 83 cb ff eb dd 0f 1f 80 00 00 00 00 83 3d 8d d0 2c 00 00 75 10 b8 2e 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 31 c3 48 83 ec 08 e8
[   71.558401] RSP: 002b:00007fff6f893398 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[   71.566848] RAX: ffffffffffffffda RBX: 000000005b71274d RCX: 00007f9d3e257f10
[   71.574810] RDX: 0000000000000000 RSI: 00007fff6f8933e0 RDI: 0000000000000003
[   71.582770] RBP: 00007fff6f8933e0 R08: 000000000000ffff R09: 0000000000000003
[   71.590729] R10: 00007fff6f892e20 R11: 0000000000000246 R12: 0000000000000000
[   71.598689] R13: 0000000000662ee0 R14: 0000000000000000 R15: 0000000000000000
[   71.606651] Modules linked in: sch_cbq cls_tcindex sch_dsmark xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_coni
[   71.685425]  libahci i2c_algo_bit i2c_core i40e libata dca mdio megaraid_sas dm_mirror dm_region_hash dm_log dm_mod
[   71.697075] CR2: 0000000000000004
[   71.700792] ---[ end trace f604eb1acacd978b ]---

Reproducer:
tc qdisc add dev lo handle 1:0 root dsmark indices 64 set_tc_index
tc filter add dev lo parent 1:0 protocol ip prio 1 tcindex mask 0xfc shift 2
tc qdisc add dev lo parent 1:0 handle 2:0 cbq bandwidth 10Mbit cell 8 avpkt 1000 mpu 64
tc class add dev lo parent 2:0 classid 2:1 cbq bandwidth 10Mbit rate 1500Kbit avpkt 1000 prio 1 bounded isolated allot 1514 weight 1 maxburst 10
tc filter add dev lo parent 2:0 protocol ip prio 1 handle 0x2e tcindex classid 2:1 pass_on
tc qdisc add dev lo parent 2:1 pfifo limit 5
tc qdisc del dev lo root

This is because in tcindex_set_parms, when there is no old_r, we set new
exts to cr.exts. And we didn't set it to filter when r == &new_filter_result.

Then in tcindex_delete() -> tcf_exts_get_net(), we will get NULL pointer
dereference as we didn't init exts.

Fix it by moving tcf_exts_change() after "if (old_r && old_r != r)" check.
Then we don't need "cr" as there is no errout after that.

Fixes: bf63ac73b3 ("net_sched: fix an oops in tcindex filter")
Reported-by: Li Shuang <shuali@redhat.com>
Signed-off-by: Hangbin Liu <liuhangbin@gmail.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 19:37:42 -07:00
Linus Torvalds
de5d1b39ea Merge branch 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull locking/atomics update from Thomas Gleixner:
 "The locking, atomics and memory model brains delivered:

   - A larger update to the atomics code which reworks the ordering
     barriers, consolidates the atomic primitives, provides the new
     atomic64_fetch_add_unless() primitive and cleans up the include
     hell.

   - Simplify cmpxchg() instrumentation and add instrumentation for
     xchg() and cmpxchg_double().

   - Updates to the memory model and documentation"

* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (48 commits)
  locking/atomics: Rework ordering barriers
  locking/atomics: Instrument cmpxchg_double*()
  locking/atomics: Instrument xchg()
  locking/atomics: Simplify cmpxchg() instrumentation
  locking/atomics/x86: Reduce arch_cmpxchg64*() instrumentation
  tools/memory-model: Rename litmus tests to comply to norm7
  tools/memory-model/Documentation: Fix typo, smb->smp
  sched/Documentation: Update wake_up() & co. memory-barrier guarantees
  locking/spinlock, sched/core: Clarify requirements for smp_mb__after_spinlock()
  sched/core: Use smp_mb() in wake_woken_function()
  tools/memory-model: Add informal LKMM documentation to MAINTAINERS
  locking/atomics/Documentation: Describe atomic_set() as a write operation
  tools/memory-model: Make scripts executable
  tools/memory-model: Remove ACCESS_ONCE() from model
  tools/memory-model: Remove ACCESS_ONCE() from recipes
  locking/memory-barriers.txt/kokr: Update Korean translation to fix broken DMA vs. MMIO ordering example
  MAINTAINERS: Add Daniel Lustig as an LKMM reviewer
  tools/memory-model: Fix ISA2+pooncelock+pooncelock+pombonce name
  tools/memory-model: Add litmus test for full multicopy atomicity
  locking/refcount: Always allow checked forms
  ...
2018-08-13 12:23:39 -07:00
Vlad Buslov
42c625a486 net: sched: act_ife: disable bh when taking ife_mod_lock
Lockdep reports deadlock for following locking scenario in ife action:

Task one:
1) Executes ife action update.
2) Takes tcfa_lock.
3) Waits on ife_mod_lock which is already taken by task two.

Task two:

1) Executes any path that obtains ife_mod_lock without disabling bh (any
path that takes ife_mod_lock while holding tcfa_lock has bh disabled) like
loading a meta module, or creating new action.
2) Takes ife_mod_lock.
3) Task is preempted by rate estimator timer.
4) Timer callback waits on tcfa_lock which is taken by task one.

In described case tasks deadlock because they take same two locks in
different order. To prevent potential deadlock reported by lockdep, always
disable bh when obtaining ife_mod_lock.

Lockdep warning:

[  508.101192] =====================================================
[  508.107708] WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
[  508.114728] 4.18.0-rc8+ #646 Not tainted
[  508.119050] -----------------------------------------------------
[  508.125559] tc/5460 [HC0[0]:SC0[2]:HE1:SE0] is trying to acquire:
[  508.132025] 000000005a938c68 (ife_mod_lock){++++}, at: find_ife_oplist+0x1e/0xc0 [act_ife]
[  508.140996]
               and this task is already holding:
[  508.147548] 00000000d46f6c56 (&(&p->tcfa_lock)->rlock){+.-.}, at: tcf_ife_init+0x6ae/0xf40 [act_ife]
[  508.157371] which would create a new lock dependency:
[  508.162828]  (&(&p->tcfa_lock)->rlock){+.-.} -> (ife_mod_lock){++++}
[  508.169572]
               but this new dependency connects a SOFTIRQ-irq-safe lock:
[  508.178197]  (&(&p->tcfa_lock)->rlock){+.-.}
[  508.178201]
               ... which became SOFTIRQ-irq-safe at:
[  508.189771]   _raw_spin_lock+0x2c/0x40
[  508.193906]   est_fetch_counters+0x41/0xb0
[  508.198391]   est_timer+0x83/0x3c0
[  508.202180]   call_timer_fn+0x16a/0x5d0
[  508.206400]   run_timer_softirq+0x399/0x920
[  508.210967]   __do_softirq+0x157/0x97d
[  508.215102]   irq_exit+0x152/0x1c0
[  508.218888]   smp_apic_timer_interrupt+0xc0/0x4e0
[  508.223976]   apic_timer_interrupt+0xf/0x20
[  508.228540]   cpuidle_enter_state+0xf8/0x5d0
[  508.233198]   do_idle+0x28a/0x350
[  508.236881]   cpu_startup_entry+0xc7/0xe0
[  508.241296]   start_secondary+0x2e8/0x3f0
[  508.245678]   secondary_startup_64+0xa5/0xb0
[  508.250347]
               to a SOFTIRQ-irq-unsafe lock:  (ife_mod_lock){++++}
[  508.256531]
               ... which became SOFTIRQ-irq-unsafe at:
[  508.267279] ...
[  508.267283]   _raw_write_lock+0x2c/0x40
[  508.273653]   register_ife_op+0x118/0x2c0 [act_ife]
[  508.278926]   do_one_initcall+0xf7/0x4d9
[  508.283214]   do_init_module+0x18b/0x44e
[  508.287521]   load_module+0x4167/0x5730
[  508.291739]   __do_sys_finit_module+0x16d/0x1a0
[  508.296654]   do_syscall_64+0x7a/0x3f0
[  508.300788]   entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  508.306302]
               other info that might help us debug this:

[  508.315286]  Possible interrupt unsafe locking scenario:

[  508.322771]        CPU0                    CPU1
[  508.327681]        ----                    ----
[  508.332604]   lock(ife_mod_lock);
[  508.336300]                                local_irq_disable();
[  508.342608]                                lock(&(&p->tcfa_lock)->rlock);
[  508.349793]                                lock(ife_mod_lock);
[  508.355990]   <Interrupt>
[  508.358974]     lock(&(&p->tcfa_lock)->rlock);
[  508.363803]
                *** DEADLOCK ***

[  508.370715] 2 locks held by tc/5460:
[  508.374680]  #0: 00000000e27e4fa4 (rtnl_mutex){+.+.}, at: rtnetlink_rcv_msg+0x583/0x7b0
[  508.383366]  #1: 00000000d46f6c56 (&(&p->tcfa_lock)->rlock){+.-.}, at: tcf_ife_init+0x6ae/0xf40 [act_ife]
[  508.393648]
               the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
[  508.403505] -> (&(&p->tcfa_lock)->rlock){+.-.} ops: 1001553 {
[  508.409646]    HARDIRQ-ON-W at:
[  508.413136]                     _raw_spin_lock_bh+0x34/0x40
[  508.419059]                     gnet_stats_start_copy_compat+0xa2/0x230
[  508.426021]                     gnet_stats_start_copy+0x16/0x20
[  508.432333]                     tcf_action_copy_stats+0x95/0x1d0
[  508.438735]                     tcf_action_dump_1+0xb0/0x4e0
[  508.444795]                     tcf_action_dump+0xca/0x200
[  508.450673]                     tcf_exts_dump+0xd9/0x320
[  508.456392]                     fl_dump+0x1b7/0x4a0 [cls_flower]
[  508.462798]                     tcf_fill_node+0x380/0x530
[  508.468601]                     tfilter_notify+0xdf/0x1c0
[  508.474404]                     tc_new_tfilter+0x84a/0xc90
[  508.480270]                     rtnetlink_rcv_msg+0x5bd/0x7b0
[  508.486419]                     netlink_rcv_skb+0x184/0x220
[  508.492394]                     netlink_unicast+0x31b/0x460
[  508.507411]                     netlink_sendmsg+0x3fb/0x840
[  508.513390]                     sock_sendmsg+0x7b/0xd0
[  508.518907]                     ___sys_sendmsg+0x4c6/0x610
[  508.524797]                     __sys_sendmsg+0xd7/0x150
[  508.530510]                     do_syscall_64+0x7a/0x3f0
[  508.536201]                     entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  508.543301]    IN-SOFTIRQ-W at:
[  508.546834]                     _raw_spin_lock+0x2c/0x40
[  508.552522]                     est_fetch_counters+0x41/0xb0
[  508.558571]                     est_timer+0x83/0x3c0
[  508.563912]                     call_timer_fn+0x16a/0x5d0
[  508.569699]                     run_timer_softirq+0x399/0x920
[  508.575840]                     __do_softirq+0x157/0x97d
[  508.581538]                     irq_exit+0x152/0x1c0
[  508.586882]                     smp_apic_timer_interrupt+0xc0/0x4e0
[  508.593533]                     apic_timer_interrupt+0xf/0x20
[  508.599686]                     cpuidle_enter_state+0xf8/0x5d0
[  508.605895]                     do_idle+0x28a/0x350
[  508.611147]                     cpu_startup_entry+0xc7/0xe0
[  508.617097]                     start_secondary+0x2e8/0x3f0
[  508.623029]                     secondary_startup_64+0xa5/0xb0
[  508.629245]    INITIAL USE at:
[  508.632686]                    _raw_spin_lock_bh+0x34/0x40
[  508.638557]                    gnet_stats_start_copy_compat+0xa2/0x230
[  508.645491]                    gnet_stats_start_copy+0x16/0x20
[  508.651719]                    tcf_action_copy_stats+0x95/0x1d0
[  508.657992]                    tcf_action_dump_1+0xb0/0x4e0
[  508.663937]                    tcf_action_dump+0xca/0x200
[  508.669716]                    tcf_exts_dump+0xd9/0x320
[  508.675337]                    fl_dump+0x1b7/0x4a0 [cls_flower]
[  508.681650]                    tcf_fill_node+0x380/0x530
[  508.687366]                    tfilter_notify+0xdf/0x1c0
[  508.693031]                    tc_new_tfilter+0x84a/0xc90
[  508.698820]                    rtnetlink_rcv_msg+0x5bd/0x7b0
[  508.704869]                    netlink_rcv_skb+0x184/0x220
[  508.710758]                    netlink_unicast+0x31b/0x460
[  508.716627]                    netlink_sendmsg+0x3fb/0x840
[  508.722510]                    sock_sendmsg+0x7b/0xd0
[  508.727931]                    ___sys_sendmsg+0x4c6/0x610
[  508.733729]                    __sys_sendmsg+0xd7/0x150
[  508.739346]                    do_syscall_64	+0x7a/0x3f0
[  508.744943]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  508.751930]  }
[  508.753964]  ... key      at: [<ffffffff916b3e20>] __key.61145+0x0/0x40
[  508.760946]  ... acquired at:
[  508.764294]    _raw_read_lock+0x2f/0x40
[  508.768513]    find_ife_oplist+0x1e/0xc0 [act_ife]
[  508.773692]    tcf_ife_init+0x82f/0xf40 [act_ife]
[  508.778785]    tcf_action_init_1+0x510/0x750
[  508.783468]    tcf_action_init+0x1e8/0x340
[  508.787938]    tcf_action_add+0xc5/0x240
[  508.792241]    tc_ctl_action+0x203/0x2a0
[  508.796550]    rtnetlink_rcv_msg+0x5bd/0x7b0
[  508.801200]    netlink_rcv_skb+0x184/0x220
[  508.805674]    netlink_unicast+0x31b/0x460
[  508.810129]    netlink_sendmsg+0x3fb/0x840
[  508.814611]    sock_sendmsg+0x7b/0xd0
[  508.818665]    ___sys_sendmsg+0x4c6/0x610
[  508.823029]    __sys_sendmsg+0xd7/0x150
[  508.827246]    do_syscall_64+0x7a/0x3f0
[  508.831483]    entry_SYSCALL_64_after_hwframe+0x49/0xbe

               the dependencies between the lock to be acquired
[  508.838945]  and SOFTIRQ-irq-unsafe lock:
[  508.851177] -> (ife_mod_lock){++++} ops: 95 {
[  508.855920]    HARDIRQ-ON-W at:
[  508.859478]                     _raw_write_lock+0x2c/0x40
[  508.865264]                     register_ife_op+0x118/0x2c0 [act_ife]
[  508.872071]                     do_one_initcall+0xf7/0x4d9
[  508.877947]                     do_init_module+0x18b/0x44e
[  508.883819]                     load_module+0x4167/0x5730
[  508.889595]                     __do_sys_finit_module+0x16d/0x1a0
[  508.896043]                     do_syscall_64+0x7a/0x3f0
[  508.901734]                     entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  508.908827]    HARDIRQ-ON-R at:
[  508.912359]                     _raw_read_lock+0x2f/0x40
[  508.918043]                     find_ife_oplist+0x1e/0xc0 [act_ife]
[  508.924692]                     tcf_ife_init+0x82f/0xf40 [act_ife]
[  508.931252]                     tcf_action_init_1+0x510/0x750
[  508.937393]                     tcf_action_init+0x1e8/0x340
[  508.943366]                     tcf_action_add+0xc5/0x240
[  508.949130]                     tc_ctl_action+0x203/0x2a0
[  508.954922]                     rtnetlink_rcv_msg+0x5bd/0x7b0
[  508.961024]                     netlink_rcv_skb+0x184/0x220
[  508.966970]                     netlink_unicast+0x31b/0x460
[  508.972915]                     netlink_sendmsg+0x3fb/0x840
[  508.978859]                     sock_sendmsg+0x7b/0xd0
[  508.984400]                     ___sys_sendmsg+0x4c6/0x610
[  508.990264]                     __sys_sendmsg+0xd7/0x150
[  508.995952]                     do_syscall_64+0x7a/0x3f0
[  509.001643]                     entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  509.008722]    SOFTIRQ-ON-W at:\
[  509.012242]                     _raw_write_lock+0x2c/0x40
[  509.018013]                     register_ife_op+0x118/0x2c0 [act_ife]
[  509.024841]                     do_one_initcall+0xf7/0x4d9
[  509.030720]                     do_init_module+0x18b/0x44e
[  509.036604]                     load_module+0x4167/0x5730
[  509.042397]                     __do_sys_finit_module+0x16d/0x1a0
[  509.048865]                     do_syscall_64+0x7a/0x3f0
[  509.054551]                     entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  509.061636]    SOFTIRQ-ON-R at:
[  509.065145]                     _raw_read_lock+0x2f/0x40
[  509.070854]                     find_ife_oplist+0x1e/0xc0 [act_ife]
[  509.077515]                     tcf_ife_init+0x82f/0xf40 [act_ife]
[  509.084051]                     tcf_action_init_1+0x510/0x750
[  509.090172]                     tcf_action_init+0x1e8/0x340
[  509.096124]                     tcf_action_add+0xc5/0x240
[  509.101891]                     tc_ctl_action+0x203/0x2a0
[  509.107671]                     rtnetlink_rcv_msg+0x5bd/0x7b0
[  509.113811]                     netlink_rcv_skb+0x184/0x220
[  509.119768]                     netlink_unicast+0x31b/0x460
[  509.125716]                     netlink_sendmsg+0x3fb/0x840
[  509.131668]                     sock_sendmsg+0x7b/0xd0
[  509.137167]                     ___sys_sendmsg+0x4c6/0x610
[  509.143010]                     __sys_sendmsg+0xd7/0x150
[  509.148718]                     do_syscall_64+0x7a/0x3f0
[  509.154443]                     entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  509.161533]    INITIAL USE at:
[  509.164956]                    _raw_read_lock+0x2f/0x40
[  509.170574]                    find_ife_oplist+0x1e/0xc0 [act_ife]
[  509.177134]                    tcf_ife_init+0x82f/0xf40 [act_ife]
[  509.183619]                    tcf_action_init_1+0x510/0x750
[  509.189674]                    tcf_action_init+0x1e8/0x340
[  509.195534]                    tcf_action_add+0xc5/0x240
[  509.201229]                    tc_ctl_action+0x203/0x2a0
[  509.206920]                    rtnetlink_rcv_msg+0x5bd/0x7b0
[  509.212936]                    netlink_rcv_skb+0x184/0x220
[  509.218818]                    netlink_unicast+0x31b/0x460
[  509.224699]                    netlink_sendmsg+0x3fb/0x840
[  509.230581]                    sock_sendmsg+0x7b/0xd0
[  509.235984]                    ___sys_sendmsg+0x4c6/0x610
[  509.241791]                    __sys_sendmsg+0xd7/0x150
[  509.247425]                    do_syscall_64+0x7a/0x3f0
[  509.253007]                    entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  509.259975]  }
[  509.261998]  ... key      at: [<ffffffffc1554258>] ife_mod_lock+0x18/0xffffffffffff8dc0 [act_ife]
[  509.271569]  ... acquired at:
[  509.274912]    _raw_read_lock+0x2f/0x40
[  509.279134]    find_ife_oplist+0x1e/0xc0 [act_ife]
[  509.284324]    tcf_ife_init+0x82f/0xf40 [act_ife]
[  509.289425]    tcf_action_init_1+0x510/0x750
[  509.294068]    tcf_action_init+0x1e8/0x340
[  509.298553]    tcf_action_add+0xc5/0x240
[  509.302854]    tc_ctl_action+0x203/0x2a0
[  509.307153]    rtnetlink_rcv_msg+0x5bd/0x7b0
[  509.311805]    netlink_rcv_skb+0x184/0x220
[  509.316282]    netlink_unicast+0x31b/0x460
[  509.320769]    netlink_sendmsg+0x3fb/0x840
[  509.325248]    sock_sendmsg+0x7b/0xd0
[  509.329290]    ___sys_sendmsg+0x4c6/0x610
[  509.333687]    __sys_sendmsg+0xd7/0x150
[  509.337902]    do_syscall_64+0x7a/0x3f0
[  509.342116]    entry_SYSCALL_64_after_hwframe+0x49/0xbe
[  509.349601]
               stack backtrace:
[  509.354663] CPU: 6 PID: 5460 Comm: tc Not tainted 4.18.0-rc8+ #646
[  509.361216] Hardware name: Supermicro SYS-2028TP-DECR/X10DRT-P, BIOS 2.0b 03/30/2017

Fixes: ef6980b6be ("introduce IFE action")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 11:45:06 -07:00
David S. Miller
c1617fb4c5 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-08-13

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add driver XDP support for veth. This can be used in conjunction with
   redirect of another XDP program e.g. sitting on NIC so the xdp_frame
   can be forwarded to the peer veth directly without modification,
   from Toshiaki.

2) Add a new BPF map type REUSEPORT_SOCKARRAY and prog type SK_REUSEPORT
   in order to provide more control and visibility on where a SO_REUSEPORT
   sk should be located, and the latter enables to directly select a sk
   from the bpf map. This also enables map-in-map for application migration
   use cases, from Martin.

3) Add a new BPF helper bpf_skb_ancestor_cgroup_id() that returns the id
   of cgroup v2 that is the ancestor of the cgroup associated with the
   skb at the ancestor_level, from Andrey.

4) Implement BPF fs map pretty-print support based on BTF data for regular
   hash table and LRU map, from Yonghong.

5) Decouple the ability to attach BTF for a map from the key and value
   pretty-printer in BPF fs, and enable further support of BTF for maps for
   percpu and LPM trie, from Daniel.

6) Implement a better BPF sample of using XDP's CPU redirect feature for
   load balancing SKB processing to remote CPU. The sample implements the
   same XDP load balancing as Suricata does which is symmetric hash based
   on IP and L4 protocol, from Jesper.

7) Revert adding NULL pointer check with WARN_ON_ONCE() in __xdp_return()'s
   critical path as it is ensured that the allocator is present, from Björn.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 10:07:23 -07:00
Li RongQing
71e4128620 packet: switch kvzalloc to allocate memory
The patches includes following change:

*Use modern kvzalloc()/kvfree() instead of custom allocations.

*Remove order argument for alloc_pg_vec, it can get from req.

*Remove order argument for free_pg_vec, free_pg_vec now uses
kvfree which does not need order argument.

*Remove pg_vec_order from struct packet_ring_buffer, no longer
need to save/restore 'order'

*Remove variable 'order' for packet_set_ring, it is now unused

Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 09:21:05 -07:00
Jamal Hadi Salim
7c5790c4da net: sched: act_mirred method rename for grep-ability and consistency
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 09:06:17 -07:00
Jamal Hadi Salim
8aa7f22e56 net: sched: act_vlan method rename for grep-ability and consistency
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 09:06:17 -07:00
Jamal Hadi Salim
353d2c253f net: sched: act_skbmod method rename for grep-ability and consistency
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 09:06:17 -07:00
Jamal Hadi Salim
45da1dac61 net: sched: act_skbedit method rename for grep-ability and consistency
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 09:06:17 -07:00
Jamal Hadi Salim
798de374e5 net: sched: act_simple method rename for grep-ability and consistency
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 09:06:17 -07:00
Jamal Hadi Salim
2ac063474d net: sched: act_police method rename for grep-ability and consistency
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 09:06:17 -07:00
Jamal Hadi Salim
6a2b401cd1 net: sched: act_pedit method rename for grep-ability and consistency
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 09:06:17 -07:00
Jamal Hadi Salim
0390514fe1 net: sched: act_nat method rename for grep-ability and consistency
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 09:06:16 -07:00
Jamal Hadi Salim
11b9695b3f net: sched: act_ipt method rename for grep-ability and consistency
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 09:06:16 -07:00
Jamal Hadi Salim
1740005e2a net: sched: act_gact method rename for grep-ability and consistency
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 09:06:16 -07:00
Jamal Hadi Salim
c831549c3f net: sched: act_sum method rename for grep-ability and consistency
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 09:06:16 -07:00
Jamal Hadi Salim
2fbec27f81 net: sched: act_bpf method rename for grep-ability and consistency
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 09:06:16 -07:00
Jamal Hadi Salim
962ad1f937 net: sched: act_connmark method rename for grep-ability and consistency
Signed-off-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 09:06:16 -07:00
YueHaibing
4de17aea5c crush: fix using plain integer as NULL warning
Fixes the following sparse warnings:

net/ceph/crush/mapper.c:517:76: warning: Using plain integer as NULL pointer
net/ceph/crush/mapper.c:728:68: warning: Using plain integer as NULL pointer

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-13 17:55:44 +02:00
YueHaibing
bad87216fb libceph: remove unnecessary non NULL check for request_key
request_key never return NULL,so no need do non-NULL check.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-13 17:55:44 +02:00
Wei Wang
6d37fa49da l2tp: use sk_dst_check() to avoid race on sk->sk_dst_cache
In l2tp code, if it is a L2TP_UDP_ENCAP tunnel, tunnel->sk points to a
UDP socket. User could call sendmsg() on both this tunnel and the UDP
socket itself concurrently. As l2tp_xmit_skb() holds socket lock and call
__sk_dst_check() to refresh sk->sk_dst_cache, while udpv6_sendmsg() is
lockless and call sk_dst_check() to refresh sk->sk_dst_cache, there
could be a race and cause the dst cache to be freed multiple times.
So we fix l2tp side code to always call sk_dst_check() to garantee
xchg() is called when refreshing sk->sk_dst_cache to avoid race
conditions.

Syzkaller reported stack trace:
BUG: KASAN: use-after-free in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: use-after-free in atomic_fetch_add_unless include/linux/atomic.h:575 [inline]
BUG: KASAN: use-after-free in atomic_add_unless include/linux/atomic.h:597 [inline]
BUG: KASAN: use-after-free in dst_hold_safe include/net/dst.h:308 [inline]
BUG: KASAN: use-after-free in ip6_hold_safe+0xe6/0x670 net/ipv6/route.c:1029
Read of size 4 at addr ffff8801aea9a880 by task syz-executor129/4829

CPU: 0 PID: 4829 Comm: syz-executor129 Not tainted 4.18.0-rc7-next-20180802+ #30
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
 __dump_stack lib/dump_stack.c:77 [inline]
 dump_stack+0x1c9/0x2b4 lib/dump_stack.c:113
 print_address_description+0x6c/0x20b mm/kasan/report.c:256
 kasan_report_error mm/kasan/report.c:354 [inline]
 kasan_report.cold.7+0x242/0x30d mm/kasan/report.c:412
 check_memory_region_inline mm/kasan/kasan.c:260 [inline]
 check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
 kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
 atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
 atomic_fetch_add_unless include/linux/atomic.h:575 [inline]
 atomic_add_unless include/linux/atomic.h:597 [inline]
 dst_hold_safe include/net/dst.h:308 [inline]
 ip6_hold_safe+0xe6/0x670 net/ipv6/route.c:1029
 rt6_get_pcpu_route net/ipv6/route.c:1249 [inline]
 ip6_pol_route+0x354/0xd20 net/ipv6/route.c:1922
 ip6_pol_route_output+0x54/0x70 net/ipv6/route.c:2098
 fib6_rule_lookup+0x283/0x890 net/ipv6/fib6_rules.c:122
 ip6_route_output_flags+0x2c5/0x350 net/ipv6/route.c:2126
 ip6_dst_lookup_tail+0x1278/0x1da0 net/ipv6/ip6_output.c:978
 ip6_dst_lookup_flow+0xc8/0x270 net/ipv6/ip6_output.c:1079
 ip6_sk_dst_lookup_flow+0x5ed/0xc50 net/ipv6/ip6_output.c:1117
 udpv6_sendmsg+0x2163/0x36b0 net/ipv6/udp.c:1354
 inet_sendmsg+0x1a1/0x690 net/ipv4/af_inet.c:798
 sock_sendmsg_nosec net/socket.c:622 [inline]
 sock_sendmsg+0xd5/0x120 net/socket.c:632
 ___sys_sendmsg+0x51d/0x930 net/socket.c:2115
 __sys_sendmmsg+0x240/0x6f0 net/socket.c:2210
 __do_sys_sendmmsg net/socket.c:2239 [inline]
 __se_sys_sendmmsg net/socket.c:2236 [inline]
 __x64_sys_sendmmsg+0x9d/0x100 net/socket.c:2236
 do_syscall_64+0x1b9/0x820 arch/x86/entry/common.c:290
 entry_SYSCALL_64_after_hwframe+0x49/0xbe
RIP: 0033:0x446a29
Code: e8 ac b8 02 00 48 83 c4 18 c3 0f 1f 80 00 00 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 eb 08 fc ff c3 66 2e 0f 1f 84 00 00 00 00
RSP: 002b:00007f4de5532db8 EFLAGS: 00000246 ORIG_RAX: 0000000000000133
RAX: ffffffffffffffda RBX: 00000000006dcc38 RCX: 0000000000446a29
RDX: 00000000000000b8 RSI: 0000000020001b00 RDI: 0000000000000003
RBP: 00000000006dcc30 R08: 00007f4de5533700 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000246 R12: 00000000006dcc3c
R13: 00007ffe2b830fdf R14: 00007f4de55339c0 R15: 0000000000000001

Fixes: 71b1391a41 ("l2tp: ensure sk->dst is still valid")
Reported-by: syzbot+05f840f3b04f211bad55@syzkaller.appspotmail.com
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Cc: Guillaume Nault <g.nault@alphalink.fr>
Cc: David Ahern <dsahern@gmail.com>
Cc: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 08:43:36 -07:00
Virgile Jarry
e6f86b0f7a ipv6: Add icmp_echo_ignore_all support for ICMPv6
Preventing the kernel from responding to ICMP Echo Requests messages
can be useful in several ways. The sysctl parameter
'icmp_echo_ignore_all' can be used to prevent the kernel from
responding to IPv4 ICMP echo requests. For IPv6 pings, such
a sysctl kernel parameter did not exist.

Add the ability to prevent the kernel from responding to IPv6
ICMP echo requests through the use of the following sysctl
parameter : /proc/sys/net/ipv6/icmp/echo_ignore_all.
Update the documentation to reflect this change.

Signed-off-by: Virgile Jarry <virgile@acceis.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 08:42:25 -07:00
Vakul Garg
0b243d004e net/tls: Combined memory allocation for decryption request
For preparing decryption request, several memory chunks are required
(aead_req, sgin, sgout, iv, aad). For submitting the decrypt request to
an accelerator, it is required that the buffers which are read by the
accelerator must be dma-able and not come from stack. The buffers for
aad and iv can be separately kmalloced each, but it is inefficient.
This patch does a combined allocation for preparing decryption request
and then segments into aead_req || sgin || sgout || iv || aad.

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-13 08:41:09 -07:00
piaojun
edcd9d9773 net/9p/trans_virtio.c: add null terminal for mount tag
chan->tag is Non-null terminated which will result in printing messy code
when debugging code. So we should add '\0' for tag to make the code more
convenient and robust. In addition, I drop char->tag_len to simplify the
code.

Link: http://lkml.kernel.org/r/5B641ECC.5030401@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:34:58 +09:00
jiangyiwen
23cba9cbde 9p/virtio: fix off-by-one error in sg list bounds check
Because the value of limit is VIRTQUEUE_NUM, if index is equal to
limit, it will cause sg array out of bounds, so correct the judgement
of BUG_ON.

Link: http://lkml.kernel.org/r/5B63D5F6.6080109@huawei.com
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reported-By: Dan Carpenter <dan.carpenter@oracle.com>
Acked-by: Jun Piao <piaojun@huawei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:35:28 +09:00
Stephen Hemminger
c69f297d7f 9p: fix whitespace issues
Remove trailing whitespace and blank lines at EOF

Link: http://lkml.kernel.org/m/20180724192918.31165-11-sthemmin@microsoft.com
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:34:58 +09:00
Tomas Bortoli
10aa14527f 9p: fix multiple NULL-pointer-dereferences
Added checks to prevent GPFs from raising.

Link: http://lkml.kernel.org/r/20180727110558.5479-1-tomasbortoli@gmail.com
Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
Reported-by: syzbot+1a262da37d3bead15c39@syzkaller.appspotmail.com
Cc: stable@vger.kernel.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:34:58 +09:00
Tomas Bortoli
f984579a01 9p: validate PDU length
This commit adds length check for the PDU size.
The size contained in the header has to match the actual size,
except for TCP (trans_fd.c) where actual length is not known ahead
and the header's length will be checked only against the validity
range.

Link: http://lkml.kernel.org/r/20180723154404.2406-1-tomasbortoli@gmail.com
Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
Reported-by: syzbot+65c6b72f284a39d416b4@syzkaller.appspotmail.com
To: Eric Van Hensbergen <ericvh@gmail.com>
To: Ron Minnich <rminnich@sandia.gov>
To: Latchesar Ionkov <lucho@ionkov.net>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:34:58 +09:00
Tomas Bortoli
9f476d7c54 net/9p/trans_fd.c: fix race by holding the lock
It may be possible to run p9_fd_cancel() with a deleted req->req_list
and incur in a double del. To fix hold the client->lock while changing
the status, so the other threads will be synchronized.

Link: http://lkml.kernel.org/r/20180723184253.6682-1-tomasbortoli@gmail.com
Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
Reported-by: syzbot+735d926e9d1317c3310c@syzkaller.appspotmail.com
To: Eric Van Hensbergen <ericvh@gmail.com>
To: Ron Minnich <rminnich@sandia.gov>
To: Latchesar Ionkov <lucho@ionkov.net>
Cc: Yiwen Jiang <jiangyiwen@huwei.com>
Cc: David S. Miller <davem@davemloft.net>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:34:58 +09:00
Tomas Bortoli
430ac66eb4 net/9p/trans_fd.c: fix race-condition by flushing workqueue before the kfree()
The patch adds the flush in p9_mux_poll_stop() as it the function used by
p9_conn_destroy(), in turn called by p9_fd_close() to stop the async
polling associated with the data regarding the connection.

Link: http://lkml.kernel.org/r/20180720092730.27104-1-tomasbortoli@gmail.com
Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
Reported-by: syzbot+39749ed7d9ef6dfb23f6@syzkaller.appspotmail.com
To: Eric Van Hensbergen <ericvh@gmail.com>
To: Ron Minnich <rminnich@sandia.gov>
To: Latchesar Ionkov <lucho@ionkov.net>
Cc: Yiwen Jiang <jiangyiwen@huwei.com>
Cc: stable@vger.kernel.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:34:58 +09:00
jiangyiwen
31934da810 net/9p/virtio: Fix hard lockup in req_done
When client has multiple threads that issue io requests
all the time, and the server has a very good performance,
it may cause cpu is running in the irq context for a long
time because it can check virtqueue has buf in the *while*
loop.

So we should keep chan->lock in the whole loop.

[ Dominique: reworded subject line ]
Link: http://lkml.kernel.org/r/5B503AEC.5080404@huawei.com
Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
To: Andrew Morton <akpm@linux-foundation.org>
To: Eric Van Hensbergen <ericvh@gmail.com>
To: Ron Minnich <rminnich@sandia.gov>
To: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:34:58 +09:00
piaojun
c7ebbae7cf net/9p/trans_virtio.c: fix some spell mistakes in comments
Fix spelling mistake in comments of p9_virtio_zc_request().

Link: http://lkml.kernel.org/r/5B4EB7D9.9010108@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:34:58 +09:00
Chirantan Ekbote
d28c756cae 9p/net: Fix zero-copy path in the 9p virtio transport
The zero-copy optimization when reading or writing large chunks of data
is quite useful.  However, the 9p messages created through the zero-copy
write path have an incorrect message size: it should be the size of the
header + size of the data being written but instead it's just the size
of the header.

This only works if the server ignores the size field of the message and
otherwise breaks the framing of the protocol. Fix this by re-writing the
message size field with the correct value.

Tested by running `dd if=/dev/zero of=out bs=4k count=1` inside a
virtio-9p mount.

Link: http://lkml.kernel.org/r/20180717003529.114368-1-chirantan@chromium.org
Signed-off-by: Chirantan Ekbote <chirantan@chromium.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Tested-by: Greg Kurz <groug@kaod.org>
Cc: Dylan Reid <dgreid@chromium.org>
Cc: Guenter Roeck <groeck@chromium.org>
Cc: stable@vger.kernel.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:34:59 +09:00
Matthew Wilcox
2557d0c57c 9p: Embed wait_queue_head into p9_req_t
On a 64-bit system, the wait_queue_head_t is 24 bytes while the pointer
to it is 8 bytes.  Growing the p9_req_t by 16 bytes is better than
performing a 24-byte memory allocation.

Link: http://lkml.kernel.org/r/20180711210225.19730-5-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:21:44 +09:00
Matthew Wilcox
f28cdf0430 9p: Replace the fidlist with an IDR
The p9_idpool being used to allocate the IDs uses an IDR to allocate
the IDs ... which we then keep in a doubly-linked list, rather than in
the IDR which allocated them.  We can use an IDR directly which saves
two pointers per p9_fid, and a tiny memory allocation per p9_client.

Link: http://lkml.kernel.org/r/20180711210225.19730-4-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:21:44 +09:00
Matthew Wilcox
b5303be2be 9p: Change p9_fid_create calling convention
Return NULL instead of ERR_PTR when we can't allocate a FID.  The ENOSPC
return value was getting all the way back to userspace, and that's
confusing for a userspace program which isn't expecting read() to tell it
there's no space left on the filesystem.  The best error we can return to
indicate a temporary failure caused by lack of client resources is ENOMEM.

Maybe it would be better to sleep until a FID is available, but that's
not a change I'm comfortable making.

Link: http://lkml.kernel.org/r/20180711210225.19730-3-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Reviewed-by: Yiwen Jiang <jiangyiwen@huwei.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:21:44 +09:00
Matthew Wilcox
2d58f63f72 9p: Fix comment on smp_wmb
The previous comment misled me into thinking the barrier wasn't needed
at all.

Link: http://lkml.kernel.org/r/20180711210225.19730-2-willy@infradead.org
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:21:44 +09:00
Tomas Bortoli
7913690dcc net/9p/client.c: version pointer uninitialized
The p9_client_version() does not initialize the version pointer. If the
call to p9pdu_readf() returns an error and version has not been allocated
in p9pdu_readf(), then the program will jump to the "error" label and will
try to free the version pointer. If version is not initialized, free()
will be called with uninitialized, garbage data and will provoke a crash.

Link: http://lkml.kernel.org/r/20180709222943.19503-1-tomasbortoli@gmail.com
Signed-off-by: Tomas Bortoli <tomasbortoli@gmail.com>
Reported-by: syzbot+65c6b72f284a39d416b4@syzkaller.appspotmail.com
Reviewed-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:21:44 +09:00
Jean-Philippe Brucker
92aef4675d net/9p: fix error path of p9_virtio_probe
Currently when virtio_find_single_vq fails, we go through del_vqs which
throws a warning (Trying to free already-free IRQ).  Skip del_vqs if vq
allocation failed.

Link: http://lkml.kernel.org/r/20180524101021.49880-1-jean-philippe.brucker@arm.com
Signed-off-by: Jean-Philippe Brucker <jean-philippe.brucker@arm.com>
Reviewed-by: Greg Kurz <groug@kaod.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:21:44 +09:00
piaojun
b87d1d2652 9p/net/protocol.c: return -ENOMEM when kmalloc() failed
We should return -ENOMEM to upper user when kmalloc failed to indicate
accurate errno.

Link: http://lkml.kernel.org/r/5B4552C5.60000@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:21:44 +09:00
piaojun
64ad31f328 net/9p/client.c: add missing '\n' at the end of p9_debug()
In p9_client_getattr_dotl(), we should add '\n' at the end of printing
log.

Link: http://lkml.kernel.org/r/5B44589A.50302@huawei.com
Signed-off-by: Jun Piao <piaojun@huawei.com>
Reviewed-by: Yiwen Jiang <jiangyiwen@huawei.com>
Cc: Eric Van Hensbergen <ericvh@gmail.com>
Cc: Ron Minnich <rminnich@sandia.gov>
Cc: Latchesar Ionkov <lucho@ionkov.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dominique Martinet <dominique.martinet@cea.fr>
2018-08-13 09:21:44 +09:00
Andrey Ignatov
7723628101 bpf: Introduce bpf_skb_ancestor_cgroup_id helper
== Problem description ==

It's useful to be able to identify cgroup associated with skb in TC so
that a policy can be applied to this skb, and existing bpf_skb_cgroup_id
helper can help with this.

Though in real life cgroup hierarchy and hierarchy to apply a policy to
don't map 1:1.

It's often the case that there is a container and corresponding cgroup,
but there are many more sub-cgroups inside container, e.g. because it's
delegated to containerized application to control resources for its
subsystems, or to separate application inside container from infra that
belongs to containerization system (e.g. sshd).

At the same time it may be useful to apply a policy to container as a
whole.

If multiple containers like this are run on a host (what is often the
case) and many of them have sub-cgroups, it may not be possible to apply
per-container policy in TC with existing helpers such as
bpf_skb_under_cgroup or bpf_skb_cgroup_id:

* bpf_skb_cgroup_id will return id of immediate cgroup associated with
  skb, i.e. if it's a sub-cgroup inside container, it can't be used to
  identify container's cgroup;

* bpf_skb_under_cgroup can work only with one cgroup and doesn't scale,
  i.e. if there are N containers on a host and a policy has to be
  applied to M of them (0 <= M <= N), it'd require M calls to
  bpf_skb_under_cgroup, and, if M changes, it'd require to rebuild &
  load new BPF program.

== Solution ==

The patch introduces new helper bpf_skb_ancestor_cgroup_id that can be
used to get id of cgroup v2 that is an ancestor of cgroup associated
with skb at specified level of cgroup hierarchy.

That way admin can place all containers on one level of cgroup hierarchy
(what is a good practice in general and already used in many
configurations) and identify specific cgroup on this level no matter
what sub-cgroup skb is associated with.

E.g. if there is a cgroup hierarchy:
  root/
  root/container1/
  root/container1/app11/
  root/container1/app11/sub-app-a/
  root/container1/app12/
  root/container2/
  root/container2/app21/
  root/container2/app22/
  root/container2/app22/sub-app-b/

, then having skb associated with root/container1/app11/sub-app-a/ it's
possible to get ancestor at level 1, what is container1 and apply policy
for this container, or apply another policy if it's container2.

Policies can be kept e.g. in a hash map where key is a container cgroup
id and value is an action.

Levels where container cgroups are created are usually known in advance
whether cgroup hierarchy inside container may be hard to predict
especially in case when its creation is delegated to containerized
application.

== Implementation details ==

The helper gets ancestor by walking parents up to specified level.

Another option would be to get different kind of "id" from
cgroup->ancestor_ids[level] and use it with idr_find() to get struct
cgroup for ancestor. But that would require radix lookup what doesn't
seem to be better (at least it's not obviously better).

Format of return value of the new helper is same as that of
bpf_skb_cgroup_id.

Signed-off-by: Andrey Ignatov <rdna@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-13 01:02:39 +02:00
Peter Oskolkov
a4fd284a1f ip: process in-order fragments efficiently
This patch changes the runtime behavior of IP defrag queue:
incoming in-order fragments are added to the end of the current
list/"run" of in-order fragments at the tail.

On some workloads, UDP stream performance is substantially improved:

RX: ./udp_stream -F 10 -T 2 -l 60
TX: ./udp_stream -c -H <host> -F 10 -T 5 -l 60

with this patchset applied on a 10Gbps receiver:

  throughput=9524.18
  throughput_units=Mbit/s

upstream (net-next):

  throughput=4608.93
  throughput_units=Mbit/s

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 17:54:18 -07:00
Peter Oskolkov
353c9cb360 ip: add helpers to process in-order fragments faster.
This patch introduces several helper functions/macros that will be
used in the follow-up patch. No runtime changes yet.

The new logic (fully implemented in the second patch) is as follows:

* Nodes in the rb-tree will now contain not single fragments, but lists
  of consecutive fragments ("runs").

* At each point in time, the current "active" run at the tail is
  maintained/tracked. Fragments that arrive in-order, adjacent
  to the previous tail fragment, are added to this tail run without
  triggering the re-balancing of the rb-tree.

* If a fragment arrives out of order with the offset _before_ the tail run,
  it is inserted into the rb-tree as a single fragment.

* If a fragment arrives after the current tail fragment (with a gap),
  it starts a new "tail" run, as is inserted into the rb-tree
  at the end as the head of the new run.

skb->cb is used to store additional information
needed here (suggested by Eric Dumazet).

Reported-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 17:54:18 -07:00
Vlad Buslov
e329bc4273 net: sched: act_police: remove dependency on rtnl lock
Use tcf spinlock to protect police action private data from concurrent
modification during dump. (init already uses tcf spinlock when changing
police action state)

Pass tcf spinlock as estimator lock argument to gen_replace_estimator()
during action init.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:37:10 -07:00
Vlad Buslov
51a9f5ae65 net: core: protect rate estimator statistics pointer with lock
Extend gen_new_estimator() to also take stats_lock when re-assigning rate
estimator statistics pointer. (to be used by unlocked actions)

Rename 'stats_lock' to 'lock' and change argument description to explain
that it is now also used for control path.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:37:10 -07:00
Vlad Buslov
4e232818bd net: sched: act_mirred: remove dependency on rtnl lock
Re-introduce mirred list spinlock, that was removed some time ago, in order
to protect it from concurrent modifications, instead of relying on rtnl
lock.

Use tcf spinlock to protect mirred action private data from concurrent
modification in init and dump. Rearrange access to mirred data in order to
be performed only while holding the lock.

Rearrange net dev access to always hold reference while working with it,
instead of relying on rntl lock.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:37:10 -07:00
Vlad Buslov
84a75b329b net: sched: extend action ops with put_dev callback
As a preparation for removing dependency on rtnl lock from rules update
path, all users of shared objects must take reference while working with
them.

Extend action ops with put_dev() API to be used on net device returned by
get_dev().

Modify mirred action (only action that implements get_dev callback):
- Take reference to net device in get_dev.
- Implement put_dev API that releases reference to net device.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:37:10 -07:00
Vlad Buslov
764e9a2448 net: sched: act_vlan: remove dependency on rtnl lock
Use tcf spinlock to protect vlan action private data from concurrent
modification during dump and init. Use rcu swap operation to reassign
params pointer under protection of tcf lock. (old params value is not used
by init, so there is no need of standalone rcu dereference step)

Remove rtnl assertion that is no longer necessary.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:37:10 -07:00
Vlad Buslov
729e012609 net: sched: act_tunnel_key: remove dependency on rtnl lock
Use tcf lock to protect tunnel key action struct private data from
concurrent modification in init and dump. Use rcu swap operation to
reassign params pointer under protection of tcf lock. (old params value is
not used by init, so there is no need of standalone rcu dereference step)

Remove rtnl lock assertion that is no longer required.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:37:10 -07:00
Vlad Buslov
c8814552fe net: sched: act_skbmod: remove dependency on rtnl lock
Move read of skbmod_p rcu pointer to be protected by tcf spinlock. Use tcf
spinlock to protect private skbmod data from concurrent modification during
dump.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:37:09 -07:00
Vlad Buslov
5e48180ed8 net: sched: act_simple: remove dependency on rtnl lock
Use tcf spinlock to protect private simple action data from concurrent
modification during dump. (simple init already uses tcf spinlock when
changing action state)

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:37:09 -07:00
Vlad Buslov
d772849566 net: sched: act_sample: remove dependency on rtnl lock
Use tcf spinlock to protect private sample action data from concurrent
modification during dump and init.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:37:09 -07:00
Vlad Buslov
67b0c1a3c9 net: sched: act_pedit: remove dependency on rtnl lock
Rearrange pedit init code to only access pedit action data while holding
tcf spinlock. Change keys allocation type to atomic to allow it to execute
while holding tcf spinlock. Take tcf spinlock in dump function when
accessing pedit action data.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:37:09 -07:00
Vlad Buslov
ff25276de9 net: sched: act_ipt: remove dependency on rtnl lock
Use tcf spinlock to protect ipt action private data from concurrent
modification during dump. Ipt init already takes tcf spinlock when
modifying ipt state.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:37:09 -07:00
Vlad Buslov
54d0d423a4 net: sched: act_ife: remove dependency on rtnl lock
Use tcf spinlock and rcu to protect params pointer from concurrent
modification during dump and init. Use rcu swap operation to reassign
params pointer under protection of tcf lock. (old params value is not used
by init, so there is no need of standalone rcu dereference step)

Ife action has meta-actions that are compiled as standalone modules. Rtnl
mutex must be released while loading a kernel module. In order to support
execution without rtnl mutex, propagate 'rtnl_held' argument to meta action
loading functions. When requesting meta action module, conditionally
release rtnl lock depending on 'rtnl_held' argument.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:37:09 -07:00
Vlad Buslov
e8917f4370 net: sched: act_gact: remove dependency on rtnl lock
Use tcf spinlock to protect gact action private state from concurrent
modification during dump and init. Remove rtnl assertion that is no longer
necessary.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:37:09 -07:00
Vlad Buslov
b6a2b971c0 net: sched: act_csum: remove dependency on rtnl lock
Use tcf lock to protect csum action struct private data from concurrent
modification in init and dump. Use rcu swap operation to reassign params
pointer under protection of tcf lock. (old params value is not used by
init, so there is no need of standalone rcu dereference step)

Remove rtnl assertion that is no longer necessary.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:37:09 -07:00
Vlad Buslov
2142236b45 net: sched: act_bpf: remove dependency on rtnl lock
Use tcf spinlock to protect bpf action private data from concurrent
modification during dump and init. Remove rtnl lock assertion that is no
longer necessary.

Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:37:09 -07:00
Konstantin Khorenko
0d493b4d0b net/sctp: Replace in/out stream arrays with flex_array
This path replaces physically contiguous memory arrays
allocated using kmalloc_array() with flexible arrays.
This enables to avoid memory allocation failures on the
systems under a memory stress.

Signed-off-by: Oleg Babin <obabin@virtuozzo.com>
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:25:15 -07:00
Konstantin Khorenko
05364ca03c net/sctp: Make wrappers for accessing in/out streams
This patch introduces wrappers for accessing in/out streams indirectly.
This will enable to replace physically contiguous memory arrays
of streams with flexible arrays (or maybe any other appropriate
mechanism) which do memory allocation on a per-page basis.

Signed-off-by: Oleg Babin <obabin@virtuozzo.com>
Signed-off-by: Konstantin Khorenko <khorenko@virtuozzo.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:25:15 -07:00
Guillaume Nault
4f5f85e9a7 l2tp: let pppol2tp_ioctl() fallback to dev_ioctl()
Return -ENOIOCTLCMD for unknown ioctl commands. This lets dev_ioctl()
handle generic socket ioctls like SIOCGIFNAME or SIOCGIFINDEX.
PF_PPPOX/PX_PROTO_OL2TP was one of the few socket types not honouring
this mechanism.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:13:49 -07:00
Guillaume Nault
7390ed8a40 l2tp: zero out stats in pppol2tp_copy_stats()
Integrate memset(0) in pppol2tp_copy_stats() to avoid calling it
manually every time.

While there, constify 'stats'.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:13:49 -07:00
Guillaume Nault
b0e29063dc l2tp: remove pppol2tp_session_ioctl()
pppol2tp_ioctl() has everything in place for handling PPPIOCGL2TPSTATS
on session sockets. We just need to copy the stats and set ->session_id.

As a side effect of sharing session and tunnel code, ->using_ipsec is
properly set even when the request was made using a session socket.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:13:49 -07:00
Guillaume Nault
528534f0de l2tp: remove pppol2tp_tunnel_ioctl()
Handle PPPIOCGL2TPSTATS in pppol2tp_ioctl() if the socket represents a
tunnel. This one is a bit special because the caller may use the tunnel
socket to retrieve statistics of one of its sessions. If the session_id
is set, the corresponding session's statistics are returned, instead of
those of the tunnel. This is handled by the new
pppol2tp_tunnel_copy_stats() helper function.

Set ->tunnel_id and ->using_ipsec out of the conditional, so
that it can be used by the 'else' branch in the following patch.
We cannot do that for ->session_id, because tunnel sockets have to
report the value that was originally passed in 'stats.session_id',
while session sockets have to report their own session_id.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:13:49 -07:00
Guillaume Nault
79e6760e64 l2tp: handle PPPIOC[GS]MRU and PPPIOC[GS]FLAGS in pppol2tp_ioctl()
Let pppol2tp_ioctl() handle ioctl commands directly. It still relies on
pppol2tp_{session,tunnel}_ioctl() for PPPIOCGL2TPSTATS.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:13:49 -07:00
Guillaume Nault
bdd0292f96 l2tp: simplify pppol2tp_ioctl()
* Drop test on 'sk': sock->sk cannot be NULL, or pppox_ioctl() could
    not have called us.

  * Drop test on 'SOCK_DEAD' state: if this flag was set, the socket
    would be in the process of being released and no ioctl could be
    running anymore.

  * Drop test on 'PPPOX_*' state: we depend on ->sk_user_data to get
    the session structure. If it is non-NULL, then the socket is
    connected. Testing for PPPOX_* is redundant.

  * Retrieve session using ->sk_user_data directly, instead of going
    through pppol2tp_sock_to_session(). This avoids grabbing a useless
    reference on the socket.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:13:49 -07:00
Guillaume Nault
01e28b921b l2tp: split l2tp_session_get()
l2tp_session_get() is used for two different purposes. If 'tunnel' is
NULL, the session is searched globally in the supplied network
namespace. Otherwise it is searched exclusively in the tunnel context.

Callers always know the context in which they need to search the
session. But some of them do provide both a namespace and a tunnel,
making the semantic of the call unclear.

This patch defines l2tp_tunnel_get_session() for lookups done in a
tunnel and restricts l2tp_session_get() to namespace searches.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:13:49 -07:00
Guillaume Nault
d6a61ec936 l2tp: define l2tp_tunnel_uses_xfrm()
Use helper function to figure out if a tunnel is using ipsec.
Also, avoid accessing ->sk_policy directly since it's RCU protected.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 12:13:49 -07:00
Yuchung Cheng
fd2123a3d7 tcp: avoid resetting ACK timer upon receiving packet with ECN CWR flag
Previously commit 9aee400061 ("tcp: ack immediately when a cwr
packet arrives") calls tcp_enter_quickack_mode to force sending
two immediate ACKs upon receiving a packet w/ CWR flag. The side
effect is it'll also reset the delayed ACK timer and interactive
session tracking. This patch removes that side effect by using the
new ACK_NOW flag to force an immmediate ACK.

Packetdrill to demonstrate:

    0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
   +0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
   +0 setsockopt(3, SOL_TCP, TCP_CONGESTION, "dctcp", 5) = 0
   +0 bind(3, ..., ...) = 0
   +0 listen(3, 1) = 0

   +0 < [ect0] SEW 0:0(0) win 32792 <mss 1000,sackOK,nop,nop,nop,wscale 7>
   +0 > SE. 0:0(0) ack 1 <mss 1460,nop,nop,sackOK,nop,wscale 8>
  +.1 < [ect0] . 1:1(0) ack 1 win 257
   +0 accept(3, ..., ...) = 4

   +0 < [ect0] . 1:1001(1000) ack 1 win 257
   +0 > [ect01] . 1:1(0) ack 1001

   +0 write(4, ..., 1) = 1
   +0 > [ect01] P. 1:2(1) ack 1001

   +0 < [ect0] . 1001:2001(1000) ack 2 win 257
   +0 write(4, ..., 1) = 1
   +0 > [ect01] P. 2:3(1) ack 2001

   +0 < [ect0] . 2001:3001(1000) ack 3 win 257
   +0 < [ect0] . 3001:4001(1000) ack 3 win 257
   // Ack delayed ...

   +.01 < [ce] P. 4001:4501(500) ack 3 win 257
   +0 > [ect01] . 3:3(0) ack 4001
   +0 > [ect01] E. 3:3(0) ack 4501

+.001 read(4, ..., 4500) = 4500
   +0 write(4, ..., 1) = 1
   +0 > [ect01] PE. 3:4(1) ack 4501 win 100

 +.01 < [ect0] W. 4501:5501(1000) ack 4 win 257
   // No delayed ACK on CWR flag
   +0 > [ect01] . 4:4(0) ack 5501

 +.31 < [ect0] . 5501:6501(1000) ack 4 win 257
   +0 > [ect01] . 4:4(0) ack 6501

Fixes: 9aee400061 ("tcp: ack immediately when a cwr packet arrives")
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 11:31:35 -07:00
Yuchung Cheng
15bdd5686c tcp: always ACK immediately on hole repairs
RFC 5681 sec 4.2:
  To provide feedback to senders recovering from losses, the receiver
  SHOULD send an immediate ACK when it receives a data segment that
  fills in all or part of a gap in the sequence space.

When a gap is partially filled, __tcp_ack_snd_check already checks
the out-of-order queue and correctly send an immediate ACK. However
when a gap is fully filled, the previous implementation only resets
pingpong mode which does not guarantee an immediate ACK because the
quick ACK counter may be zero. This patch addresses this issue by
marking the one-time immediate ACK flag instead.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 11:31:35 -07:00
Yuchung Cheng
d2ccd7bc8a tcp: avoid resetting ACK timer in DCTCP
The recent fix of acking immediately in DCTCP on CE status change
has an undesirable side-effect: it also resets TCP ack timer and
disables pingpong mode (interactive session). But the CE status
change has nothing to do with them. This patch addresses that by
using the new one-time immediate ACK flag instead of calling
tcp_enter_quickack_mode().

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 11:31:35 -07:00
Yuchung Cheng
466466dc6c tcp: mandate a one-time immediate ACK
Add a new flag to indicate a one-time immediate ACK. This flag is
occasionaly set under specific TCP protocol states in addition to
the more common quickack mechanism for interactive application.

In several cases in the TCP code we want to force an immediate ACK
but do not want to call tcp_enter_quickack_mode() because we do
not want to forget the icsk_ack.pingpong or icsk_ack.ato state.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 11:31:35 -07:00
Colin Ian King
98ed1e642c rxrpc: remove redundant static int 'zero'
The static int 'zero' is defined but is never used hence it is
redundant and can be removed. The use of this variable was removed
with commit a158bdd324 ("rxrpc: Fix call timeouts").

Cleans up clang warning:
warning: 'zero' defined but not used [-Wunused-const-variable=]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-11 11:25:18 -07:00
Martin KaFai Lau
8217ca653e bpf: Enable BPF_PROG_TYPE_SK_REUSEPORT bpf prog in reuseport selection
This patch allows a BPF_PROG_TYPE_SK_REUSEPORT bpf prog to select a
SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY introduced in
the earlier patch.  "bpf_run_sk_reuseport()" will return -ECONNREFUSED
when the BPF_PROG_TYPE_SK_REUSEPORT prog returns SK_DROP.
The callers, in inet[6]_hashtable.c and ipv[46]/udp.c, are modified to
handle this case and return NULL immediately instead of continuing the
sk search from its hashtable.

It re-uses the existing SO_ATTACH_REUSEPORT_EBPF setsockopt to attach
BPF_PROG_TYPE_SK_REUSEPORT.  The "sk_reuseport_attach_bpf()" will check
if the attaching bpf prog is in the new SK_REUSEPORT or the existing
SOCKET_FILTER type and then check different things accordingly.

One level of "__reuseport_attach_prog()" call is removed.  The
"sk_unhashed() && ..." and "sk->sk_reuseport_cb" tests are pushed
back to "reuseport_attach_prog()" in sock_reuseport.c.  sock_reuseport.c
seems to have more knowledge on those test requirements than filter.c.
In "reuseport_attach_prog()", after new_prog is attached to reuse->prog,
the old_prog (if any) is also directly freed instead of returning the
old_prog to the caller and asking the caller to free.

The sysctl_optmem_max check is moved back to the
"sk_reuseport_attach_filter()" and "sk_reuseport_attach_bpf()".
As of other bpf prog types, the new BPF_PROG_TYPE_SK_REUSEPORT is only
bounded by the usual "bpf_prog_charge_memlock()" during load time
instead of bounded by both bpf_prog_charge_memlock and sysctl_optmem_max.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-11 01:58:46 +02:00
Martin KaFai Lau
2dbb9b9e6d bpf: Introduce BPF_PROG_TYPE_SK_REUSEPORT
This patch adds a BPF_PROG_TYPE_SK_REUSEPORT which can select
a SO_REUSEPORT sk from a BPF_MAP_TYPE_REUSEPORT_ARRAY.  Like other
non SK_FILTER/CGROUP_SKB program, it requires CAP_SYS_ADMIN.

BPF_PROG_TYPE_SK_REUSEPORT introduces "struct sk_reuseport_kern"
to store the bpf context instead of using the skb->cb[48].

At the SO_REUSEPORT sk lookup time, it is in the middle of transiting
from a lower layer (ipv4/ipv6) to a upper layer (udp/tcp).  At this
point,  it is not always clear where the bpf context can be appended
in the skb->cb[48] to avoid saving-and-restoring cb[].  Even putting
aside the difference between ipv4-vs-ipv6 and udp-vs-tcp.  It is not
clear if the lower layer is only ipv4 and ipv6 in the future and
will it not touch the cb[] again before transiting to the upper
layer.

For example, in udp_gro_receive(), it uses the 48 byte NAPI_GRO_CB
instead of IP[6]CB and it may still modify the cb[] after calling
the udp[46]_lib_lookup_skb().  Because of the above reason, if
sk->cb is used for the bpf ctx, saving-and-restoring is needed
and likely the whole 48 bytes cb[] has to be saved and restored.

Instead of saving, setting and restoring the cb[], this patch opts
to create a new "struct sk_reuseport_kern" and setting the needed
values in there.

The new BPF_PROG_TYPE_SK_REUSEPORT and "struct sk_reuseport_(kern|md)"
will serve all ipv4/ipv6 + udp/tcp combinations.  There is no protocol
specific usage at this point and it is also inline with the current
sock_reuseport.c implementation (i.e. no protocol specific requirement).

In "struct sk_reuseport_md", this patch exposes data/data_end/len
with semantic similar to other existing usages.  Together
with "bpf_skb_load_bytes()" and "bpf_skb_load_bytes_relative()",
the bpf prog can peek anywhere in the skb.  The "bind_inany" tells
the bpf prog that the reuseport group is bind-ed to a local
INANY address which cannot be learned from skb.

The new "bind_inany" is added to "struct sock_reuseport" which will be
used when running the new "BPF_PROG_TYPE_SK_REUSEPORT" bpf prog in order
to avoid repeating the "bind INANY" test on
"sk_v6_rcv_saddr/sk->sk_rcv_saddr" every time a bpf prog is run.  It can
only be properly initialized when a "sk->sk_reuseport" enabled sk is
adding to a hashtable (i.e. during "reuseport_alloc()" and
"reuseport_add_sock()").

The new "sk_select_reuseport()" is the main helper that the
bpf prog will use to select a SO_REUSEPORT sk.  It is the only function
that can use the new BPF_MAP_TYPE_REUSEPORT_ARRAY.  As mentioned in
the earlier patch, the validity of a selected sk is checked in
run time in "sk_select_reuseport()".  Doing the check in
verification time is difficult and inflexible (consider the map-in-map
use case).  The runtime check is to compare the selected sk's reuseport_id
with the reuseport_id that we want.  This helper will return -EXXX if the
selected sk cannot serve the incoming request (e.g. reuseport_id
not match).  The bpf prog can decide if it wants to do SK_DROP as its
discretion.

When the bpf prog returns SK_PASS, the kernel will check if a
valid sk has been selected (i.e. "reuse_kern->selected_sk != NULL").
If it does , it will use the selected sk.  If not, the kernel
will select one from "reuse->socks[]" (as before this patch).

The SK_DROP and SK_PASS handling logic will be in the next patch.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-11 01:58:46 +02:00
Martin KaFai Lau
5dc4c4b7d4 bpf: Introduce BPF_MAP_TYPE_REUSEPORT_SOCKARRAY
This patch introduces a new map type BPF_MAP_TYPE_REUSEPORT_SOCKARRAY.

To unleash the full potential of a bpf prog, it is essential for the
userspace to be capable of directly setting up a bpf map which can then
be consumed by the bpf prog to make decision.  In this case, decide which
SO_REUSEPORT sk to serve the incoming request.

By adding BPF_MAP_TYPE_REUSEPORT_SOCKARRAY, the userspace has total control
and visibility on where a SO_REUSEPORT sk should be located in a bpf map.
The later patch will introduce BPF_PROG_TYPE_SK_REUSEPORT such that
the bpf prog can directly select a sk from the bpf map.  That will
raise the programmability of the bpf prog attached to a reuseport
group (a group of sk serving the same IP:PORT).

For example, in UDP, the bpf prog can peek into the payload (e.g.
through the "data" pointer introduced in the later patch) to learn
the application level's connection information and then decide which sk
to pick from a bpf map.  The userspace can tightly couple the sk's location
in a bpf map with the application logic in generating the UDP payload's
connection information.  This connection info contact/API stays within the
userspace.

Also, when used with map-in-map, the userspace can switch the
old-server-process's inner map to a new-server-process's inner map
in one call "bpf_map_update_elem(outer_map, &index, &new_reuseport_array)".
The bpf prog will then direct incoming requests to the new process instead
of the old process.  The old process can finish draining the pending
requests (e.g. by "accept()") before closing the old-fds.  [Note that
deleting a fd from a bpf map does not necessary mean the fd is closed]

During map_update_elem(),
Only SO_REUSEPORT sk (i.e. which has already been added
to a reuse->socks[]) can be used.  That means a SO_REUSEPORT sk that is
"bind()" for UDP or "bind()+listen()" for TCP.  These conditions are
ensured in "reuseport_array_update_check()".

A SO_REUSEPORT sk can only be added once to a map (i.e. the
same sk cannot be added twice even to the same map).  SO_REUSEPORT
already allows another sk to be created for the same IP:PORT.
There is no need to re-create a similar usage in the BPF side.

When a SO_REUSEPORT is deleted from the "reuse->socks[]" (e.g. "close()"),
it will notify the bpf map to remove it from the map also.  It is
done through "bpf_sk_reuseport_detach()" and it will only be called
if >=1 of the "reuse->sock[]" has ever been added to a bpf map.

The map_update()/map_delete() has to be in-sync with the
"reuse->socks[]".  Hence, the same "reuseport_lock" used
by "reuse->socks[]" has to be used here also. Care has
been taken to ensure the lock is only acquired when the
adding sk passes some strict tests. and
freeing the map does not require the reuseport_lock.

The reuseport_array will also support lookup from the syscall
side.  It will return a sock_gen_cookie().  The sock_gen_cookie()
is on-demand (i.e. a sk's cookie is not generated until the very
first map_lookup_elem()).

The lookup cookie is 64bits but it goes against the logical userspace
expectation on 32bits sizeof(fd) (and as other fd based bpf maps do also).
It may catch user in surprise if we enforce value_size=8 while
userspace still pass a 32bits fd during update.  Supporting different
value_size between lookup and update seems unintuitive also.

We also need to consider what if other existing fd based maps want
to return 64bits value from syscall's lookup in the future.
Hence, reuseport_array supports both value_size 4 and 8, and
assuming user will usually use value_size=4.  The syscall's lookup
will return ENOSPC on value_size=4.  It will will only
return 64bits value from sock_gen_cookie() when user consciously
choose value_size=8 (as a signal that lookup is desired) which then
requires a 64bits value in both lookup and update.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-11 01:58:46 +02:00
Martin KaFai Lau
736b46027e net: Add ID (if needed) to sock_reuseport and expose reuseport_lock
A later patch will introduce a BPF_MAP_TYPE_REUSEPORT_ARRAY which
allows a SO_REUSEPORT sk to be added to a bpf map.  When a sk
is removed from reuse->socks[], it also needs to be removed from
the bpf map.  Also, when adding a sk to a bpf map, the bpf
map needs to ensure it is indeed in a reuse->socks[].
Hence, reuseport_lock is needed by the bpf map to ensure its
map_update_elem() and map_delete_elem() operations are in-sync with
the reuse->socks[].  The BPF_MAP_TYPE_REUSEPORT_ARRAY map will only
acquire the reuseport_lock after ensuring the adding sk is already
in a reuseport group (i.e. reuse->socks[]).  The map_lookup_elem()
will be lockless.

This patch also adds an ID to sock_reuseport.  A later patch
will introduce BPF_PROG_TYPE_SK_REUSEPORT which allows
a bpf prog to select a sk from a bpf map.  It is inflexible to
statically enforce a bpf map can only contain the sk belonging to
a particular reuse->socks[] (i.e. same IP:PORT) during the bpf
verification time. For example, think about the the map-in-map situation
where the inner map can be dynamically changed in runtime and the outer
map may have inner maps belonging to different reuseport groups.
Hence, when the bpf prog (in the new BPF_PROG_TYPE_SK_REUSEPORT
type) selects a sk,  this selected sk has to be checked to ensure it
belongs to the requesting reuseport group (i.e. the group serving
that IP:PORT).

The "sk->sk_reuseport_cb" pointer cannot be used for this checking
purpose because the pointer value will change after reuseport_grow().
Instead of saving all checking conditions like the ones
preced calling "reuseport_add_sock()" and compare them everytime a
bpf_prog is run, a 32bits ID is introduced to survive the
reuseport_grow().  The ID is only acquired if any of the
reuse->socks[] is added to the newly introduced
"BPF_MAP_TYPE_REUSEPORT_ARRAY" map.

If "BPF_MAP_TYPE_REUSEPORT_ARRAY" is not used,  the changes in this
patch is a no-op.

Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-11 01:58:45 +02:00
Martin KaFai Lau
40a1227ea8 tcp: Avoid TCP syncookie rejected by SO_REUSEPORT socket
Although the actual cookie check "__cookie_v[46]_check()" does
not involve sk specific info, it checks whether the sk has recent
synq overflow event in "tcp_synq_no_recent_overflow()".  The
tcp_sk(sk)->rx_opt.ts_recent_stamp is updated every second
when it has sent out a syncookie (through "tcp_synq_overflow()").

The above per sk "recent synq overflow event timestamp" works well
for non SO_REUSEPORT use case.  However, it may cause random
connection request reject/discard when SO_REUSEPORT is used with
syncookie because it fails the "tcp_synq_no_recent_overflow()"
test.

When SO_REUSEPORT is used, it usually has multiple listening
socks serving TCP connection requests destinated to the same local IP:PORT.
There are cases that the TCP-ACK-COOKIE may not be received
by the same sk that sent out the syncookie.  For example,
if reuse->socks[] began with {sk0, sk1},
1) sk1 sent out syncookies and tcp_sk(sk1)->rx_opt.ts_recent_stamp
   was updated.
2) the reuse->socks[] became {sk1, sk2} later.  e.g. sk0 was first closed
   and then sk2 was added.  Here, sk2 does not have ts_recent_stamp set.
   There are other ordering that will trigger the similar situation
   below but the idea is the same.
3) When the TCP-ACK-COOKIE comes back, sk2 was selected.
   "tcp_synq_no_recent_overflow(sk2)" returns true. In this case,
   all syncookies sent by sk1 will be handled (and rejected)
   by sk2 while sk1 is still alive.

The userspace may create and remove listening SO_REUSEPORT sockets
as it sees fit.  E.g. Adding new thread (and SO_REUSEPORT sock) to handle
incoming requests, old process stopping and new process starting...etc.
With or without SO_ATTACH_REUSEPORT_[CB]BPF,
the sockets leaving and joining a reuseport group makes picking
the same sk to check the syncookie very difficult (if not impossible).

The later patches will allow bpf prog more flexibility in deciding
where a sk should be located in a bpf map and selecting a particular
SO_REUSEPORT sock as it sees fit.  e.g. Without closing any sock,
replace the whole bpf reuseport_array in one map_update() by using
map-in-map.  Getting the syncookie check working smoothly across
socks in the same "reuse->socks[]" is important.

A partial solution is to set the newly added sk's ts_recent_stamp
to the max ts_recent_stamp of a reuseport group but that will require
to iterate through reuse->socks[]  OR
pessimistically set it to "now - TCP_SYNCOOKIE_VALID" when a sk is
joining a reuseport group.  However, neither of them will solve the
existing sk getting moved around the reuse->socks[] and that
sk may not have ts_recent_stamp updated, unlikely under continuous
synflood but not impossible.

This patch opts to treat the reuseport group as a whole when
considering the last synq overflow timestamp since
they are serving the same IP:PORT from the userspace
(and BPF program) perspective.

"synq_overflow_ts" is added to "struct sock_reuseport".
The tcp_synq_overflow() and tcp_synq_no_recent_overflow()
will update/check reuse->synq_overflow_ts if the sk is
in a reuseport group.  Similar to the reuseport decision in
__inet_lookup_listener(), both sk->sk_reuseport and
sk->sk_reuseport_cb are tested for SO_REUSEPORT usage.
Update on "synq_overflow_ts" happens at roughly once
every second.

A synflood test was done with a 16 rx-queues and 16 reuseport sockets.
No meaningful performance change is observed.  Before and
after the change is ~9Mpps in IPv4.

Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Martin KaFai Lau <kafai@fb.com>
Acked-by: Alexei Starovoitov <ast@kernel.org>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-11 01:58:45 +02:00
Ursula Braun
0d86caff06 net/smc: send response to test link signal
With SMC-D z/OS sends a test link signal every 10 seconds. Linux is
supposed to answer, otherwise the SMC-D connection breaks.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-10 14:38:43 -07:00
David S. Miller
0780b86666 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2018-08-10

Here's one more (most likely last) bluetooth-next pull request for the
4.19 kernel.

 - Added support for MediaTek serial Bluetooth devices
 - Initial skeleton for controller-side address resolution support
 - Fix BT_HCIUART_RTL related Kconfig dependencies
 - A few other minor fixes/cleanups

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-10 14:24:57 -07:00
David S. Miller
fd685657cd Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following batch contains netfilter updates for your net-next tree:

1) Expose NFT_OSF_MAXGENRELEN maximum OS name length from the new OS
   passive fingerprint matching extension, from Fernando Fernandez.

2) Add extension to support for fine grain conntrack timeout policies
   from nf_tables. As preparation works, this patchset moves
   nf_ct_untimeout() to nf_conntrack_timeout and it also decouples the
   timeout policy from the ctnl_timeout object, most work done by
   Harsha Sharma.

3) Enable connection tracking when conntrack helper is in place.

4) Missing enumeration in uapi header when splitting original xt_osf
   to nfnetlink_osf, also from Fernando.

5) Fix a sparse warning due to incorrect typing in the nf_osf_find(),
   from Wei Yongjun.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-10 10:33:08 -07:00
Ankit Navik
aa12af77aa Bluetooth: Add definitions for LE set address resolution
Add the definitions for LE address resolution enable HCI commands.
When the LE address resolution enable gets changed via HCI commands
make sure that flag gets updated.

Signed-off-by: Ankit Navik <ankit.p.navik@intel.com>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-08-10 16:57:57 +02:00
Toshiaki Makita
2539650fad xdp: Helpers for disabling napi_direct of xdp_return_frame
We need some mechanism to disable napi_direct on calling
xdp_return_frame_rx_napi() from some context.
When veth gets support of XDP_REDIRECT, it will redirects packets which
are redirected from other devices. On redirection veth will reuse
xdp_mem_info of the redirection source device to make return_frame work.
But in this case .ndo_xdp_xmit() called from veth redirection uses
xdp_mem_info which is not guarded by NAPI, because the .ndo_xdp_xmit()
is not called directly from the rxq which owns the xdp_mem_info.

This approach introduces a flag in bpf_redirect_info to indicate that
napi_direct should be disabled even when _rx_napi variant is used as
well as helper functions to use it.

A NAPI handler who wants to use this flag needs to call
xdp_set_return_frame_no_direct() before processing packets, and call
xdp_clear_return_frame_no_direct() after xdp_do_flush_map() before
exiting NAPI.

v4:
- Use bpf_redirect_info for storing the flag instead of xdp_mem_info to
  avoid per-frame copy cost.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-10 16:12:21 +02:00
Toshiaki Makita
0b19cc0a86 bpf: Make redirect_info accessible from modules
We are going to add kern_flags field in redirect_info for kernel
internal use.
In order to avoid function call to access the flags, make redirect_info
accessible from modules. Also as it is now non-static, add prefix bpf_
to redirect_info.

v6:
- Fix sparse warning around EXPORT_SYMBOL.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-10 16:12:21 +02:00
Toshiaki Makita
b0768a8658 net: Export skb_headers_offset_update
This is needed for veth XDP which does skb_copy_expand()-like operation.

v2:
- Drop skb_copy_header part because it has already been exported now.

Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-10 16:12:20 +02:00
Björn Töpel
eb91e4d4db Revert "xdp: add NULL pointer check in __xdp_return()"
This reverts commit 36e0f12bbf.

The reverted commit adds a WARN to check against NULL entries in the
mem_id_ht rhashtable. Any kernel path implementing the XDP (generic or
driver) fast path is required to make a paired
xdp_rxq_info_reg/xdp_rxq_info_unreg call for proper function. In
addition, a driver using a different allocation scheme than the
default MEM_TYPE_PAGE_SHARED is required to additionally call
xdp_rxq_info_reg_mem_model.

For MEM_TYPE_ZERO_COPY, an xdp_rxq_info_reg_mem_model call ensures
that the mem_id_ht rhashtable has a properly inserted allocator id. If
not, this would be a driver bug. A NULL pointer kernel OOPS is
preferred to the WARN.

Suggested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-10 16:06:48 +02:00
Andrei Vagin
4d99f6602c net: allow to call netif_reset_xps_queues() under cpus_read_lock
The definition of static_key_slow_inc() has cpus_read_lock in place. In the
virtio_net driver, XPS queues are initialized after setting the queue:cpu
affinity in virtnet_set_affinity() which is already protected within
cpus_read_lock. Lockdep prints a warning when we are trying to acquire
cpus_read_lock when it is already held.

This patch adds an ability to call __netif_set_xps_queue under
cpus_read_lock().
Acked-by: Jason Wang <jasowang@redhat.com>

============================================
WARNING: possible recursive locking detected
4.18.0-rc3-next-20180703+ #1 Not tainted
--------------------------------------------
swapper/0/1 is trying to acquire lock:
00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: static_key_slow_inc+0xe/0x20

but task is already holding lock:
00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0

other info that might help us debug this:
 Possible unsafe locking scenario:

       CPU0
       ----
  lock(cpu_hotplug_lock.rw_sem);
  lock(cpu_hotplug_lock.rw_sem);

 *** DEADLOCK ***

 May be due to missing lock nesting notation

3 locks held by swapper/0/1:
 #0: 00000000244bc7da (&dev->mutex){....}, at: __driver_attach+0x5a/0x110
 #1: 00000000cf973d46 (cpu_hotplug_lock.rw_sem){++++}, at: init_vqs+0x513/0x5a0
 #2: 000000005cd8463f (xps_map_mutex){+.+.}, at: __netif_set_xps_queue+0x8d/0xc60

v2: move cpus_read_lock() out of __netif_set_xps_queue()

Cc: "Nambiar, Amritha" <amritha.nambiar@intel.com>
Cc: "Michael S. Tsirkin" <mst@redhat.com>
Cc: Jason Wang <jasowang@redhat.com>
Fixes: 8af2c06ff4 ("net-sysfs: Add interface for Rx queue(s) map per Tx queue")

Signed-off-by: Andrei Vagin <avagin@gmail.com>

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-09 14:25:06 -07:00
Jiri Pirko
63cc5bcc9f net: sched: fix block->refcnt decrement
Currently the refcnt is never decremented in case the value is not 1.
Fix it by adding decrement in case the refcnt is not 1.

Reported-by: Vlad Buslov <vladbu@mellanox.com>
Fixes: f71e0ca4db ("net: sched: Avoid implicit chain 0 creation")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-09 14:12:04 -07:00
YueHaibing
0bab1cdc8c decnet: fix using plain integer as NULL warning
Fixes the following sparse warning:
net/decnet/dn_route.c:407:30: warning: Using plain integer as NULL pointer
net/decnet/dn_route.c:1923:22: warning: Using plain integer as NULL pointer

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Kees Cook <keescook@chromium.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-09 14:11:24 -07:00
Maria Pasechnik
eb95f52fc7 net: ipv6_gre: Fix GRO to work on IPv6 over GRE tap
IPv6 GRO over GRE tap is not working while GRO is not set
over the native interface.

gro_list_prepare function updates the same_flow variable
of existing sessions to 1 if their mac headers match the one
of the incoming packet.
same_flow is used to filter out non-matching sessions and keep
potential ones for aggregation.

The number of bytes to compare should be the number of bytes
in the mac headers. In gro_list_prepare this number is set to
be skb->dev->hard_header_len. For GRE interfaces this hard_header_len
should be as it is set in the initialization process (when GRE is
created), it should not be overridden. But currently it is being overridden
by the value that is actually supposed to represent the needed_headroom.
Therefore, the number of bytes compared in order to decide whether the
the mac headers are the same is greater than the length of the headers.

As it's documented in netdevice.h, hard_header_len is the maximum
hardware header length, and needed_headroom is the extra headroom
the hardware may need.
hard_header_len is basically all the bytes received by the physical
till layer 3 header of the packet received by the interface.
For example, if the interface is a GRE tap then the needed_headroom
should be the total length of the following headers:
IP header of the physical, GRE header, mac header of GRE.
It is often used to calculate the MTU of the created interface.

This patch removes the override of the hard_header_len, and
assigns the calculated value to needed_headroom.
This way, the comparison in gro_list_prepare is really of
the mac headers, and if the packets have the same mac headers
the same_flow will be set to 1.

Performance testing: 45% higher bandwidth.
Measuring bandwidth of single-stream IPv4 TCP traffic over IPv6
GRE tap while GRO is not set on the native.
NIC: ConnectX-4LX
Before (GRO not working) : 7.2 Gbits/sec
After (GRO working): 10.5 Gbits/sec

Signed-off-by: Maria Pasechnik <mariap@mellanox.com>
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-09 14:07:38 -07:00
zhong jiang
ac5bb5b3b0 rpc: remove unneeded variable 'ret' in rdma_listen_handler
The ret is not modified after initalization, So just remove the variable
and return 0.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
Chuck Lever
11b4d66ea3 NFSD: Handle full-length symlinks
I've given up on the idea of zero-copy handling of SYMLINK on the
server side. This is because the Linux VFS symlink API requires the
symlink pathname to be in a NUL-terminated kmalloc'd buffer. The
NUL-termination is going to be problematic (watching out for
landing on a page boundary and dealing with a 4096-byte pathname).

I don't believe that SYMLINK creation is on a performance path or is
requested frequently enough that it will cause noticeable CPU cache
pollution due to data copies.

There will be two places where a transport callout will be necessary
to fill in the rqstp: one will be in the svc_fill_symlink_pathname()
helper that is used by NFSv2 and NFSv3, and the other will be in
nfsd4_decode_create().

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
Chuck Lever
3fd9557aec NFSD: Refactor the generic write vector fill helper
fill_in_write_vector() is nearly the same logic as
svc_fill_write_vector(), but there are a few differences so that
the former can handle multiple WRITE payloads in a single COMPOUND.

svc_fill_write_vector() can be adjusted so that it can be used in
the NFSv4 WRITE code path too. Instead of assuming the pages are
coming from rq_args.pages, have the caller pass in the page list.

The immediate benefit is a reduction of code duplication. It also
prevents the NFSv4 WRITE decoder from passing an empty vector
element when the transport has provided the payload in the xdr_buf's
page array.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
Chuck Lever
07d0ff3b0c svcrdma: Clean up Read chunk path
Simplify the error handling at the tail of recv_read_chunk() by
re-arranging rq_pages[] housekeeping and documenting it properly.

NB: In this path, svc_rdma_recvfrom returns zero. Therefore no
subsequent reply processing is done on the svc_rqstp, and thus the
rq_respages field does not need to be updated.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
Chuck Lever
a53d5cb064 svcrdma: Avoid releasing a page in svc_xprt_release()
svc_xprt_release() invokes svc_free_res_pages(), which releases
pages between rq_respages and rq_next_page.

Historically, the RPC/RDMA transport has set these two pointers to
be different by one, which means:

- one page gets released when svc_recv returns 0. This normally
happens whenever one or more RDMA Reads need to be dispatched to
complete construction of an RPC Call.

- one page gets released after every call to svc_send.

In both cases, this released page is immediately refilled by
svc_alloc_arg. There does not seem to be a reason for releasing this
page.

To avoid this unnecessary memory allocator traffic, set rq_next_page
more carefully.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
YueHaibing
9cc3b98d1f sunrpc: remove redundant variables 'checksumlen','blocksize' and 'data'
Variables 'checksumlen','blocksize' and 'data' are being assigned,
but are never used, hence they are redundant and can be removed.

Fix the following warning:

  net/sunrpc/auth_gss/gss_krb5_wrap.c:443:7: warning: variable ‘blocksize’ set but not used [-Wunused-but-set-variable]
  net/sunrpc/auth_gss/gss_krb5_crypto.c:376:15: warning: variable ‘checksumlen’ set but not used [-Wunused-but-set-variable]
  net/sunrpc/xprtrdma/svc_rdma.c:97:9: warning: variable ‘data’ set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Reviewed-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
2018-08-09 16:11:21 -04:00
David S. Miller
a736e07468 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
Overlapping changes in RXRPC, changing to ktime_get_seconds() whilst
adding some tracepoints.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-09 11:52:36 -07:00
Andrew Lunn
1be52e97ed dsa: slave: eee: Allow ports to use phylink
For a port to be able to use EEE, both the MAC and the PHY must
support EEE. A phy can be provided by both a phydev or phylink. Verify
at least one of these exist, not just phydev.

Fixes: aab9c4067d ("net: dsa: Plug in PHYLINK support")
Signed-off-by: Andrew Lunn <andrew@lunn.ch>
Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08 19:19:03 -07:00
Ursula Braun
7311d665ca net/smc: move sock lock in smc_ioctl()
When an SMC socket is connecting it is decided whether fallback to
TCP is needed. To avoid races between connect and ioctl move the
sock lock before the use_fallback check.

Reported-by: syzbot+5b2cece1a8ecb2ca77d8@syzkaller.appspotmail.com
Reported-by: syzbot+19557374321ca3710990@syzkaller.appspotmail.com
Fixes: 1992d99882 ("net/smc: take sock lock in smc_ioctl()")
Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08 19:14:22 -07:00
Ursula Braun
bd58c7e086 net/smc: allow sysctl rmem and wmem defaults for servers
Without setsockopt SO_SNDBUF and SO_RCVBUF settings, the sysctl
defaults net.ipv4.tcp_wmem and net.ipv4.tcp_rmem should be the base
for the sizes of the SMC sndbuf and rcvbuf. Any TCP buffer size
optimizations for servers should be ignored.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08 19:14:22 -07:00
Ursula Braun
caa21e19e0 net/smc: no shutdown in state SMC_LISTEN
Invoking shutdown for a socket in state SMC_LISTEN does not make
sense. Nevertheless programs like syzbot fuzzing the kernel may
try to do this. For SMC this means a socket refcounting problem.
This patch makes sure a shutdown call for an SMC socket in state
SMC_LISTEN simply returns with -ENOTCONN.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08 19:14:22 -07:00
David Howells
330bdcfadc rxrpc: Fix the keepalive generator [ver #2]
AF_RXRPC has a keepalive message generator that generates a message for a
peer ~20s after the last transmission to that peer to keep firewall ports
open.  The implementation is incorrect in the following ways:

 (1) It mixes up ktime_t and time64_t types.

 (2) It uses ktime_get_real(), the output of which may jump forward or
     backward due to adjustments to the time of day.

 (3) If the current time jumps forward too much or jumps backwards, the
     generator function will crank the base of the time ring round one slot
     at a time (ie. a 1s period) until it catches up, spewing out VERSION
     packets as it goes.

Fix the problem by:

 (1) Only using time64_t.  There's no need for sub-second resolution.

 (2) Use ktime_get_seconds() rather than ktime_get_real() so that time
     isn't perceived to go backwards.

 (3) Simplifying rxrpc_peer_keepalive_worker() by splitting it into two
     parts:

     (a) The "worker" function that manages the buckets and the timer.

     (b) The "dispatch" function that takes the pending peers and
     	 potentially transmits a keepalive packet before putting them back
     	 in the ring into the slot appropriate to the revised last-Tx time.

 (4) Taking everything that's pending out of the ring and splicing it into
     a temporary collector list for processing.

     In the case that there's been a significant jump forward, the ring
     gets entirely emptied and then the time base can be warped forward
     before the peers are processed.

     The warping can't happen if the ring isn't empty because the slot a
     peer is in is keepalive-time dependent, relative to the base time.

 (5) Limit the number of iterations of the bucket array when scanning it.

 (6) Set the timer to skip any empty slots as there's no point waking up if
     there's nothing to do yet.

This can be triggered by an incoming call from a server after a reboot with
AF_RXRPC and AFS built into the kernel causing a peer record to be set up
before userspace is started.  The system clock is then adjusted by
userspace, thereby potentially causing the keepalive generator to have a
meltdown - which leads to a message like:

	watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [kworker/0:1:23]
	...
	Workqueue: krxrpcd rxrpc_peer_keepalive_worker
	EIP: lock_acquire+0x69/0x80
	...
	Call Trace:
	 ? rxrpc_peer_keepalive_worker+0x5e/0x350
	 ? _raw_spin_lock_bh+0x29/0x60
	 ? rxrpc_peer_keepalive_worker+0x5e/0x350
	 ? rxrpc_peer_keepalive_worker+0x5e/0x350
	 ? __lock_acquire+0x3d3/0x870
	 ? process_one_work+0x110/0x340
	 ? process_one_work+0x166/0x340
	 ? process_one_work+0x110/0x340
	 ? worker_thread+0x39/0x3c0
	 ? kthread+0xdb/0x110
	 ? cancel_delayed_work+0x90/0x90
	 ? kthread_stop+0x70/0x70
	 ? ret_from_fork+0x19/0x24

Fixes: ace45bec6d ("rxrpc: Fix firewall route keepalive")
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08 19:10:26 -07:00
Chuck Lever
8d4fb8ff42 xprtrdma: Fix disconnect regression
I found that injecting disconnects with v4.18-rc resulted in
random failures of the multi-threaded git regression test.

The root cause appears to be that, after a reconnect, the
RPC/RDMA transport is waking pending RPCs before the transport has
posted enough Receive buffers to receive the Replies. If a Reply
arrives before enough Receive buffers are posted, the connection
is dropped. A few connection drops happen in quick succession as
the client and server struggle to regain credit synchronization.

This regression was introduced with commit 7c8d9e7c88 ("xprtrdma:
Move Receive posting to Receive handler"). The client is supposed to
post a single Receive when a connection is established because
it's not supposed to send more than one RPC Call before it gets
a fresh credit grant in the first RPC Reply [RFC 8166, Section
3.3.3].

Unfortunately there appears to be a longstanding bug in the Linux
client's credit accounting mechanism. On connect, it simply dumps
all pending RPC Calls onto the new connection. It's possible it has
done this ever since the RPC/RDMA transport was added to the kernel
ten years ago.

Servers have so far been tolerant of this bad behavior. Currently no
server implementation ever changes its credit grant over reconnects,
and servers always repost enough Receives before connections are
fully established.

The Linux client implementation used to post a Receive before each
of these Calls. This has covered up the flooding send behavior.

I could try to correct this old bug so that the client sends exactly
one RPC Call and waits for a Reply. Since we are so close to the
next merge window, I'm going to instead provide a simple patch to
post enough Receives before a reconnect completes (based on the
number of credits granted to the previous connection).

The spurious disconnects will be gone, but the client will still
send multiple RPC Calls immediately after a reconnect.

Addressing the latter problem will wait for a merge window because
a) I expect it to be a large change requiring lots of testing, and
b) obviously the Linux client has interoperated successfully since
day zero while still being broken.

Fixes: 7c8d9e7c88 ("xprtrdma: Move Receive posting to ... ")
Cc: stable@vger.kernel.org # v4.18+
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
2018-08-08 16:49:47 -04:00
Wei Yongjun
e7ea2a52ff netfilter: nfnetlink_osf: fix using plain integer as NULL warning
Fixes the following sparse warning:

net/netfilter/nfnetlink_osf.c:274:24: warning:
 Using plain integer as NULL pointer

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-08 19:05:39 +02:00
zhong jiang
5a0c6cee17 net:mod: remove unneeded variable 'ret' in init_p9
The ret is modified after initalization, so just remove it and
return 0.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08 09:40:44 -07:00
zhong jiang
fb3b467e06 net:af_iucv: get rid of the unneeded variable 'err' in afiucv_pm_freeze
We will not use the variable 'err' after initalization, So remove it and
return 0.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-08 09:39:36 -07:00
Cong Wang
0dcb82254d llc: use refcount_inc_not_zero() for llc_sap_find()
llc_sap_put() decreases the refcnt before deleting sap
from the global list. Therefore, there is a chance
llc_sap_find() could find a sap with zero refcnt
in this global list.

Close this race condition by checking if refcnt is zero
or not in llc_sap_find(), if it is zero then it is being
removed so we can just treat it as gone.

Reported-by: <syzbot+278893f3f7803871f7ce@syzkaller.appspotmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07 15:54:00 -07:00
Alexey Kodanev
61ef4b07fc dccp: fix undefined behavior with 'cwnd' shift in ccid2_cwnd_restart()
The shift of 'cwnd' with '(now - hc->tx_lsndtime) / hc->tx_rto' value
can lead to undefined behavior [1].

In order to fix this use a gradual shift of the window with a 'while'
loop, similar to what tcp_cwnd_restart() is doing.

When comparing delta and RTO there is a minor difference between TCP
and DCCP, the last one also invokes dccp_cwnd_restart() and reduces
'cwnd' if delta equals RTO. That case is preserved in this change.

[1]:
[40850.963623] UBSAN: Undefined behaviour in net/dccp/ccids/ccid2.c:237:7
[40851.043858] shift exponent 67 is too large for 32-bit type 'unsigned int'
[40851.127163] CPU: 3 PID: 15940 Comm: netstress Tainted: G        W   E     4.18.0-rc7.x86_64 #1
...
[40851.377176] Call Trace:
[40851.408503]  dump_stack+0xf1/0x17b
[40851.451331]  ? show_regs_print_info+0x5/0x5
[40851.503555]  ubsan_epilogue+0x9/0x7c
[40851.548363]  __ubsan_handle_shift_out_of_bounds+0x25b/0x2b4
[40851.617109]  ? __ubsan_handle_load_invalid_value+0x18f/0x18f
[40851.686796]  ? xfrm4_output_finish+0x80/0x80
[40851.739827]  ? lock_downgrade+0x6d0/0x6d0
[40851.789744]  ? xfrm4_prepare_output+0x160/0x160
[40851.845912]  ? ip_queue_xmit+0x810/0x1db0
[40851.895845]  ? ccid2_hc_tx_packet_sent+0xd36/0x10a0 [dccp]
[40851.963530]  ccid2_hc_tx_packet_sent+0xd36/0x10a0 [dccp]
[40852.029063]  dccp_xmit_packet+0x1d3/0x720 [dccp]
[40852.086254]  dccp_write_xmit+0x116/0x1d0 [dccp]
[40852.142412]  dccp_sendmsg+0x428/0xb20 [dccp]
[40852.195454]  ? inet_dccp_listen+0x200/0x200 [dccp]
[40852.254833]  ? sched_clock+0x5/0x10
[40852.298508]  ? sched_clock+0x5/0x10
[40852.342194]  ? inet_create+0xdf0/0xdf0
[40852.388988]  sock_sendmsg+0xd9/0x160
...

Fixes: 113ced1f52 ("dccp ccid-2: Perform congestion-window validation")
Signed-off-by: Alexey Kodanev <alexey.kodanev@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07 15:34:44 -07:00
YueHaibing
5941923da2 RDS: IB: fix 'passing zero to ERR_PTR()' warning
Fix a static code checker warning:
 net/rds/ib_frmr.c:82 rds_ib_alloc_frmr() warn: passing zero to 'ERR_PTR'

The error path for ib_alloc_mr failure should set err to PTR_ERR.

Fixes: 1659185fb4 ("RDS: IB: Support Fastreg MR (FRMR) memory registration mode")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07 13:19:45 -07:00
Ying Xue
37436d9c0e tipc: fix an interrupt unsafe locking scenario
Commit 9faa89d4ed ("tipc: make function tipc_net_finalize() thread
safe") tries to make it thread safe to set node address, so it uses
node_list_lock lock to serialize the whole process of setting node
address in tipc_net_finalize(). But it causes the following interrupt
unsafe locking scenario:

       CPU0                    CPU1
       ----                    ----
  rht_deferred_worker()
  rhashtable_rehash_table()
  lock(&(&ht->lock)->rlock)
			       tipc_nl_compat_doit()
                               tipc_net_finalize()
                               local_irq_disable();
                               lock(&(&tn->node_list_lock)->rlock);
                               tipc_sk_reinit()
                               rhashtable_walk_enter()
                               lock(&(&ht->lock)->rlock);
  <Interrupt>
  tipc_disc_rcv()
  tipc_node_check_dest()
  tipc_node_create()
  lock(&(&tn->node_list_lock)->rlock);

 *** DEADLOCK ***

When rhashtable_rehash_table() holds ht->lock on CPU0, it doesn't
disable BH. So if an interrupt happens after the lock, it can create
an inverse lock ordering between ht->lock and tn->node_list_lock. As
a consequence, deadlock might happen.

The reason causing the inverse lock ordering scenario above is because
the initial purpose of node_list_lock is not designed to do the
serialization of node address setting.

As cmpxchg() can guarantee CAS (compare-and-swap) process is atomic,
we use it to replace node_list_lock to ensure setting node address can
be atomically finished. It turns out the potential deadlock can be
avoided as well.

Fixes: 9faa89d4ed ("tipc: make function tipc_net_finalize() thread safe")
Signed-off-by: Ying Xue <ying.xue@windriver.com>
Acked-by: Jon Maloy <maloy@donjonn.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07 13:15:35 -07:00
Cong Wang
455f05ecd2 vsock: split dwork to avoid reinitializations
syzbot reported that we reinitialize an active delayed
work in vsock_stream_connect():

	ODEBUG: init active (active state 0) object type: timer_list hint:
	delayed_work_timer_fn+0x0/0x90 kernel/workqueue.c:1414
	WARNING: CPU: 1 PID: 11518 at lib/debugobjects.c:329
	debug_print_object+0x16a/0x210 lib/debugobjects.c:326

The pattern is apparently wrong, we should only initialize
the dealyed work once and could repeatly schedule it. So we
have to move out the initializations to allocation side.
And to avoid confusion, we can split the shared dwork
into two, instead of re-using the same one.

Fixes: d021c34405 ("VSOCK: Introduce VM Sockets")
Reported-by: <syzbot+8a9b1bd330476a4f3db6@syzkaller.appspotmail.com>
Cc: Andy king <acking@vmware.com>
Cc: Stefan Hajnoczi <stefanha@redhat.com>
Cc: Jorgen Hansen <jhansen@vmware.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07 12:39:13 -07:00
Shmulik Ladkani
3789cabaab ip6_tunnel: collect_md xmit: Use ip_tunnel_key's provided src address
When using an ip6tnl device in collect_md mode, the xmit methods ignore
the ipv6.src field present in skb_tunnel_info's key, both for route
calculation purposes (flowi6 construction) and for assigning the
packet's final ipv6h->saddr.

This makes it impossible specifying a desired ipv6 local address in the
encapsulating header (for example, when using tc action tunnel_key).

This is also not aligned with behavior of ipip (ipv4) in collect_md
mode, where the key->u.ipv4.src gets used.

Fix, by assigning fl6.saddr with given key->u.ipv6.src.
In case ipv6.src is not specified, ip6_tnl_xmit uses existing saddr
selection code.

Fixes: 8d79266bc4 ("ip6_tunnel: add collect_md mode to IPv6 tunnels")
Signed-off-by: Shmulik Ladkani <shmulik.ladkani@gmail.com>
Reviewed-by: Eyal Birger <eyal.birger@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07 12:36:46 -07:00
Vlad Buslov
9ca6163005 net: sched: cls_flower: set correct offload data in fl_reoffload
fl_reoffload implementation sets following members of struct
tc_cls_flower_offload incorrectly:
 - masked key instead of mask
 - key instead of masked key

Fix fl_reoffload to provide correct data to offload callback.

Fixes: 31533cba43 ("net: sched: cls_flower: implement offload tcf_proto_op")
Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
Acked-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07 12:35:17 -07:00
Pieter Jansen van Vuuren
0a6e77784f net/sched: allow flower to match tunnel options
Allow matching on options in Geneve tunnel headers.
This makes use of existing tunnel metadata support.

The options can be described in the form
CLASS:TYPE:DATA/CLASS_MASK:TYPE_MASK:DATA_MASK, where CLASS is
represented as a 16bit hexadecimal value, TYPE as an 8bit
hexadecimal value and DATA as a variable length hexadecimal value.

e.g.
 # ip link add name geneve0 type geneve dstport 0 external
 # tc qdisc add dev geneve0 ingress
 # tc filter add dev geneve0 protocol ip parent ffff: \
     flower \
       enc_src_ip 10.0.99.192 \
       enc_dst_ip 10.0.99.193 \
       enc_key_id 11 \
       geneve_opts 0102:80:1122334421314151/ffff:ff:ffffffffffffffff \
       ip_proto udp \
       action mirred egress redirect dev eth1

This patch adds support for matching Geneve options in the order
supplied by the user. This leads to an efficient implementation in
the software datapath (and in our opinion hardware datapaths that
offload this feature). It is also compatible with Geneve options
matching provided by the Open vSwitch kernel datapath which is
relevant here as the Flower classifier may be used as a mechanism
to program flows into hardware as a form of Open vSwitch datapath
offload (sometimes referred to as OVS-TC). The netlink
Kernel/Userspace API may be extended, for example by adding a flag,
if other matching options are desired, for example matching given
options in any order. This would require an implementation in the
TC software datapath. And be done in a way that drivers that
facilitate offload of the Flower classifier can reject or accept
such flows based on hardware datapath capabilities.

This approach was discussed and agreed on at Netconf 2017 in Seoul.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07 12:22:15 -07:00
Simon Horman
92e2c40536 flow_dissector: allow dissection of tunnel options from metadata
Allow the existing 'dissection' of tunnel metadata to 'dissect'
options already present in tunnel metadata. This dissection is
controlled by a new dissector key, FLOW_DISSECTOR_KEY_ENC_OPTS.

This dissection only occurs when skb_flow_dissect_tunnel_info()
is called, currently only the Flower classifier makes that call.
So there should be no impact on other users of the flow dissector.

This is in preparation for allowing the flower classifier to
match on Geneve options.

Signed-off-by: Simon Horman <simon.horman@netronome.com>
Signed-off-by: Pieter Jansen van Vuuren <pieter.jansenvanvuuren@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07 12:22:14 -07:00
David S. Miller
1ba982806c Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next
Daniel Borkmann says:

====================
pull-request: bpf-next 2018-08-07

The following pull-request contains BPF updates for your *net-next* tree.

The main changes are:

1) Add cgroup local storage for BPF programs, which provides a fast
   accessible memory for storing various per-cgroup data like number
   of transmitted packets, etc, from Roman.

2) Support bpf_get_socket_cookie() BPF helper in several more program
   types that have a full socket available, from Andrey.

3) Significantly improve the performance of perf events which are
   reported from BPF offload. Also convert a couple of BPF AF_XDP
   samples overto use libbpf, both from Jakub.

4) seg6local LWT provides the End.DT6 action, which allows to
   decapsulate an outer IPv6 header containing a Segment Routing Header.
   Adds this action now to the seg6local BPF interface, from Mathieu.

5) Do not mark dst register as unbounded in MOV64 instruction when
   both src and dst register are the same, from Arthur.

6) Define u_smp_rmb() and u_smp_wmb() to their respective barrier
   instructions on arm64 for the AF_XDP sample code, from Brian.

7) Convert the tcp_client.py and tcp_server.py BPF selftest scripts
   over from Python 2 to Python 3, from Jeremy.

8) Enable BTF build flags to the BPF sample code Makefile, from Taeung.

9) Remove an unnecessary rcu_read_lock() in run_lwt_bpf(), from Taehee.

10) Several improvements to the README.rst from the BPF documentation
    to make it more consistent with RST format, from Tobin.

11) Replace all occurrences of strerror() by calls to strerror_r()
    in libbpf and fix a FORTIFY_SOURCE build error along with it,
    from Thomas.

12) Fix a bug in bpftool's get_btf() function to correctly propagate
    an error via PTR_ERR(), from Yue.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-07 11:02:05 -07:00
Pablo Neira Ayuso
f699edb12a netfilter: nft_ct: enable conntrack for helpers
Enable conntrack if the user defines a helper to be used from the
ruleset policy.

Fixes: 1a64edf54f ("netfilter: nft_ct: add helper set support")
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-07 17:14:27 +02:00
Harsha Sharma
7e0b2b57f0 netfilter: nft_ct: add ct timeout support
This patch allows to add, list and delete connection tracking timeout
policies via nft objref infrastructure and assigning these timeout
via nft rule.

%./libnftnl/examples/nft-ct-timeout-add ip raw cttime tcp

Ruleset:

table ip raw {
   ct timeout cttime {
       protocol tcp;
       policy = {established: 111, close: 13 }
   }

   chain output {
       type filter hook output priority -300; policy accept;
       ct timeout set "cttime"
   }
}

%./libnftnl/examples/nft-rule-ct-timeout-add ip raw output cttime

%conntrack -E
[NEW] tcp      6 111 ESTABLISHED src=172.16.19.128 dst=172.16.19.1
sport=22 dport=41360 [UNREPLIED] src=172.16.19.1 dst=172.16.19.128
sport=41360 dport=22

%nft delete rule ip raw output handle <handle>
%./libnftnl/examples/nft-ct-timeout-del ip raw cttime

Joint work with Pablo Neira.

Signed-off-by: Harsha Sharma <harshasharmaiitr@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-07 17:14:23 +02:00
Pablo Neira Ayuso
6c1fd7dc48 netfilter: cttimeout: decouple timeout policy from nfnetlink_cttimeout object
The timeout policy is currently embedded into the nfnetlink_cttimeout
object, move the policy into an independent object. This allows us to
reuse part of the existing conntrack timeout extension from nf_tables
without adding dependencies with the nfnetlink_cttimeout object layout.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-07 17:14:15 +02:00
Harsha Sharma
4e665afbd7 netfilter: cttimeout: move ctnl_untimeout to nf_conntrack
As, ctnl_untimeout is required by nft_ct, so move ctnl_timeout from
nfnetlink_cttimeout to nf_conntrack_timeout and rename as nf_ct_timeout.

Signed-off-by: Harsha Sharma <harshasharmaiitr@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-07 17:14:10 +02:00
Fernando Fernandez Mancera
35a8a3bd1c netfilter: nft_osf: use NFT_OSF_MAXGENRELEN instead of IFNAMSIZ
As no "genre" on pf.os exceed 16 bytes of length, we reduce
NFT_OSF_MAXGENRELEN parameter to 16 bytes and use it instead of IFNAMSIZ.

Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-07 17:14:04 +02:00
Willem de Bruijn
4576cd469d packet: refine ring v3 block size test to hold one frame
TPACKET_V3 stores variable length frames in fixed length blocks.
Blocks must be able to store a block header, optional private space
and at least one minimum sized frame.

Frames, even for a zero snaplen packet, store metadata headers and
optional reserved space.

In the block size bounds check, ensure that the frame of the
chosen configuration fits. This includes sockaddr_ll and optional
tp_reserve.

Syzbot was able to construct a ring with insuffient room for the
sockaddr_ll in the header of a zero-length frame, triggering an
out-of-bounds write in dev_parse_header.

Convert the comparison to less than, as zero is a valid snap len.
This matches the test for minimum tp_frame_size immediately below.

Fixes: f6fb8f100b ("af-packet: TPACKET_V3 flexible buffer implementation.")
Fixes: eb73190f4f ("net/packet: refine check for priv area size")
Reported-by: syzbot <syzkaller@googlegroups.com>
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-06 13:48:33 -07:00
David S. Miller
de7de576ec Merge branch 'ieee802154-for-davem-2018-08-06' of git://git.kernel.org/pub/scm/linux/kernel/git/sschmidt/wpan-next
Stefan Schmidt says:

====================
pull-request: ieee802154-next 2018-08-06

An update from ieee802154 for *net-next*

Romuald added a socket option to get the LQI value of the received datagram.
Alexander added a new hardware simulation driver modelled after hwsim of the
wireless people. It allows runtime configuration for new nodes and edges over a
netlink interface (a config utlity is making its way into wpan-tools).
We also have three fixes in here. One from Colin which is more of a cleanup and
two from Alex fixing tailroom and single frame space problems.
I would normally put the last two into my fixes tree, but given we are already
in -rc8 I simply put them here and added a cc: stable to them.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-06 13:17:48 -07:00
Dan Carpenter
70837ffe30 ipv4: frags: precedence bug in ip_expire()
We accidentally removed the parentheses here, but they are required
because '!' has higher precedence than '&'.

Fixes: fa0f527358 ("ip: use rb trees for IP frag queue.")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-06 13:15:12 -07:00
Yafang Shao
9dae34978d net: avoid unnecessary sock_flag() check when enable timestamp
The sock_flag() check is alreay inside sock_enable_timestamp(), so it is
unnecessary checking it in the caller.

    void sock_enable_timestamp(struct sock *sk, int flag)
    {
        if (!sock_flag(sk, flag)) {
            ...
        }
    }

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-06 10:42:48 -07:00
zhong jiang
9c2e955c48 net/bridge/br_multicast: remove redundant variable "err"
The err is not modified after initalization, So remove it and make
it to be void function.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-06 10:33:44 -07:00
YueHaibing
ad3e0b2f3c Bluetooth: remove redundant variables 'adv_set' and 'cp'
Variables 'adv_set' and 'cp'  are being assigned but are never used hence
they are redundant and can be removed.

Cleans up clang warnings:
net/bluetooth/hci_event.c:1135:29: warning: variable 'adv_set' set but not used [-Wunused-but-set-variable]
net/bluetooth/mgmt.c:3359:39: warning: variable 'cp' set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Johan Hedberg <johan.hedberg@intel.com>
2018-08-06 17:06:58 +03:00
Colin Ian King
4e54acb202 net: ieee802154: 6lowpan: remove redundant pointers 'fq' and 'net'
Pointers fq and net are being assigned but are never used hence they
are redundant and can be removed.

Cleans up clang warnings:
warning: variable 'fq' set but not used [-Wunused-but-set-variable]
warning: variable 'net' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2018-08-06 11:21:15 +02:00
Alexander Aring
f9c5283113 net: mac802154: tx: expand tailroom if necessary
This patch is necessary if case of AF_PACKET or other socket interface
which I am aware of it and didn't allocated the necessary room.

Reported-by: David Palma <david.palma@ntnu.no>
Reported-by: Rabi Narayan Sahoo <rabinarayans0828@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2018-08-06 11:21:37 +02:00
Alexander Aring
ac74f87c78 net: 6lowpan: fix reserved space for single frames
This patch fixes patch add handling to take care tail and headroom for
single 6lowpan frames. We need to be sure we have a skb with the right
head and tailroom for single frames. This patch do it by using
skb_copy_expand() if head and tailroom is not enough allocated by upper
layer.

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=195059
Reported-by: David Palma <david.palma@ntnu.no>
Reported-by: Rabi Narayan Sahoo <rabinarayans0828@gmail.com>
Cc: stable@vger.kernel.org
Signed-off-by: Alexander Aring <aring@mojatatu.com>
Signed-off-by: Stefan Schmidt <stefan@datenfreihafen.org>
2018-08-06 11:21:15 +02:00
Stefan Schmidt
a304610803 Merge remote-tracking branch 'net-next/master' 2018-08-06 09:04:48 +02:00
Xin Long
82a40777de ip6_tunnel: use the right value for ipv4 min mtu check in ip6_tnl_xmit
According to RFC791, 68 bytes is the minimum size of IPv4 datagram every
device must be able to forward without further fragmentation while 576
bytes is the minimum size of IPv4 datagram every device has to be able
to receive, so in ip6_tnl_xmit(), 68(IPV4_MIN_MTU) should be the right
value for the ipv4 min mtu check in ip6_tnl_xmit.

While at it, change to use max() instead of if statement.

Fixes: c9fefa0819 ("ip6_tunnel: get the min mtu properly in ip6_tnl_xmit")
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 17:35:02 -07:00
David S. Miller
6277547f33 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2018-08-05

Here's the main bluetooth-next pull request for the 4.19 kernel.

 - Added support for Bluetooth Advertising Extensions
 - Added vendor driver support to hci_h5 HCI driver
 - Added serdev support to hci_h5 driver
 - Added support for Qualcomm wcn3990 controller
 - Added support for RTL8723BS and RTL8723DS controllers
 - btusb: Added new ID for Realtek 8723DE
 - Several other smaller fixes & cleanups

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 17:29:27 -07:00
Dan Carpenter
1cbc36a53b net: sched: cls_flower: Fix an error code in fl_tmplt_create()
We forgot to set the error code on this path, so we return NULL instead
of an error pointer.  In the current code kzalloc() won't fail for small
allocations so this doesn't really affect runtime.

Fixes: b95ec7eb3b ("net: sched: cls_flower: implement chain templates")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 17:25:46 -07:00
Li RongQing
a6bcfc8969 net: check extack._msg before print
dev_set_mtu_ext is able to fail with a valid mtu value, at that
condition, extack._msg is not set and random since it is in stack,
then kernel will crash when print it.

Fixes: 7a4c53bee3 ("net: report invalid mtu value via netlink extack")
Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 17:24:45 -07:00
Cong Wang
e70a3aad44 ipv6: fix double refcount of fib6_metrics
All the callers of ip6_rt_copy_init()/rt6_set_from() hold refcnt
of the "from" fib6_info, so there is no need to hold fib6_metrics
refcnt again, because fib6_metrics refcnt is only released when
fib6_info is gone, that is, they have the same life time, so the
whole fib6_metrics refcnt can be removed actually.

This fixes a kmemleak warning reported by Sabrina.

Fixes: 93531c6743 ("net/ipv6: separate handling of FIB entries from dst based routes")
Reported-by: Sabrina Dubroca <sd@queasysnail.net>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 17:22:45 -07:00
Florian Westphal
0ed4229b08 ipv6: defrag: drop non-last frags smaller than min mtu
don't bother with pathological cases, they only waste cycles.
IPv6 requires a minimum MTU of 1280 so we should never see fragments
smaller than this (except last frag).

v3: don't use awkward "-offset + len"
v2: drop IPv4 part, which added same check w. IPV4_MIN_MTU (68).
    There were concerns that there could be even smaller frags
    generated by intermediate nodes, e.g. on radio networks.

Cc: Peter Oskolkov <posk@google.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 17:21:14 -07:00
Peter Oskolkov
fa0f527358 ip: use rb trees for IP frag queue.
Similar to TCP OOO RX queue, it makes sense to use rb trees to store
IP fragments, so that OOO fragments are inserted faster.

Tested:

- a follow-up patch contains a rather comprehensive ip defrag
  self-test (functional)
- ran neper `udp_stream -c -H <host> -F 100 -l 300 -T 20`:
    netstat --statistics
    Ip:
        282078937 total packets received
        0 forwarded
        0 incoming packets discarded
        946760 incoming packets delivered
        18743456 requests sent out
        101 fragments dropped after timeout
        282077129 reassemblies required
        944952 packets reassembled ok
        262734239 packet reassembles failed
   (The numbers/stats above are somewhat better re:
    reassemblies vs a kernel without this patchset. More
    comprehensive performance testing TBD).

Reported-by: Jann Horn <jannh@google.com>
Reported-by: Juha-Matti Tilli <juha-matti.tilli@iki.fi>
Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 17:16:46 -07:00
Peter Oskolkov
385114dec8 net: modify skb_rbtree_purge to return the truesize of all purged skbs.
Tested: see the next patch is the series.

Suggested-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 17:16:46 -07:00
Peter Oskolkov
7969e5c40d ip: discard IPv4 datagrams with overlapping segments.
This behavior is required in IPv6, and there is little need
to tolerate overlapping fragments in IPv4. This change
simplifies the code and eliminates potential DDoS attack vectors.

Tested: ran ip_defrag selftest (not yet available uptream).

Suggested-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Peter Oskolkov <posk@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Florian Westphal <fw@strlen.de>
Acked-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 17:16:46 -07:00
Vakul Garg
cfb4099fb4 net/tls: Mark the end in scatterlist table
Function zerocopy_from_iter() unmarks the 'end' in input sgtable while
adding new entries in it. The last entry in sgtable remained unmarked.
This results in KASAN error report on using apis like sg_nents(). Before
returning, the function needs to mark the 'end' in the last entry it
adds.

Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
Acked-by: Dave Watson <davejwatson@fb.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 17:13:58 -07:00
Georg Kohmann
5f379ef51b ipv6: icmp: Updating pmtu for link local route
When a ICMPV6_PKT_TOOBIG is received from a link local address the pmtu will
be updated on a route with an arbitrary interface index. Subsequent packets
sent back to the same link local address may therefore end up not
considering the updated pmtu.

Current behavior breaks TAHI v6LC4.1.4 Reduce PMTU On-link. Referring to RFC
1981: Section 3: "Note that Path MTU Discovery must be performed even in
cases where a node "thinks" a destination is attached to the same link as
itself. In a situation such as when a neighboring router acts as proxy [ND]
for some destination, the destination can to appear to be directly
connected but is in fact more than one hop away."

Using the interface index from the incoming ICMPV6_PKT_TOOBIG when updating
the pmtu.

Signed-off-by: Georg Kohmann <geokohma@cisco.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 17:12:44 -07:00
David S. Miller
074fb88016 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter updates for net-next

The following patchset contains Netfilter updates for your net-next tree:

1) Support for transparent proxying for nf_tables, from Mate Eckl.

2) Patchset to add OS passive fingerprint recognition for nf_tables,
   from Fernando Fernandez. This takes common code from xt_osf and
   place it into the new nfnetlink_osf module for codebase sharing.

3) Lightweight tunneling support for nf_tables.

4) meta and lookup are likely going to be used in rulesets, make them
   direct calls. From Florian Westphal.

A bunch of incremental updates:

5) use PTR_ERR_OR_ZERO() from nft_numgen, from YueHaibing.

6) Use kvmalloc_array() to allocate hashtables, from Li RongQing.

7) Explicit dependencies between nfnetlink_cttimeout and conntrack
   timeout extensions, from Harsha Sharma.

8) Simplify NLM_F_CREATE handling in nf_tables.

9) Removed unused variable in the get element command, from
   YueHaibing.

10) Expose bridge hook priorities through uapi, from Mate Eckl.

And a few fixes for previous Netfilter batch for net-next:

11) Use per-netns mutex from flowtable event, from Florian Westphal.

12) Remove explicit dependency on iptables CT target from conntrack
    zones, from Florian.

13) Fix use-after-free in rmmod nf_conntrack path, also from Florian.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 16:25:22 -07:00
David S. Miller
c1c8626fce Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
Lots of overlapping changes, mostly trivial in nature.

The mlxsw conflict was resolving using the example
resolution at:

https://github.com/jpirko/linux_mlxsw/blob/combined_queue/drivers/net/ethernet/mellanox/mlxsw/core_acl_flex_actions.c

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-05 13:04:31 -07:00
Dmitry Safonov
91874ecf32 netlink: Don't shift on 64 for ngroups
It's legal to have 64 groups for netlink_sock.

As user-supplied nladdr->nl_groups is __u32, it's possible to subscribe
only to first 32 groups.

The check for correctness of .bind() userspace supplied parameter
is done by applying mask made from ngroups shift. Which broke Android
as they have 64 groups and the shift for mask resulted in an overflow.

Fixes: 61f4b23769 ("netlink: Don't shift with UB on nlk->ngroups")
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Herbert Xu <herbert@gondor.apana.org.au>
Cc: Steffen Klassert <steffen.klassert@secunet.com>
Cc: netdev@vger.kernel.org
Cc: stable@vger.kernel.org
Reported-and-Tested-by: Nathan Chancellor <natechancellor@gmail.com>
Signed-off-by: Dmitry Safonov <dima@arista.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-04 17:52:51 -07:00
David S. Miller
5dbfb6eca0 Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf
Daniel Borkmann says:

====================
pull-request: bpf 2018-08-05

The following pull-request contains BPF updates for your *net* tree.

The main changes are:

1) Fix bpftool percpu_array dump by using correct roundup to next
   multiple of 8 for the value size, from Yonghong.

2) Fix in AF_XDP's __xsk_rcv_zc() to not returning frames back to
   allocator since driver will recycle frame anyway in case of an
   error, from Jakub.

3) Fix up BPF test_lwt_seg6local test cases to final iproute2
   syntax, from Mathieu.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-04 17:51:55 -07:00
Ursula Braun
5607016cd1 net/smc: no cursor update send in state SMC_INIT
If a writer blocked condition is received without data, the current
consumer cursor is immediately sent. Servers could already receive this
condition in state SMC_INIT without finished tx-setup. This patch
avoids sending a consumer cursor update in this case.

Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-04 13:09:33 -07:00
zhong jiang
07d53ae4fb net: Remove some unneeded semicolon
These semicolons are not needed.  Just remove them.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-04 13:05:39 -07:00
YueHaibing
a01512b14d tcp: remove unneeded variable 'err'
variable 'err' is unmodified after initalization,
so simply cleans up it and returns 0.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-03 16:52:07 -07:00
Jason Baron
51f7e95187 af_unix: ensure POLLOUT on remote close() for connected dgram socket
Applications use -ECONNREFUSED as returned from write() in order to
determine that a socket should be closed. However, when using connected
dgram unix sockets in a poll/write loop, a final POLLOUT event can be
missed when the remote end closes. Thus, the poll is stuck forever:

          thread 1 (client)                   thread 2 (server)

connect() to server
write() returns -EAGAIN
unix_dgram_poll()
 -> unix_recvq_full() is true
                                       close()
                                        ->unix_release_sock()
                                         ->wake_up_interruptible_all()
unix_dgram_poll() (due to the
     wake_up_interruptible_all)
 -> unix_recvq_full() still is true
                                         ->free all skbs

Now thread 1 is stuck and will not receive anymore wakeups. In this
case, when thread 1 gets the -EAGAIN, it has not queued any skbs
otherwise the 'free all skbs' step would in fact cause a wakeup and
a POLLOUT return. So the race here is probably fairly rare because
it means there are no skbs that thread 1 queued and that thread 1
schedules before the 'free all skbs' step.

This issue was reported as a hang when /dev/log is closed.

The fix is to signal POLLOUT if the socket is marked as SOCK_DEAD, which
means a subsequent write() will get -ECONNREFUSED.

Reported-by: Ian Lance Taylor <iant@golang.org>
Cc: David Rientjes <rientjes@google.com>
Cc: Rainer Weikusat <rweikusat@mobileactivedefense.com>
Cc: Eric Dumazet <edumazet@google.com>
Signed-off-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-03 16:44:19 -07:00
Pablo Neira Ayuso
483f3fdcc7 netfilter: nft_tunnel: fix sparse errors
[...]
net/netfilter/nft_tunnel.c:117:25:    expected unsigned int [unsigned] [usertype] flags
net/netfilter/nft_tunnel.c:117:25:    got restricted __be16 [usertype] <noident>
[...]
net/netfilter/nft_tunnel.c:246:33:    expected restricted __be16 [addressable] [assigned] [usertype] tp_dst
net/netfilter/nft_tunnel.c:246:33:    got int

Fixes: af308b94a2 ("netfilter: nf_tables: add tunnel support")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-04 00:53:29 +02:00
David Howells
eb9950eb31 rxrpc: Push iov_iter up from rxrpc_kernel_recv_data() to caller
Push iov_iter up from rxrpc_kernel_recv_data() to its caller to allow
non-contiguous iovs to be passed down, thereby permitting file reading to
be simplified in the AFS filesystem in a future patch.

Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-03 12:46:20 -07:00
Guillaume Nault
f664e37dcc l2tp: fix missing refcount drop in pppol2tp_tunnel_ioctl()
If 'session' is not NULL and is not a PPP pseudo-wire, then we fail to
drop the reference taken by l2tp_session_get().

Fixes: ecd012e45a ("l2tp: filter out non-PPP sessions in pppol2tp_tunnel_ioctl()")
Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-03 12:39:01 -07:00
Florian Westphal
020f6cc5f7 netfilter: conntrack: avoid use-after free on rmmod
When the conntrack module is removed, we call nf_ct_iterate_destroy via
nf_ct_l4proto_unregister().

Problem is that nf_conntrack_proto_fini() gets called after the
conntrack hash table has already been freed.

Just remove the l4proto unregister call, its unecessary as the
nf_ct_protos[] array gets free'd right after anyway.

v2: add comment wrt. missing unreg call.

Fixes: a0ae2562c6 ("netfilter: conntrack: remove l3proto abstraction")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 21:15:13 +02:00
Florian Westphal
7bdfcea875 netfilter: kconfig: remove ct zone/label dependencies
connection tracking zones currently depend on the xtables CT target.
The reasoning was that it makes no sense to support zones if they can't
be configured (which needed CT target).

Nowadays zones can also be used by OVS and configured via nftables,
so remove the dependency.

connection tracking labels are handled via hidden dependency that gets
auto-selected by the connlabel match.
Make it a visible knob, as labels can be attached via ctnetlink
or via nftables rules (nft_ct expression) too.

This allows to use conntrack labels and zones with nftables-only build.

Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 21:15:12 +02:00
Pablo Neira Ayuso
445509eb9b netfilter: nf_tables: simplify NLM_F_CREATE handling
* From nf_tables_newchain(), codepath provides context that allows us to
  infer if we are updating a chain (in that case, no module autoload is
  required) or adding a new one (then, module autoload is indeed
  needed).
* We only need it in one single spot in nf_tables_newrule().
* Not needed for nf_tables_newset() at all.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 21:15:11 +02:00
Máté Eckl
94276fa8a2 netfilter: bridge: Expose nf_tables bridge hook priorities through uapi
Netfilter exposes standard hook priorities in case of ipv4, ipv6 and
arp but not in case of bridge.

This patch exposes the hook priority values of the bridge family (which are
different from the formerly mentioned) via uapi so that they can be used by
user-space applications just like the others.

Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 21:15:09 +02:00
Pablo Neira Ayuso
aaecfdb5c5 netfilter: nf_tables: match on tunnel metadata
This patch allows us to match on the tunnel metadata that is available
of the packet. We can use this to validate if the packet comes from/goes
to tunnel and the corresponding tunnel ID.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 21:12:19 +02:00
Pablo Neira Ayuso
af308b94a2 netfilter: nf_tables: add tunnel support
This patch implements the tunnel object type that can be used to
configure tunnels via metadata template through the existing lightweight
API from the ingress path.

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 21:12:12 +02:00
Máté Eckl
033eab53ff netfilter: nft_tproxy: Add missing config check
A config check was missing form the code when using
nf_defrag_ipv6_enable with NFT_TPROXY != n and NF_DEFRAG_IPV6 = n and
this caused the following error:

../net/netfilter/nft_tproxy.c: In function 'nft_tproxy_init':
../net/netfilter/nft_tproxy.c:237:3: error: implicit declaration of function
+'nf_defrag_ipv6_enable' [-Werror=implicit-function-declaration]
   err = nf_defrag_ipv6_enable(ctx->net);

This patch adds a check for NF_TABLES_IPV6 when NF_DEFRAG_IPV6 is
selected by Kconfig.

Reported-by: Randy Dunlap <rdunlap@infradead.org>
Fixes: 4ed8eb6570 ("netfilter: nf_tables: Add native tproxy support")
Signed-off-by: Máté Eckl <ecklm94@gmail.com>
Acked-by: Randy Dunlap <rdunlap@infradead.org>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 20:20:53 +02:00
Guillaume Nault
e9697e2eff l2tp: ignore L2TP_ATTR_MTU
This attribute's handling is broken. It can only be used when creating
Ethernet pseudo-wires, in which case its value can be used as the
initial MTU for the l2tpeth device.
However, when handling update requests, L2TP_ATTR_MTU only modifies
session->mtu. This value is never propagated to the l2tpeth device.
Dump requests also return the value of session->mtu, which is not
synchronised anymore with the device MTU.

The same problem occurs if the device MTU is properly updated using the
generic IFLA_MTU attribute. In this case, session->mtu is not updated,
and L2TP_ATTR_MTU will report an invalid value again when dumping the
session.

It does not seem worthwhile to complexify l2tp_eth.c to synchronise
session->mtu with the device MTU. Even the ip-l2tp manpage advises to
use 'ip link' to initialise the MTU of l2tpeth devices (iproute2 does
not handle L2TP_ATTR_MTU at all anyway). So let's just ignore it
entirely.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-03 10:03:57 -07:00
Guillaume Nault
789141b215 l2tp: simplify MTU handling in l2tp_ppp
The value of the session's .mtu field, as defined by
pppol2tp_connect() or pppol2tp_session_create(), is later overwritten
by pppol2tp_session_init() (unless getting the tunnel's socket PMTU
fails). This field is then only used when setting the PPP channel's MTU
in pppol2tp_connect().
Furthermore, the SIOC[GS]IFMTU ioctls only act on the session's .mtu
without propagating this value to the PPP channel, making them useless.

This patch initialises the PPP channel's MTU directly and ignores the
session's .mtu entirely. MTU is still computed by subtracting the
PPPOL2TP_HEADER_OVERHEAD constant. It is not optimal, but that doesn't
really matter: po->chan.mtu is only used when the channel is part of a
multilink PPP bundle. Running multilink PPP over packet switched
networks is certainly not going to be efficient, so not picking the
best MTU does not harm (in the worst case, packets will just be
fragmented by the underlay).

The SIOC[GS]IFMTU ioctls are removed entirely (as opposed to simply
ignored), because these ioctls commands are part of the requests that
should be handled generically by the socket layer. PX_PROTO_OL2TP was
the only socket type abusing these ioctls.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-03 10:03:57 -07:00
Guillaume Nault
1f5cd2a010 l2tp: define l2tp_tunnel_dst_mtu()
Consolidate retrieval of tunnel's socket mtu in order to simplify
l2tp_eth and l2tp_ppp a bit.

Signed-off-by: Guillaume Nault <g.nault@alphalink.fr>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-03 10:03:57 -07:00
Harsha Sharma
c753032690 netfilter: cttimeout: Make NF_CT_NETLINK_TIMEOUT depend on NF_CONNTRACK_TIMEOUT
With this, remove ifdef for CONFIG_NF_CONNTRACK_TIMEOUT in
nfnetlink_cttimeout. This is also required for moving ctnl_untimeout
from nfnetlink_cttimeout to nf_conntrack_timeout.

Signed-off-by: Harsha Sharma <harshasharmaiitr@gmail.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 18:50:41 +02:00
YueHaibing
1974d2453f netfilter: nf_tables: remove unused variable
Variable 'ext' is being assigned but are never used hence they are
unused and can be removed.

Cleans up clang warnings:
net/netfilter/nf_tables_api.c:4032:28: warning: variable ‘ext’ set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 18:50:35 +02:00
Kees Cook
54424d3891 rxrpc: Reuse SKCIPHER_REQUEST_ON_STACK buffer
The use of SKCIPHER_REQUEST_ON_STACK() will trigger FRAME_WARN warnings
(when less than 2048) once the VLA is no longer hidden from the check:

net/rxrpc/rxkad.c:398:1: warning: the frame size of 1152 bytes is larger than 1024 bytes [-Wframe-larger-than=]
net/rxrpc/rxkad.c:242:1: warning: the frame size of 1152 bytes is larger than 1024 bytes [-Wframe-larger-than=]

This passes the initial SKCIPHER_REQUEST_ON_STACK allocation to the leaf
functions for reuse. Two requests allocated on the stack is not needed
when only one is used at a time.

Signed-off-by: Kees Cook <keescook@chromium.org>
Acked-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-03 09:48:12 -07:00
Jiri Pirko
5ca8a25c14 net: sched: fix flush on non-existing chain
User was able to perform filter flush on chain 0 even if it didn't have
any filters in it. With the patch that avoided implicit chain 0
creation, this changed. So in case user wants filter flush on chain
which does not exist, just return success. There's no reason for non-0
chains to behave differently than chain 0, so do the same for them.

Reported-by: Ido Schimmel <idosch@mellanox.com>
Fixes: f71e0ca4db ("net: sched: Avoid implicit chain 0 creation")
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-03 09:44:37 -07:00
Florian Westphal
9e619d87b2 netfilter: nf_tables: flow event notifier must use transaction mutex
Fixes: f102d66b33 ("netfilter: nf_tables: use dedicated mutex to guard transactions")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 18:38:31 +02:00
Fernando Fernandez Mancera
ddba40be59 netfilter: nfnetlink_osf: rename nf_osf header file to nfnetlink_osf
The first client of the nf_osf.h userspace header is nft_osf, coming in
this batch, rename it to nfnetlink_osf.h as there are no userspace
clients for this yet, hence this looks consistent with other nfnetlink
subsystem.

Suggested-by: Jan Engelhardt <jengelh@inai.de>
Signed-off-by: Fernando Fernandez Mancera <ffmancera@riseup.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 18:38:30 +02:00
Li RongQing
285189c78e netfilter: use kvmalloc_array to allocate memory for hashtable
nf_ct_alloc_hashtable is used to allocate memory for conntrack,
NAT bysrc and expectation hashtable. Assuming 64k bucket size,
which means 7th order page allocation, __get_free_pages, called
by nf_ct_alloc_hashtable, will trigger the direct memory reclaim
and stall for a long time, when system has lots of memory stress

so replace combination of __get_free_pages and vzalloc with
kvmalloc_array, which provides a overflow check and a fallback
if no high order memory is available, and do not retry to reclaim
memory, reduce stall

and remove nf_ct_free_hashtable, since it is just a kvfree

Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
Signed-off-by: Wang Li <wangli39@baidu.com>
Signed-off-by: Li RongQing <lirongqing@baidu.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2018-08-03 18:37:55 +02:00
Eric Biggers
8c30fbe63e crypto: scatterwalk - remove 'chain' argument from scatterwalk_crypto_chain()
All callers pass chain=0 to scatterwalk_crypto_chain().

Remove this unneeded parameter.

Signed-off-by: Eric Biggers <ebiggers@google.com>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
2018-08-03 18:06:03 +08:00
Steffen Klassert
07bf790895 xfrm: Validate address prefix lengths in the xfrm selector.
We don't validate the address prefix lengths in the xfrm
selector we got from userspace. This can lead to undefined
behaviour in the address matching functions if the prefix
is too big for the given address family. Fix this by checking
the prefixes and refuse SA/policy insertation when a prefix
is invalid.

Fixes: 1da177e4c3 ("Linux-2.6.12-rc2")
Reported-by: Air Icy <icytxw@gmail.com>
Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
2018-08-03 10:23:07 +02:00
Roman Gushchin
f42ee093be bpf/test_run: support cgroup local storage
Allocate a temporary cgroup storage to use for bpf program test runs.

Because the test program is not actually attached to a cgroup,
the storage is allocated manually just for the execution
of the bpf program.

If the program is executed multiple times, the storage is not zeroed
on each run, emulating multiple runs of the program, attached to
a real cgroup.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03 00:47:32 +02:00
Roman Gushchin
cd33943176 bpf: introduce the bpf_get_local_storage() helper function
The bpf_get_local_storage() helper function is used
to get a pointer to the bpf local storage from a bpf program.

It takes a pointer to a storage map and flags as arguments.
Right now it accepts only cgroup storage maps, and flags
argument has to be 0. Further it can be extended to support
other types of local storage: e.g. thread local storage etc.

Signed-off-by: Roman Gushchin <guro@fb.com>
Cc: Alexei Starovoitov <ast@kernel.org>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Martin KaFai Lau <kafai@fb.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-08-03 00:47:32 +02:00
Matthieu Baerts
6b431d50d2 net/socket: remove duplicated init code
This refactoring work has been started by David Howells in cdfbabfb2f
(net: Work around lockdep limitation in sockets that use sockets) but
the exact same day in 581319c586 (net/socket: use per af lockdep
classes for sk queues), Paolo Abeni added new classes.

This reduces the amount of (nearly) duplicated code and eases the
addition of new socket types.

Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-02 14:47:49 -07:00
Ilya Dryomov
f1d10e0463 libceph: weaken sizeof check in ceph_x_verify_authorizer_reply()
Allow for extending ceph_x_authorize_reply in the future.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-08-02 21:33:26 +02:00
Ilya Dryomov
130f52f2b2 libceph: check authorizer reply/challenge length before reading
Avoid scribbling over memory if the received reply/challenge is larger
than the buffer supplied with the authorizer.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-08-02 21:33:26 +02:00
Ilya Dryomov
cc255c76c7 libceph: implement CEPHX_V2 calculation mode
Derive the signature from the entire buffer (both AES cipher blocks)
instead of using just the first half of the first block, leaving out
data_crc entirely.

This addresses CVE-2018-1129.

Link: http://tracker.ceph.com/issues/24837
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-08-02 21:33:25 +02:00
Ilya Dryomov
6daca13d2e libceph: add authorizer challenge
When a client authenticates with a service, an authorizer is sent with
a nonce to the service (ceph_x_authorize_[ab]) and the service responds
with a mutation of that nonce (ceph_x_authorize_reply).  This lets the
client verify the service is who it says it is but it doesn't protect
against a replay: someone can trivially capture the exchange and reuse
the same authorizer to authenticate themselves.

Allow the service to reject an initial authorizer with a random
challenge (ceph_x_authorize_challenge).  The client then has to respond
with an updated authorizer proving they are able to decrypt the
service's challenge and that the new authorizer was produced for this
specific connection instance.

The accepting side requires this challenge and response unconditionally
if the client side advertises they have CEPHX_V2 feature bit.

This addresses CVE-2018-1128.

Link: http://tracker.ceph.com/issues/24836
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-08-02 21:33:24 +02:00
Ilya Dryomov
149cac4a50 libceph: factor out encrypt_authorizer()
Will be used for encrypting both the initial and updated authorizers.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-08-02 21:33:24 +02:00
Ilya Dryomov
c571fe24d2 libceph: factor out __ceph_x_decrypt()
Will be used for decrypting the server challenge which is only preceded
by ceph_x_encrypt_header.

Drop struct_v check to allow for extending ceph_x_encrypt_header in the
future.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-08-02 21:33:23 +02:00
Ilya Dryomov
c0f56b483a libceph: factor out __prepare_write_connect()
Will be used for sending ceph_msg_connect with an updated authorizer,
after the server challenges the initial authorizer.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-08-02 21:33:22 +02:00
Ilya Dryomov
262614c429 libceph: store ceph_auth_handshake pointer in ceph_connection
We already copy authorizer_reply_buf and authorizer_reply_buf_len into
ceph_connection.  Factoring out __prepare_write_connect() requires two
more: authorizer_buf and authorizer_buf_len.  Store the pointer to the
handshake in con->auth rather than piling on.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
Reviewed-by: Sage Weil <sage@redhat.com>
2018-08-02 21:33:22 +02:00
Stephen Hemminger
24e1dd6afd ceph: fix whitespace
Remove blank lines at end of file and trailing whitespace.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:21 +02:00
Arnd Bergmann
fac02ddf91 libceph: use timespec64 for r_mtime
The request mtime field is used all over ceph, and is currently
represented as a 'timespec' structure in Linux. This changes it to
timespec64 to allow times beyond 2038, modifying all users at the
same time.

[ Remove now redundant ts variable in writepage_nounlock(). ]

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:33:14 +02:00
Arnd Bergmann
473bd2d780 libceph: use timespec64 in for keepalive2 and ticket validity
ceph_con_keepalive_expired() is the last user of timespec_add() and some
of the last uses of ktime_get_real_ts().  Replacing this with timespec64
based interfaces  lets us remove that deprecated API.

I'm introducing new ceph_encode_timespec64()/ceph_decode_timespec64()
here that take timespec64 structures and convert to/from ceph_timespec,
which is defined to have an unsigned 32-bit tv_sec member. This extends
the range of valid times to year 2106, avoiding the year 2038 overflow.

The ceph file system portion still uses the old functions for inode
timestamps, this will be done separately after the VFS layer is converted.

Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:12 +02:00
Ilya Dryomov
2f56b6bae7 libceph: amend "bad option arg" error message
Don't mention "mount" -- in the rbd case it is "mapping".

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
Chengguang Xu
17173c82e3 libceph: stop parsing when a bad int arg is detected
There is no reason to continue option parsing after detecting
bad option.

[ Return match_int() errors from ceph_parse_options() to match the
  behaviour of parse_rbd_opts_token() and parse_fsopt_token(). ]

Signed-off-by: Chengguang Xu <cgxu519@gmx.com>
Reviewed-by: Ilya Dryomov <idryomov@gmail.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
Ilya Dryomov
6d54228fd1 libceph: make ceph_osdc_notify{,_ack}() payload_len u32
The wire format dictates that payload_len fits into 4 bytes.  Take u32
instead of size_t to reflect that.

All callers pass a small integer, so no changes required.

Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
2018-08-02 21:26:11 +02:00
David S. Miller
89b1698c93 Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/net
The BTF conflicts were simple overlapping changes.

The virtio_net conflict was an overlap of a fix of statistics counter,
happening alongisde a move over to a bonafide statistics structure
rather than counting value on the stack.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-02 10:55:32 -07:00
Ganesh Goudar
9aba2f801e net: Fix coding style in skb_push()
Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-02 10:29:25 -07:00
Wei Yongjun
770b26de1e rxrpc: Remove set but not used variable 'nowj'
Fixes gcc '-Wunused-but-set-variable' warning:

net/rxrpc/proc.c: In function 'rxrpc_call_seq_show':
net/rxrpc/proc.c:66:29: warning:
 variable 'nowj' set but not used [-Wunused-but-set-variable]
  unsigned long timeout = 0, nowj;
                             ^

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-02 10:18:20 -07:00
David S. Miller
e6aed040ea Revert "net/ipv6: fix metrics leak"
This reverts commit df18b50448.

This change causes other problems and use-after-free situations as
found by syzbot.

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 21:32:30 -07:00
David S. Miller
b69ab96ab1 RxRPC development
-----BEGIN PGP SIGNATURE-----
 
 iQIVAwUAW2GoxPu3V2unywtrAQJf6A//TXRbmUri7DBFXf5iDPaA7ItPFG0wBmmu
 E68/SAMsWZsYnpAY5HGIDufjjdvPl9R7TMSsurIyZl1ZMwzwFiO7LtK5pXvfe17a
 UJbZc1jPRn8sUuC3bDhRlAtLETqw9Wx0n9GAbLW06XdQsrwnBrg4yGA6HUc2iy8w
 l0b56G9VhPV27hgKDIvhtkL4c+Ek8qjV6f6bcPSGNtmoepPVh9Jg0fJY7zZbRWm4
 tFiPv1nTd/ojQm0MMsyPodCkK+oG3tCji17fy0ZYiXv2nupBXDS6NOoNActRJ1CA
 RE3hINoeTLtm7h5hlzCEwkG1qr6QPNE9QmSoJY9aViuJTjJKlIJGa/0i2Nl0rpgu
 HLzg3ifpcrI//KywkTFVxLk1Fp/A6JZK5fNPibpXXoVB6U6Zl+BfpaHoJ7kmnODT
 xX3NbM0qRV5bbzHWnxiG1UieXDQWr7Sc+0cJslz0sTj/64ktJ4ldwJLdO5El2xrU
 QHCOIQEsB5YXTx7vAmsXDnMNDmKgnlXXzkzjcG1dJPlOvcLmtm/5HOOaum2/A7ox
 HuV6wbtHwOTr8KvZnbcsa0pMCctC0icEpbg9TyZf43zEmMdWNmep3A/vRf72fxv4
 EFCsx5jc1A3KHDc0HaK8pmTVqUxW6al56iLH5gzn/KRwqgx392iutBJ/2y5WM4Z0
 8kzM5XMDpQg=
 =pQXe
 -----END PGP SIGNATURE-----

Merge tag 'rxrpc-next-20180801' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs

David Howells says:

====================
rxrpc: Development

Here are some patches that add some more tracepoints to AF_RXRPC and fix
some issues therein.  The most significant points are:

 (1) Display the call timeout information in /proc/net/rxrpc/calls.

 (2) Save the call's debug_id in the rxrpc_channel struct so that it can be
     used in traces after the rxrpc_call struct has been destroyed.

 (3) Increase the size of the kAFS Rx window from 32 to 63 to be about the
     same as the Auristor server.

 (4) Propose the terminal ACK for a client call after it has received all
     its data to be transmitted after a short interval so that it will get
     transmitted if not first superseded by a new call on the same channel.

 (5) Flush ACKs during the data reception if we detect that we've run out
     of data.[*]

 (6) Trace successful packet transmission and softirq to process context
     socket notification.

[*] Note that on a uncontended gigabit network, rxrpc runs in to trouble
    with ACK packets getting batched together (up to ~32 at a time)
    somewhere between the IP transmit queue on the client and the ethernet
    receive queue on the server.

    I can see the kernel afs filesystem client and Auristor userspace
    server stalling occasionally on a 512MB single read.  Sticking
    tracepoints in the network driver at either end seems to show that,
    although the ACK transmissions made by the client are reasonably spaced
    timewise, the received ACKs come in batches from the network card on
    the server.

    I'm not sure what, if anything, can be done about this.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 11:50:55 -07:00
YueHaibing
c01f6c9b32 rxrpc: Fix user call ID check in rxrpc_service_prealloc_one
There just check the user call ID isn't already in use, hence should
compare user_call_ID with xcall->user_call_ID, which is current
node's user_call_ID.

Fixes: 540b1c48c3 ("rxrpc: Fix deadlock between call creation and sendmsg/recvmsg")
Suggested-by: David Howells <dhowells@redhat.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 11:49:49 -07:00
Jiri Pirko
290b1c8b1a net: sched: make tcf_chain_{get,put}() static
These are no longer used outside of cls_api.c so make them static.
Move tcf_chain_flush() to avoid fwd declaration of tcf_chain_put().

Signed-off-by: Jiri Pirko <jiri@mellanox.com>

v1->v2:
- new patch

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 10:06:19 -07:00
Jiri Pirko
5368140730 net: sched: fix notifications for action-held chains
Chains that only have action references serve as placeholders.
Until a non-action reference is created, user should not be aware
of the chain. Also he should not receive any notifications about it.
So send notifications for the new chain only in case the chain gets
the first non-action reference. Symmetrically to that, when
the last non-action reference is dropped, send the notification about
deleted chain.

Reported-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>

v1->v2:
- made __tcf_chain_{get,put}() static as suggested by Cong

Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 10:06:19 -07:00
Jiri Pirko
3d32f4c548 net: sched: change name of zombie chain to "held_by_acts_only"
As mentioned by Cong and Jakub during the review process, it is a bit
odd to sometimes (act flow) create a new chain which would be
immediately a "zombie". So just rename it to "held_by_acts_only".

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Suggested-by: Cong Wang <xiyou.wangcong@gmail.com>
Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Acked-by: Cong Wang <xiyou.wangcong@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 10:06:19 -07:00
YueHaibing
87f70132b0 rds: remove redundant variable 'rds_ibdev'
Variable 'rds_ibdev' is being assigned but never used,
so can be removed.

fix this clang warning:
 net/rds/ib_send.c:762:24: warning: variable ‘rds_ibdev’ set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 10:00:54 -07:00
YueHaibing
bd707f17ef strparser: remove redundant variable 'rd_desc'
Variable 'rd_desc' is being assigned but never used,
so can be removed.

fix this clang warning:
net/strparser/strparser.c:411:20: warning: variable ‘rd_desc’ set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 10:00:06 -07:00
YueHaibing
1296ee8ffc ip_gre: remove redundant variables t_hlen
After commit ffc2b6ee41 ("ip_gre: fix IFLA_MTU ignored on NEWLINK")
variable t_hlen is assigned values that are never read,
hence they are redundant and can be removed.

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:58:15 -07:00
Wei Yongjun
13dde04f5c tcp: remove set but not used variable 'skb_size'
Fixes gcc '-Wunused-but-set-variable' warning:

net/ipv4/tcp_output.c: In function 'tcp_collapse_retrans':
net/ipv4/tcp_output.c:2700:6: warning:
 variable 'skb_size' set but not used [-Wunused-but-set-variable]
  int skb_size, next_skb_size;
      ^

Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:57:09 -07:00
Wei Wang
7ec65372ca tcp: add stat of data packet reordering events
Introduce a new TCP stats to record the number of reordering events seen
and expose it in both tcp_info (TCP_INFO) and opt_stats
(SOF_TIMESTAMPING_OPT_STATS).
Application can use this stats to track the frequency of the reordering
events in addition to the existing reordering stats which tracks the
magnitude of the latest reordering event.

Note: this new stats tracks reordering events triggered by ACKs, which
could often be fewer than the actual number of packets being delivered
out-of-order.

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:56:10 -07:00
Wei Wang
7e10b6554f tcp: add dsack blocks received stats
Introduce a new TCP stat to record the number of DSACK blocks received
(RFC4989 tcpEStatsStackDSACKDups) and expose it in both tcp_info
(TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS).

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:56:10 -07:00
Wei Wang
fb31c9b9f6 tcp: add data bytes retransmitted stats
Introduce a new TCP stat to record the number of bytes retransmitted
(RFC4898 tcpEStatsPerfOctetsRetrans) and expose it in both tcp_info
(TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS).

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:56:10 -07:00
Wei Wang
ba113c3aa7 tcp: add data bytes sent stats
Introduce a new TCP stat to record the number of bytes sent
(RFC4898 tcpEStatsPerfHCDataOctetsOut) and expose it in both tcp_info
(TCP_INFO) and opt_stats (SOF_TIMESTAMPING_OPT_STATS).

Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:56:10 -07:00
Wei Wang
984988aa72 tcp: add a helper to calculate size of opt_stats
This is to refactor the calculation of the size of opt_stats to a helper
function to make the code cleaner and easier for later changes.

Suggested-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: Wei Wang <weiwan@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:56:10 -07:00
Florian Fainelli
a94c689e6c net: dsa: Do not suspend/resume closed slave_dev
If a DSA slave network device was previously disabled, there is no need
to suspend or resume it.

Fixes: 2446254915 ("net: dsa: allow switch drivers to implement suspend/resume hooks")
Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Andrew Lunn <andrew@lunn.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:53:45 -07:00
Petr Machata
d18c5d1995 net: ipv4: Notify about changes to ip_forward_update_priority
Drivers may make offloading decision based on whether
ip_forward_update_priority is enabled or not. Therefore distribute
netevent notifications to give them a chance to react to a change.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:52:30 -07:00
Petr Machata
432e05d328 net: ipv4: Control SKB reprioritization after forwarding
After IPv4 packets are forwarded, the priority of the corresponding SKB
is updated according to the TOS field of IPv4 header. This overrides any
prioritization done earlier by e.g. an skbedit action or ingress-qos-map
defined at a vlan device.

Such overriding may not always be desirable. Even if the packet ends up
being routed, which implies this is an L3 network node, an administrator
may wish to preserve whatever prioritization was done earlier on in the
pipeline.

Therefore introduce a sysctl that controls this behavior. Keep the
default value at 1 to maintain backward-compatible behavior.

Signed-off-by: Petr Machata <petrm@mellanox.com>
Reviewed-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:52:30 -07:00
Jeremy Cline
bc5b6c0b62 netlink: Fix spectre v1 gadget in netlink_create()
'protocol' is a user-controlled value, so sanitize it after the bounds
check to avoid using it for speculative out-of-bounds access to arrays
indexed by it.

This addresses the following accesses detected with the help of smatch:

* net/netlink/af_netlink.c:654 __netlink_create() warn: potential
  spectre issue 'nlk_cb_mutex_keys' [w]

* net/netlink/af_netlink.c:654 __netlink_create() warn: potential
  spectre issue 'nlk_cb_mutex_key_strings' [w]

* net/netlink/af_netlink.c:685 netlink_create() warn: potential spectre
  issue 'nl_table' [w] (local cap)

Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: Jeremy Cline <jcline@redhat.com>
Reviewed-by: Josh Poimboeuf <jpoimboe@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:50:58 -07:00
Vincent Bernat
83ba464515 net: add helpers checking if socket can be bound to nonlocal address
The construction "net->ipv4.sysctl_ip_nonlocal_bind || inet->freebind
|| inet->transparent" is present three times and its IPv6 counterpart
is also present three times. We introduce two small helpers to
characterize these tests uniformly.

Signed-off-by: Vincent Bernat <vincent@bernat.im>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:50:04 -07:00
zhong jiang
969d509003 net/tls: Use kmemdup to simplify the code
Kmemdup is better than kmalloc+memcpy. So replace them.

Signed-off-by: zhong jiang <zhongjiang@huawei.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:47:47 -07:00
Colin Ian King
b053fcc4a1 net/tipc: remove redundant variables 'tn' and 'oport'
Variables 'tn' and 'oport'  are being assigned but are never used hence
they are redundant and can be removed.

Cleans up clang warnings:
warning: variable 'oport' set but not used [-Wunused-but-set-variable]
warning: variable 'tn' set but not used [-Wunused-but-set-variable]

Signed-off-by: Colin Ian King <colin.king@canonical.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:46:49 -07:00
Ka-Cheong Poon
e65d4d9633 rds: Remove IPv6 dependency
This patch removes the IPv6 dependency from RDS.

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:32:35 -07:00
Ka-Cheong Poon
f394ad28fe rds: rds_ib_recv_alloc_cache() should call alloc_percpu_gfp() instead
Currently, rds_ib_conn_alloc() calls rds_ib_recv_alloc_caches()
without passing along the gfp_t flag.  But rds_ib_recv_alloc_caches()
and rds_ib_recv_alloc_cache() should take a gfp_t parameter so that
rds_ib_recv_alloc_cache() can call alloc_percpu_gfp() using the
correct flag instead of calling alloc_percpu().

Signed-off-by: Ka-Cheong Poon <ka-cheong.poon@oracle.com>
Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2018-08-01 09:32:35 -07:00
David Howells
d0b35a4203 rxrpc: Transmit more ACKs during data reception
Immediately flush any outstanding ACK on entry to rxrpc_recvmsg_data() -
which transfers data to the target buffers - if we previously had an Rx
underrun (ie. we returned -EAGAIN because we ran out of received data).
This lets the server know what we've managed to receive something.

Also flush any outstanding ACK after calling the function if it hit -EAGAIN
to let the server know we processed some data.

It might be better to send more ACKs, possibly on a time-based scheme, but
that needs some more consideration.

With this and some additional AFS patches, it is possible to get large
unencrypted O_DIRECT reads to be almost as fast as NFS over TCP.  It looks
like it might be theoretically possible to improve performance yet more for
a server running a single operation as investigation of packet timestamps
indicates that the server keeps stalling.

The issue appears to be that rxrpc runs in to trouble with ACK packets
getting batched together (up to ~32 at a time) somewhere between the IP
transmit queue on the client and the ethernet receive queue on the server.

However, this case isn't too much of a worry as even a lightly loaded
server should be receiving sufficient packet flux to flush the ACK packets
to the UDP socket.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-08-01 13:28:24 +01:00
David Howells
a71a2651bd rxrpc: Propose, but don't immediately transmit, the final ACK for a call
The final ACK that closes out an rxrpc call needs to be transmitted by the
client unless we're going to follow up with a DATA packet for a new call on
the same channel (which implicitly ACK's the previous call, thereby saving
an ACK).

Currently, we don't do that, so if no follow on call is immediately
forthcoming, the server will resend the last DATA packet - at which point
rxrpc_conn_retransmit_call() will be triggered and will (re)send the final
ACK.  But the server has to hold on to the last packet until the ACK is
received, thereby holding up its resources.

Fix the client side to propose a delayed final ACK, to be transmitted after
a short delay, assuming the call isn't superseded by a new one.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-08-01 13:28:24 +01:00
David Howells
4075295ab8 rxrpc: Increase the size of a call's Rx window
Increase the size of a call's Rx window from 32 to 63 - ie. one less than
the size of the ring buffer.  This makes large data transfers perform
better when the Tx window on the other side is around 64 (as is the case
with Auristor's YFS fileserver).

If the server window size is ~32 or smaller, this should make no
difference.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-08-01 13:28:23 +01:00
David Howells
4272d3034e rxrpc: Trace socket notification
Trace notifications from the softirq side of the socket to the
process-context side.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-08-01 13:28:23 +01:00
David Howells
4764c0da69 rxrpc: Trace packet transmission
Trace successful packet transmission (kernel_sendmsg() succeeded, that is)
in AF_RXRPC.  We can share the enum that defines the transmission points
with the trace_rxrpc_tx_fail() tracepoint, so rename its constants to be
applicable to both.

Also, save the internal call->debug_id in the rxrpc_channel struct so that
it can be used in retransmission trace lines.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-08-01 13:28:23 +01:00
David Howells
f3f8337c9e rxrpc: Fix the trace for terminal ACK (re)transmission
Fix the trace for terminal ACK (re)transmission to put in the right
parameters.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-08-01 13:28:23 +01:00
David Howells
6b97bd7a27 rxrpc: Show some more information through /proc files
Show the four current call IDs in /proc/net/rxrpc/conns.

Show the current packet Rx serial number in /proc/net/rxrpc/calls.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-08-01 13:28:23 +01:00
David Howells
887763bbc3 rxrpc: Display call expect-receive-by timeout in proc
Display in /proc/net/rxrpc/calls the timeout by which a call next expects
to receive a packet.

This makes it easier to debug timeout issues.

Signed-off-by: David Howells <dhowells@redhat.com>
2018-08-01 13:28:23 +01:00
YueHaibing
f597a5792a rxrpc: remove redundant variables 'sp' and 'did_discard'
Variables 'sp' and 'did_discard' are being assigned,
but are never used, hence they are redundant and can be removed.

fix following warning:

net/rxrpc/call_event.c:165:25: warning: variable 'sp' set but not used [-Wunused-but-set-variable]
net/rxrpc/conn_client.c:1054:7: warning: variable 'did_discard' set but not used [-Wunused-but-set-variable]

Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: David Howells <dhowells@redhat.com>
2018-08-01 13:28:23 +01:00
Mark Salyzyn
7992c18810 Bluetooth: hidp: buffer overflow in hidp_process_report
CVE-2018-9363

The buffer length is unsigned at all layers, but gets cast to int and
checked in hidp_process_report and can lead to a buffer overflow.
Switch len parameter to unsigned int to resolve issue.

This affects 3.18 and newer kernels.

Signed-off-by: Mark Salyzyn <salyzyn@android.com>
Fixes: a4b1b5877b ("HID: Bluetooth: hidp: make sure input buffers are big enough")
Cc: Marcel Holtmann <marcel@holtmann.org>
Cc: Johan Hedberg <johan.hedberg@gmail.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Kees Cook <keescook@chromium.org>
Cc: Benjamin Tissoires <benjamin.tissoires@redhat.com>
Cc: linux-bluetooth@vger.kernel.org
Cc: netdev@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Cc: security@kernel.org
Cc: kernel-team@android.com
Acked-by: Kees Cook <keescook@chromium.org>
Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
2018-08-01 09:12:35 +02:00