Commit Graph

39454 Commits

Author SHA1 Message Date
Eric W. Biederman
6a1d689d9f netfilter: ipt_SYNPROXY: Pass snet into synproxy_send_tcp
ip6t_SYNPROXY already does this and this is needed so that we have a
struct net that can be passed down into ip_route_me_harder, so
that ip_route_me_harder can stop guessing it's context.

Along the way pass snet into synproxy_send_client_synack as this
is the only caller of synprox_send_tcp that is not passed snet
already.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-29 20:21:31 +02:00
Eric W. Biederman
d815d90bbb netfilter: Push struct net down into nf_afinfo.reroute
The network namespace is needed when routing a packet.
Stop making nf_afinfo.reroute guess which network namespace
is the proper namespace to route the packet in.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-29 20:21:31 +02:00
Eric W. Biederman
372892ec11 ipv4: Push struct net down into nf_send_reset
This is needed so struct net can be pushed down into
ip_route_me_harder.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-29 20:21:31 +02:00
Johannes Berg
076cdcb12f mac80211: use bool argument to ieee80211_send_nullfunc
Instead of int with 0/1, use bool with false/true for the
powersave argument to ieee80211_send_nullfunc().

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:52 +02:00
Johannes Berg
90d13e8f5b mac80211: reduce indentation by inlining a check
Instead of nesting two if statements, inline the second
check into the first if statement and to indentation.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:52 +02:00
Felix Fietkau
50f36ae61a mac80211: fix tx sequence number assignment with software queue + fast-xmit
When using software queueing, tx sequence number assignment happens at
ieee80211_tx_dequeue time, so the fast-xmit codepath must not do that.

Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:51 +02:00
Ayala Beker
44674d9c22 mac80211: advertise support for full station state in AP mode
This enables adding stations in unauthenticated mode, just after
receiving the first authentication frame; which in turn allows
sending a negative authentication reply if the station cannot be
added.

In addition init rate control for unassociated station only when
it becomes associated, prior to that low rates will be used.

Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:51 +02:00
Ayala Beker
47edb11b52 cfg80211: allow changing station capabilities for unassociated stations
Currently, cfg80211 rejects capability updates for existing entries
and as a result it's impossible to update entries that were added
unassociated, but that is necessary to go through the full station
states from userspace, adding a station before authentication etc.

Fix this by allowing updates to capabilities for stations that the
driver (or mac80211) assigned unassociated state. Drivers setting
the full station state support flag must use the new station type
for proper operation.

Signed-off-by: Ayala Beker <ayala.beker@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:50 +02:00
Johannes Berg
d0a77c6569 mac80211: allow writing TX PN in debugfs
For certain tests, for example replay detection, it can be useful
to be able to influence/set the PN used in outgoing packets. Make
it possible to change the TX PN in debugfs.

For now, this doesn't support TKIP since I haven't needed it, but
there's no reason it couldn't be added if necessary.

Note that this must be used very carefully: it could, for example,
be used to make "valid replays" where the PN reuse happens on a
different TID. This couldn't be done by an attacker since the TID
is protected as part of the AAD.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:50 +02:00
Denys Vlasenko
416eb9fc29 mac80211: Deinline drv_get/set/reset_tsf()
With this .config: http://busybox.net/~vda/kernel_config_ALLYES_Os,
after deinlining these functions have sizes and callsite counts
as follows:

drv_get_tsf: 634 bytes, 6 calls
drv_set_tsf: 626 bytes, 2 calls
drv_reset_tsf: 617 bytes, 2 calls

Total size reduction is about 4.2 kbytes.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Johannes Berg <johannes.berg@intel.com>
CC: John Linville <linville@tuxdriver.com>
CC: Michal Kazior <michal.kazior@tieto.com>
CC: linux-wireless@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:49 +02:00
Denys Vlasenko
6db9683897 mac80211: Deinline drv_ampdu_action()
With this .config: http://busybox.net/~vda/kernel_config_ALLYES_Os,
after deinlining the function size is 755 bytes and there are
6 callsites.

Total size reduction is about 3.3 kbytes.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Johannes Berg <johannes.berg@intel.com>
CC: John Linville <linville@tuxdriver.com>
CC: Michal Kazior <michal.kazior@tieto.com>
CC: linux-wireless@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:49 +02:00
Denys Vlasenko
42677ed33a mac80211: Deinline drv_switch_vif_chanctx()
With this .config: http://busybox.net/~vda/kernel_config_ALLYES_Os,
after deinlining the function size is 821 bytes and there are
2 callsites, reducing code size by about 800 bytes.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Johannes Berg <johannes.berg@intel.com>
CC: John Linville <linville@tuxdriver.com>
CC: Michal Kazior <michal.kazior@tieto.com>
CC: linux-wireless@vger.kernel.org
CC: linux-kernel@vger.kernel.org
[adjust code-style a bit]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:48 +02:00
Johannes Berg
0e5c371aa0 mac80211: improve __rate_control_send_low warning
If there are no supported rates in the rate mask with the required
flags, we warn, but it's not clear which part causes the warning.

Add the relevant data to the warning to understand why it happens.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:48 +02:00
Johannes Berg
a0c391b134 mac80211: minstrel[_ht]: remove non-ascii debugfs characters
Replace the average symbol by "avg" to avoid being warned about the
non-ASCII symbol all the time, line up the columns properly.

(I changed my mind - the warnings are getting annoying)

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:47 +02:00
Eliad Peller
2739271954 mac80211: don't tear down aggregation on suspend in case of wowlan->any
In case of "any" wowlan trigger, there is no reason to tear down
aggregations, as we want the device to continue working normally.

Similarly, there's no reason to tear down aggregations on resume,
as they should have been torn down on suspend if needed.
However, since the reconfiguration flow is shared with HW restart,
tear down aggregations on reconfiguration when we are not resuming.

To keep things working after non-wowlan suspend, keep clearing the
WLAN_STA_BLOCK_BA flag.

Signed-off-by: Eliad Peller <eliadx.peller@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:47 +02:00
Fu, Zhonghui
9f0e13546e net/wireless: enable wiphy device to suspend/resume asynchronously
Now, PM core supports asynchronous suspend/resume mode for devices
during system suspend/resume, and the power state transition of one
device may be completed in separate kernel thread. PM core ensures
all power state transition timing dependency between devices. This
patch enables wiphy device to suspend/resume asynchronously. This can
take advantage of multicore and improve system suspend/resume speed.

Signed-off-by: Zhonghui Fu <zhonghui.fu@linux.intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:46 +02:00
Denys Vlasenko
9aae296a62 mac80211: Deinline drv_add/remove/change_interface()
With this .config: http://busybox.net/~vda/kernel_config_ALLYES_Os,
after deinlining these functions have sizes and callsite counts
as follows:

drv_add_interface: 638 bytes, 5 calls
drv_remove_interface: 611 bytes, 6 calls
drv_change_interface: 658 bytes, 1 call

Total size reduction is about 9 kbytes.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: John Linville <linville@tuxdriver.com>
CC: Michal Kazior <michal.kazior@tieto.com>
CC: Johannes Berg <johannes.berg@intel.com>
CC: linux-wireless@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:46 +02:00
Denys Vlasenko
4fbd572c29 mac80211: Deinline drv_sta_rc_update()
With this .config: http://busybox.net/~vda/kernel_config_ALLYES_Os,
after deinlining the function size is 706 bytes and there are
2 callsites, reducing code size by about 700 bytes.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: John Linville <linville@tuxdriver.com>
CC: Michal Kazior <michal.kazior@tieto.com>
CC: Johannes Berg <johannes.berg@intel.com>
CC: linux-wireless@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:45 +02:00
Denys Vlasenko
b23dcd4aca mac80211: Deinline drv_conf_tx()
With this .config: http://busybox.net/~vda/kernel_config_ALLYES_Os,
after deinlining the function size is 785 bytes and there are
7 callsites.

Total size reduction is about 3.5 kbytes.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: John Linville <linville@tuxdriver.com>
CC: Michal Kazior <michal.kazior@tieto.com>
CC: Johannes Berg <johannes.berg@intel.com>
CC: linux-wireless@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:45 +02:00
Helmut Schaa
35afa58862 mac80211: Copy tx'ed beacons to monitor mode
When debugging wireless powersave issues on the AP side it's quite helpful
to see our own beacons that are transmitted by the hardware/driver. However,
this is not that easy since beacons don't pass through the regular TX queues.

Preferably drivers would call ieee80211_tx_status also for tx'ed beacons
but that's not always possible. Hence, just send a copy of each beacon
generated by ieee80211_beacon_get_tim to monitor devices when they are
getting fetched by the driver.

Also add a HW flag IEEE80211_HW_BEACON_TX_STATUS that can be used by
drivers to indicate that they report TX status for beacons.

Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com>
(with a fix from Christian Lamparted rolled in)
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-29 15:56:33 +02:00
Denys Vlasenko
2103d6b818 net: sctp: Don't use 64 kilobyte lookup table for four elements
Seemingly innocuous sctp_trans_state_to_prio_map[] array
is way bigger than it looks, since
"[SCTP_UNKNOWN] = 2" expands into "[0xffff] = 2" !

This patch replaces it with switch() statement.

Signed-off-by: Denys Vlasenko <dvlasenk@redhat.com>
CC: Vlad Yasevich <vyasevich@gmail.com>
CC: Neil Horman <nhorman@tuxdriver.com>
CC: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
CC: linux-sctp@vger.kernel.org
CC: netdev@vger.kernel.org
CC: linux-kernel@vger.kernel.org
Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com>
Acked-by: Neil Horman <nhorman@tuxdriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28 22:52:21 -07:00
Jesper Dangaard Brouer
8a4683a5e0 net: help compiler generate better code in eth_get_headlen
Noticed that the compiler (gcc version 4.8.5 20150623 (Red Hat 4.8.5-4) (GCC))
generated suboptimal assembler code in eth_get_headlen().

This early return coding style is usually not an issue, on super scalar CPUs,
but the compiler choose to put the return statement after this very unlikely
branch, thus creating larger jump down to the likely code path.

Performance wise, I could measure slightly less L1-icache-load-misses
and less branch-misses, and an improvement of 1 nanosec with an IP-forwarding
use-case with 257 bytes packets with ixgbe (CPU i7-4790K @ 4.00GHz).

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28 22:51:15 -07:00
Alexander Couzens
06a15f51cf l2tp: protect tunnel->del_work by ref_count
There is a small chance that tunnel_free() is called before tunnel->del_work scheduled
resulting in a zero pointer dereference.

Signed-off-by: Alexander Couzens <lynxis@fe80.eu>
Acked-by: James Chapman <jchapman@katalix.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28 22:39:10 -07:00
Bendik Rønning Opstad
d2e1339f40 tcp: Fix CWV being too strict on thin streams
Application limited streams such as thin streams, that transmit small
amounts of payload in relatively few packets per RTT, can be prevented
from growing the CWND when in congestion avoidance. This leads to
increased sojourn times for data segments in streams that often transmit
time-dependent data.

Currently, a connection is considered CWND limited only after having
successfully transmitted at least one packet with new data, while at the
same time failing to transmit some unsent data from the output queue
because the CWND is full. Applications that produce small amounts of
data may be left in a state where it is never considered to be CWND
limited, because all unsent data is successfully transmitted each time
an incoming ACK opens up for more data to be transmitted in the send
window.

Fix by always testing whether the CWND is fully used after successful
packet transmissions, such that a connection is considered CWND limited
whenever the CWND has been filled. This is the correct behavior as
specified in RFC2861 (section 3.1).

Cc: Andreas Petlund <apetlund@simula.no>
Cc: Carsten Griwodz <griff@simula.no>
Cc: Jonas Markussen <jonassm@ifi.uio.no>
Cc: Kenneth Klette Jonassen <kennetkl@ifi.uio.no>
Cc: Mads Johannessen <madsjoh@ifi.uio.no>
Signed-off-by: Bendik Rønning Opstad <bro.devel+kernel@gmail.com>
Acked-by: Eric Dumazet <edumazet@google.com>
Tested-by: Eric Dumazet <edumazet@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28 22:36:30 -07:00
David Ahern
17fb0b2b90 net: Remove redundant oif checks in rt6_device_match
The oif has already been checked that it is non-zero; the 2 additional
checks on oif within that if (oif) {...} block are redundant.

CC: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28 22:30:24 -07:00
Eric Dumazet
7c85af8810 tcp: avoid reorders for TFO passive connections
We found that a TCP Fast Open passive connection was vulnerable
to reorders, as the exchange might look like

[1] C -> S S <FO ...> <request>
[2] S -> C S. ack request <options>
[3] S -> C . <answer>

packets [2] and [3] can be generated at almost the same time.

If C receives the 3rd packet before the 2nd, it will drop it as
the socket is in SYN_SENT state and expects a SYNACK.

S will have to retransmit the answer.

Current OOO avoidance in linux is defeated because SYNACK
packets are attached to the LISTEN socket, while DATA packets
are attached to the children. They might be sent by different cpus,
and different TX queues might be selected.

It turns out that for TFO, we created a child, which is a
full blown socket in TCP_SYN_RECV state, and we simply can attach
the SYNACK packet to this socket.

This means that at the time tcp_sendmsg() pushes DATA packet,
skb->ooo_okay will be set iff the SYNACK packet had been sent
and TX completed.

This removes the reorder source at the host level.

We also removed the export of tcp_try_fastopen(), as it is no
longer called from IPv6.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28 22:11:19 -07:00
Karl Heiss
635682a144 sctp: Prevent soft lockup when sctp_accept() is called during a timeout event
A case can occur when sctp_accept() is called by the user during
a heartbeat timeout event after the 4-way handshake.  Since
sctp_assoc_migrate() changes both assoc->base.sk and assoc->ep, the
bh_sock_lock in sctp_generate_heartbeat_event() will be taken with
the listening socket but released with the new association socket.
The result is a deadlock on any future attempts to take the listening
socket lock.

Note that this race can occur with other SCTP timeouts that take
the bh_lock_sock() in the event sctp_accept() is called.

 BUG: soft lockup - CPU#9 stuck for 67s! [swapper:0]
 ...
 RIP: 0010:[<ffffffff8152d48e>]  [<ffffffff8152d48e>] _spin_lock+0x1e/0x30
 RSP: 0018:ffff880028323b20  EFLAGS: 00000206
 RAX: 0000000000000002 RBX: ffff880028323b20 RCX: 0000000000000000
 RDX: 0000000000000000 RSI: ffff880028323be0 RDI: ffff8804632c4b48
 RBP: ffffffff8100bb93 R08: 0000000000000000 R09: 0000000000000000
 R10: ffff880610662280 R11: 0000000000000100 R12: ffff880028323aa0
 R13: ffff8804383c3880 R14: ffff880028323a90 R15: ffffffff81534225
 FS:  0000000000000000(0000) GS:ffff880028320000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
 CR2: 00000000006df528 CR3: 0000000001a85000 CR4: 00000000000006e0
 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
 Process swapper (pid: 0, threadinfo ffff880616b70000, task ffff880616b6cab0)
 Stack:
 ffff880028323c40 ffffffffa01c2582 ffff880614cfb020 0000000000000000
 <d> 0100000000000000 00000014383a6c44 ffff8804383c3880 ffff880614e93c00
 <d> ffff880614e93c00 0000000000000000 ffff8804632c4b00 ffff8804383c38b8
 Call Trace:
 <IRQ>
 [<ffffffffa01c2582>] ? sctp_rcv+0x492/0xa10 [sctp]
 [<ffffffff8148c559>] ? nf_iterate+0x69/0xb0
 [<ffffffff814974a0>] ? ip_local_deliver_finish+0x0/0x2d0
 [<ffffffff8148c716>] ? nf_hook_slow+0x76/0x120
 [<ffffffff814974a0>] ? ip_local_deliver_finish+0x0/0x2d0
 [<ffffffff8149757d>] ? ip_local_deliver_finish+0xdd/0x2d0
 [<ffffffff81497808>] ? ip_local_deliver+0x98/0xa0
 [<ffffffff81496ccd>] ? ip_rcv_finish+0x12d/0x440
 [<ffffffff81497255>] ? ip_rcv+0x275/0x350
 [<ffffffff8145cfeb>] ? __netif_receive_skb+0x4ab/0x750
 ...

With lockdep debugging:

 =====================================
 [ BUG: bad unlock balance detected! ]
 -------------------------------------
 CslRx/12087 is trying to release lock (slock-AF_INET) at:
 [<ffffffffa01bcae0>] sctp_generate_timeout_event+0x40/0xe0 [sctp]
 but there are no more locks to release!

 other info that might help us debug this:
 2 locks held by CslRx/12087:
 #0:  (&asoc->timers[i]){+.-...}, at: [<ffffffff8108ce1f>] run_timer_softirq+0x16f/0x3e0
 #1:  (slock-AF_INET){+.-...}, at: [<ffffffffa01bcac3>] sctp_generate_timeout_event+0x23/0xe0 [sctp]

Ensure the socket taken is also the same one that is released by
saving a copy of the socket before entering the timeout event
critical section.

Signed-off-by: Karl Heiss <kheiss@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28 21:03:40 -07:00
Karl Heiss
f05940e618 sctp: Whitespace fix
Fix indentation in sctp_generate_heartbeat_event.

Signed-off-by: Karl Heiss <kheiss@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-28 21:03:40 -07:00
Ian Wilson
34c2d9fb04 bridge: Allow forward delay to be cfgd when STP enabled
Allow bridge forward delay to be configured when Spanning Tree is enabled.

Signed-off-by: Ian Wilson <iwilson@brocade.com>
Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-27 19:09:38 -07:00
Jiri Benc
b1be00a6c3 vxlan: support both IPv4 and IPv6 sockets in a single vxlan device
For metadata based vxlan interface, open both IPv4 and IPv6 socket. This is
much more user friendly: it's not necessary to create two vxlan interfaces
and pay attention to using the right one in routing rules.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-26 22:40:55 -07:00
David S. Miller
4963ed48f2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Conflicts:
	net/ipv4/arp.c

The net/ipv4/arp.c conflict was one commit adding a new
local variable while another commit was deleting one.

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-26 16:08:27 -07:00
Linus Torvalds
518a7cb698 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net
Pull networking fixes from David Miller:

 1) When we run a tap on netlink sockets, we have to copy mmap'd SKBs
    instead of cloning them.  From Daniel Borkmann.

 2) When converting classical BPF into eBPF, fix the setting of the
    source reg to BPF_REG_X.  From Tycho Andersen.

 3) Fix igmpv3/mldv2 report parsing in the bridge multicast code, from
    Linus Lussing.

 4) Fix dst refcounting for ipv6 tunnels, from Martin KaFai Lau.

 5) Set NLM_F_REPLACE flag properly when replacing ipv6 routes, from
    Roopa Prabhu.

 6) Add some new cxgb4 PCI device IDs, from Hariprasad Shenai.

 7) Fix headroom tests and SKB leaks in ipv6 fragmentation code, from
    Florian Westphal.

 8) Check DMA mapping errors in bna driver, from Ivan Vecera.

 9) Several 8139cp bug fixes (dev_kfree_skb_any in interrupt context,
    misclearing of interrupt status in TX timeout handler, etc.) from
    David Woodhouse.

10) In tipc, reset SKB header pointer after skb_linearize(), from Erik
    Hugne.

11) Fix autobind races et al. in netlink code, from Herbert Xu with
    help from Tejun Heo and others.

12) Missing SET_NETDEV_DEV in sunvnet driver, from Sowmini Varadhan.

13) Fix various races in timewait timer and reqsk_queue_hadh_req, from
    Eric Dumazet.

14) Fix array overruns in mac80211, from Johannes Berg and Dan
    Carpenter.

15) Fix data race in rhashtable_rehash_one(), from Dmitriy Vyukov.

16) Fix race between poll_one_napi and napi_disable, from Neil Horman.

17) Fix byte order in geneve tunnel port config, from John W Linville.

18) Fix handling of ARP replies over lightweight tunnels, from Jiri
    Benc.

19) We can loop when fib rule dumps cross multiple SKBs, fix from Wilson
    Kok and Roopa Prabhu.

20) Several reference count handling bug fixes in the PHY/MDIO layer
    from Russel King.

21) Fix lockdep splat in ppp_dev_uninit(), from Guillaume Nault.

22) Fix crash in icmp_route_lookup(), from David Ahern.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (116 commits)
  net: Fix panic in icmp_route_lookup
  net: update docbook comment for __mdiobus_register()
  ppp: fix lockdep splat in ppp_dev_uninit()
  net: via/Kconfig: GENERIC_PCI_IOMAP required if PCI not selected
  phy: marvell: add link partner advertised modes
  net: fix net_device refcounting
  phy: add phy_device_remove()
  phy: fixed-phy: properly validate phy in fixed_phy_update_state()
  net: fix phy refcounting in a bunch of drivers
  of_mdio: fix MDIO phy device refcounting
  phy: add proper phy struct device refcounting
  phy: fix mdiobus module safety
  net: dsa: fix of_mdio_find_bus() device refcount leak
  phy: fix of_mdio_find_bus() device refcount leak
  ip6_tunnel: Reduce log level in ip6_tnl_err() to debug
  ip6_gre: Reduce log level in ip6gre_err() to debug
  fib_rules: fix fib rule dumps across multiple skbs
  bnx2x: byte swap rss_key to comply to Toeplitz specs
  net: revert "net_sched: move tp->root allocation into fw_init()"
  lwtunnel: remove source and destination UDP port config option
  ...
2015-09-26 06:01:33 -04:00
David Ahern
bdb06cbf77 net: Fix panic in icmp_route_lookup
Andrey reported a panic:

[ 7249.865507] BUG: unable to handle kernel pointer dereference at 000000b4
[ 7249.865559] IP: [<c16afeca>] icmp_route_lookup+0xaa/0x320
[ 7249.865598] *pdpt = 0000000030f7f001 *pde = 0000000000000000
[ 7249.865637] Oops: 0000 [#1]
...
[ 7249.866811] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
4.3.0-999-generic #201509220155
[ 7249.866876] Hardware name: MSI MS-7250/MS-7250, BIOS 080014  08/02/2006
[ 7249.866916] task: c1a5ab00 ti: c1a52000 task.ti: c1a52000
[ 7249.866949] EIP: 0060:[<c16afeca>] EFLAGS: 00210246 CPU: 0
[ 7249.866981] EIP is at icmp_route_lookup+0xaa/0x320
[ 7249.867012] EAX: 00000000 EBX: f483ba48 ECX: 00000000 EDX: f2e18a00
[ 7249.867045] ESI: 000000c0 EDI: f483ba70 EBP: f483b9ec ESP: f483b974
[ 7249.867077]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 7249.867108] CR0: 8005003b CR2: 000000b4 CR3: 36ee07c0 CR4: 000006f0
[ 7249.867141] Stack:
[ 7249.867165]  320310ee 00000000 00000042 320310ee 00000000 c1aeca00
f3920240 f0c69180
[ 7249.867268]  f483ba04 f855058b a89b66cd f483ba44 f8962f4b 00000000
e659266c f483ba54
[ 7249.867361]  8004753c f483ba5c f8962f4b f2031140 000003c1 ffbd8fa0
c16b0e00 00000064
[ 7249.867448] Call Trace:
[ 7249.867494]  [<f855058b>] ? e1000_xmit_frame+0x87b/0xdc0 [e1000e]
[ 7249.867534]  [<f8962f4b>] ? tcp_in_window+0xeb/0xb10 [nf_conntrack]
[ 7249.867576]  [<f8962f4b>] ? tcp_in_window+0xeb/0xb10 [nf_conntrack]
[ 7249.867615]  [<c16b0e00>] ? icmp_send+0xa0/0x380
[ 7249.867648]  [<c16b102f>] icmp_send+0x2cf/0x380
[ 7249.867681]  [<f89c8126>] nf_send_unreach+0xa6/0xc0 [nf_reject_ipv4]
[ 7249.867714]  [<f89cd0da>] reject_tg+0x7a/0x9f [ipt_REJECT]
[ 7249.867746]  [<f88c29a7>] ipt_do_table+0x317/0x70c [ip_tables]
[ 7249.867780]  [<f895e0a6>] ? __nf_conntrack_find_get+0x166/0x3b0
[nf_conntrack]
[ 7249.867838]  [<f895eea8>] ? nf_conntrack_in+0x398/0x600 [nf_conntrack]
[ 7249.867889]  [<f84c0035>] iptable_filter_hook+0x35/0x80 [iptable_filter]
[ 7249.867933]  [<c16776a1>] nf_iterate+0x71/0x80
[ 7249.867970]  [<c1677715>] nf_hook_slow+0x65/0xc0
[ 7249.868002]  [<c1681811>] __ip_local_out_sk+0xc1/0xd0
[ 7249.868034]  [<c1680f30>] ? ip_forward_options+0x1a0/0x1a0
[ 7249.868066]  [<c1681836>] ip_local_out_sk+0x16/0x30
[ 7249.868097]  [<c1684054>] ip_send_skb+0x14/0x80
[ 7249.868129]  [<c16840f4>] ip_push_pending_frames+0x34/0x40
[ 7249.868163]  [<c16844a2>] ip_send_unicast_reply+0x282/0x310
[ 7249.868196]  [<c16a0863>] tcp_v4_send_reset+0x1b3/0x380
[ 7249.868227]  [<c16a1b63>] tcp_v4_rcv+0x323/0x990
[ 7249.868257]  [<c16776a1>] ? nf_iterate+0x71/0x80
[ 7249.868289]  [<c167dc2b>] ip_local_deliver_finish+0x8b/0x230
[ 7249.868322]  [<c167df4c>] ip_local_deliver+0x4c/0xa0
[ 7249.868353]  [<c167dba0>] ? ip_rcv_finish+0x390/0x390
[ 7249.868384]  [<c167d88c>] ip_rcv_finish+0x7c/0x390
[ 7249.868415]  [<c167e280>] ip_rcv+0x2e0/0x420
...

Prior to the VRF change the oif was not set in the flow struct, so the
VRF support should really have only added the vrf_master_ifindex lookup.

Fixes: 613d09b30f ("net: Use VRF device index for lookups on TX")
Cc: Andrey Melnikov <temnota.am@gmail.com>
Signed-off-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 21:44:02 -07:00
Eric Dumazet
1b70e977ce inet: constify inet_rtx_syn_ack() sock argument
SYNACK packets are sent on behalf on unlocked listeners
or fastopen sockets. Mark socket as const to catch future changes
that might break the assumption.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:39 -07:00
Eric Dumazet
ea3bea3a1d tcp/dccp: constify rtx_synack() and friends
This is done to make sure we do not change listener socket
while sending SYNACK packets while socket lock is not held.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:39 -07:00
Eric Dumazet
802885fc04 dccp: constify dccp_make_response() socket argument
Like tcp_make_synack() the only time we might change the socket is
when calling sock_wmalloc(), which is using atomic operation to
update sk->sk_wmem_alloc

Also use MAX_DCCP_HEADER as both IPv4/IPv6 use this value for max_header.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:39 -07:00
Eric Dumazet
0f935dbedc tcp: constify tcp_v{4|6}_send_synack() socket argument
This documents fact that listener lock might not be held
at the time SYNACK are sent.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:39 -07:00
Eric Dumazet
1c1e9d2b67 ipv6: constify ip6_xmit() sock argument
This is to document that socket lock might not be held at this point.

skb_set_owner_w() and ipv6_local_error() are using proper atomic ops
or spinlocks, so we promote the socket to non const when calling them.

netfilter hooks should never assume socket lock is held,
we also promote the socket to non const.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:38 -07:00
Eric Dumazet
5d062de7f8 tcp: constify tcp_make_synack() socket argument
listener socket is not locked when tcp_make_synack() is called.

We better make sure no field is written.

There is one exception : Since SYNACK packets are attached to the listener
at this moment (or SYN_RECV child in case of Fast Open),
sock_wmalloc() needs to update sk->sk_wmem_alloc, but this is done using
atomic operations so this is safe.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:38 -07:00
Eric Dumazet
6ac705b180 tcp: remove tcp_ecn_make_synack() socket argument
SYNACK packets might be sent without holding socket lock.

For DCTCP/ECN sake, we should call INET_ECN_xmit() while
socket lock is owned, and only when we init/change congestion control.

This also fixies a bug if congestion module is changed from
dctcp to another one on a listener : we now clear ECN bits
properly.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:38 -07:00
Eric Dumazet
37bfbdda0b tcp: remove tcp_synack_options() socket argument
We do not use the socket in this function.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:38 -07:00
Eric Dumazet
cfe673b0ae ip: constify ip_build_and_send_pkt() socket argument
This function is used to build and send SYNACK packets,
possibly on behalf of unlocked listener socket.

Make sure we did not miss a write by making this socket const.

We no longer can use ip_select_ident() and have to either
set iph->id to 0 or directly call __ip_select_ident()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:38 -07:00
Eric Dumazet
b83e3deb97 tcp: md5: constify tcp_md5_do_lookup() socket argument
When TCP new listener is done, these functions will be called
without socket lock being held. Make sure they don't change
anything.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:38 -07:00
Eric Dumazet
30d50c61df ipv6: constify inet6_csk_route_req() socket argument
socket is not modified, make it const so that callers can
do the same if they need.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:37 -07:00
Eric Dumazet
3aef934f4d ipv6: constify ip6_dst_lookup_{flow|tail}() sock arguments
ip6_dst_lookup_flow() and ip6_dst_lookup_tail() do not touch
socket, lets add a const qualifier.

This will permit the same change in inet6_csk_route_req()

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:37 -07:00
Eric Dumazet
e5895bc600 inet: constify inet_csk_route_req() socket argument
This is used by TCP listener core, and listener socket shall
not be modified by inet_csk_route_req().

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:37 -07:00
Eric Dumazet
6f9c961546 inet: constify ip_route_output_flow() socket argument
Very soon, TCP stack might call inet_csk_route_req(), which
calls inet_csk_route_req() with an unlocked listener socket,
so we need to make sure ip_route_output_flow() is not trying to
change any field from its socket argument.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:37 -07:00
Eric Dumazet
b1964b5fce tcp: constify tcp_openreq_init_rwin()
Soon, listener socket wont be locked when tcp_openreq_init_rwin()
is called. We need to read socket fields once, as their value
could change under us.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:36 -07:00
Eric Dumazet
b40cf18ef7 tcp: constify listener socket in tcp_v[46]_init_req()
Soon, listener socket spinlock will no longer be held,
add const arguments to tcp_v[46]_init_req() to make clear these
functions can not mess socket fields.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 13:00:36 -07:00
Michal Kubeček
6ea29da1d0 net: remove unused argument of __netdev_find_adj()
The __netdev_find_adj() helper does not use its first argument, only the
device to find and list to walk through.

Signed-off-by: Michal Kubecek <mkubecek@suse.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 12:35:15 -07:00
stephen hemminger
163c2e252f l2tp: auto load IP modules
When creating a IP encapsulated tunnel the necessary l2tp module
should be loaded. It already works for UDP encapsulation, it just
doesn't work for direct IP encap.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 12:27:22 -07:00
stephen hemminger
f1f39f9110 l2tp: auto load type modules
It should not be necessary to do explicit module loading when
configuring L2TP. Modules should be loaded as needed instead
(as is done already with netlink and other tunnel types).

This patch adds a new module alias type and code to load
the sub module on demand.

Signed-off-by: Stephen Hemminger <stephen@networkplumber.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 12:27:22 -07:00
Florian Fainelli
f37db85d0c net: dsa: Set a "dsa" device_type
Provide a device_type information for slave network devices created by
DSA, this is useful for user-space application to easily locate/search
for devices of a specific kind.

Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
Reviewed-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-25 12:25:16 -07:00
Linus Torvalds
101688f534 NFS client bugfixes for Linux 4.3
Highlights include:
 
 Stable patches:
 - fix v4.2 SEEK on files over 2 gigs
 - Fix a layout segment reference leak when pNFS I/O falls back to inband I/O.
 - Fix recovery of recalled read delegations
 
 Bugfixes:
 - Fix a case where NFSv4 fails to send CLOSE after a server reboot
 - Fix sunrpc to wait for connections to complete before retrying
 - Fix sunrpc races between transport connect/disconnect and shutdown
 - Fix an infinite loop when layoutget fail with BAD_STATEID
 - nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
 - Fix a bogus WARN_ON_ONCE() in O_DIRECT when layout commit_through_mds is set
 - Fix layoutreturn/close ordering issues.
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v1
 
 iQIcBAABAgAGBQJWBWKdAAoJEGcL54qWCgDy2rUP/iIWUQSpUPfKKw7xquQUQe4j
 ci4nFxpJ/zhKj1u7x3wrkxZAXcEooYo+ZJ7ayzROKcfQL/sUSWGbSLdr3mqrQynv
 b0SDmnJK9V+CdBQrA+Jp5UGQxcumpMxsAfqVznT0qkf/wDp44DCVgDz5Aj8cRbWU
 6xPfMgVLEnXiId9IgKqg3sJ2NmvMZXuI9sHM6hp6OzRmQDjTcx+LgRz7tnQHgaEk
 zGz8R6eDm3OA0wfApqZwJ6JY793HsDdy30W9L0Yi2PVGXfzwoEB8AqgLVwSDIY1B
 5hG5zn3tg9PSz9vhJ7M2h4AgFHdB3w3XGdJUafwqZEeqEIagw1iFCWlMyo/lE2dG
 G7oob9Jiiwxjc3RDWn2wGaafymrrWZwl2nYzC4O3UvJ3hVJ0mEl1iJagK1m8LzfN
 fmnP7tTyPuoOXkzDogZ0YI3FrngO6430PoR2hUPkS1yce/a+IV0HQEmXbSDSwN80
 1d9zyC9TnPj6rFjZeaGxGK17BpkC0oIQCPq4OSJB4396wzAwMqoJjJVVWWeAK4UC
 PxzoXqAAaBFguSsDbuBMcXgiuUw/7DIZ/pdzsWSiCFgocgF5ZdJdieCNtGk0nbLM
 37R7HCauF93JDrkpUMKPnLXScb2IbEh31pFtKzptJYKwMxEiScXXiP3NE9hfX65i
 2zLkl2aBvd154RvVKNbp
 =GdeV
 -----END PGP SIGNATURE-----

Merge tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs

Pull NFS client bugfixes from Trond Myklebust:
 "Highlights include:

  Stable patches:
   - fix v4.2 SEEK on files over 2 gigs
   - Fix a layout segment reference leak when pNFS I/O falls back to inband I/O.
   - Fix recovery of recalled read delegations

  Bugfixes:
   - Fix a case where NFSv4 fails to send CLOSE after a server reboot
   - Fix sunrpc to wait for connections to complete before retrying
   - Fix sunrpc races between transport connect/disconnect and shutdown
   - Fix an infinite loop when layoutget fail with BAD_STATEID
   - nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
   - Fix a bogus WARN_ON_ONCE() in O_DIRECT when layout commit_through_mds is set
   - Fix layoutreturn/close ordering issues"

* tag 'nfs-for-4.3-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs:
  NFS41: make close wait for layoutreturn
  NFS: Skip checking ds_cinfo.buckets when lseg's commit_through_mds is set
  NFSv4.x/pnfs: Don't try to recover stateids twice in layoutget
  NFSv4: Recovery of recalled read delegations is broken
  NFS: Fix an infinite loop when layoutget fail with BAD_STATEID
  NFS: Do cleanup before resetting pageio read/write to mds
  SUNRPC: xs_sock_mark_closed() does not need to trigger socket autoclose
  SUNRPC: Lock the transport layer on shutdown
  nfs/filelayout: Fix NULL reference caused by double freeing of fh_array
  SUNRPC: Ensure that we wait for connections to complete before retrying
  SUNRPC: drop null test before destroy functions
  nfs: fix v4.2 SEEK on files over 2 gigs
  SUNRPC: Fix races between socket connection and destroy code
  nfs: fix pg_test page count calculation
  Failing to send a CLOSE if file is opened WRONLY and server reboots on a 4.x mount
2015-09-25 11:33:52 -07:00
Chuck Lever
bb6c96d728 xprtrdma: Replace global lkey with lkey local to PD
The core API has changed so that devices that do not have a global
DMA lkey automatically create an mr, per-PD, and make that lkey
available. The global DMA lkey interface is going away in favor of
the per-PD DMA lkey.

The per-PD DMA lkey is always available. Convert xprtrdma to use the
device's per-PD DMA lkey for regbufs, no matter which memory
registration scheme is in use.

Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Signed-off-by: Sagi Grimberg <sagig@mellanox.com>
Cc: linux-nfs <linux-nfs@vger.kernel.org>
Acked-by: Anna Schumaker <Anna.Schumaker@Netapp.com>
Signed-off-by: Doug Ledford <dledford@redhat.com>
2015-09-25 10:46:51 -04:00
Russell King
9861f72074 net: fix net_device refcounting
of_find_net_device_by_node() uses class_find_device() internally to
lookup the corresponding network device.  class_find_device() returns
a reference to the embedded struct device, with its refcount
incremented.

Add a comment to the definition in net/core/net-sysfs.c indicating the
need to drop this refcount, and fix the DSA code to drop this refcount
when the OF-generated platform data is cleaned up and freed.  Also
arrange for the ref to be dropped when handling errors.

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 23:04:53 -07:00
Russell King
e496ae690b net: dsa: fix of_mdio_find_bus() device refcount leak
Current users of of_mdio_find_bus() leak a struct device refcount, as
they fail to clean up the reference obtained inside class_find_device().

Fix the DSA code to properly refcount the returned MDIO bus by:
1. taking a reference on the struct device whenever we assign it to
   pd->chip[x].host_dev.
2. dropping the reference when we overwrite the existing reference.
3. dropping the reference when we free the data structure.
4. dropping the initial reference we obtained after setting up the
   platform data structure, or on failure.

In step 2 above, where we obtain a new MDIO bus, there is no need to
take a reference on it as we would only have to drop it immediately
after assignment again, iow:

	put_device(cd->host_dev);	/* drop original assignment ref */
	cd->host_dev = get_device(&mdio_bus_switch->dev); /* get our ref */
	put_device(&mdio_bus_switch->dev); /* drop of_mdio_find_bus ref */

Signed-off-by: Russell King <rmk+kernel@arm.linux.org.uk>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 23:04:52 -07:00
Jiri Pirko
f623ab7f51 switchdev: reduce transaction phase enum down to a boolean
Now, since we have only 2 values for transaction phase, just use bool.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:22 -07:00
Jiri Pirko
79a62eb22a dsa: use prepare/commit switchdev transaction helpers
The enum is going to disappear, use the helpers instead.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:22 -07:00
Jiri Pirko
9f6467cf22 switchdev: remove "ABORT" transaction phase
No longer used by drivers, as transaction queue with item destructors
takes care of abort phase internally in switchdev code. So kill it.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:22 -07:00
Jiri Pirko
f8db83486e switchdev: move transaction phase enum under transaction structure
Before it disappears completely, move transaction phase enum under
transaction structure and make attr/obj structures a bit cleaner.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:21 -07:00
Jiri Pirko
7ea6eb3f56 switchdev: introduce transaction item queue for attr_set and obj_add
Now, the memory allocation in prepare/commit state is done separatelly
in each driver (rocker). Introduce the similar mechanism in generic
switchdev code, in form of queue. That can be used not only for memory
allocations, but also for different items. Abort item destruction
is handled as well.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:21 -07:00
Jiri Pirko
69f5df491e switchdev: rename "trans" to "trans_ph".
This is temporary, name "trans" will be used for something else and
"trans_ph" will eventually disappear.

Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 22:59:21 -07:00
Pablo Neira Ayuso
c3456026ad Merge tag 'ipvs2-for-v4.4' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next
Simon Horman says:

====================
Second Round of IPVS Updates for v4.4

please consider these bug fixes and extensive clean-ups of IPVS
from Eric Biederman for v4.4.

His excellent description of the changes, which is part of an even larger
set of clean-up work, is as follows:

  I am gradually working my way through the netfilter stack passing struct
  down into the netfilter hooks and from the netfilter hooks and from there
  down into the functions that actually care.  This removes the need for
  netfilter functions to guess how to figure out how to compute which
  network namespace they are in and instead provides a simple and reliable
  method to do so.

  The cleanups stand on their own but this is part of a larger effort to
  have routes with an output device that is not in the current network
  namespace.

  The IPVS code has been a bit more of a challenge than most.  Just passing
  struct net through to where it is needed did not feel clean to me.  The
  practical issue is that the ipvs code in most places actually wants
  struct netns_ipvs and not struct net.

  So as part of this process I have turned the relationship between struct
  net and the structs netns_ipvs, ip_vs_conn_param, ip_vs_conn, and
  ip_vs_service inside out.  I have modified the ipvs functions to take a
  struct netns_ipvs not a struct net.  The net is code with fewer
  conversions from one type of structure to another.  I did wind up adding
  a struct netns_ipvs parameter to quite a few functions that did not have
  it before so I could pass the structure down from the netfilter hooks to
  where it is actually needed to avoid guessing.

  I have broken up the work in a bunch of small patches so there is at
  least a chance and reviewing that each step I took is correct.  The
  series compiles at each step so bisecting it should not be a problem if
  something weird comes up.

  The first two changes in this series are actually bug fixes.  The first
  is a compile fix for a bug in sctp that came in, in the last round of
  ipvs changes merged into nf-next.  The second fixes an older bug where in
  pathological circumstances the wrong network namespace could be used when
  a proc file is written to.

  The rest of the patchset is a bunch of boring changes getting pushing
  struct netns_ipvs (and by extension ipvs->net) where it needs to be.
  Either by replacing struct net pointers or adding new struct netns_ipvs
  pointers.  With a handful of other minor cleanups (like removing
  skb_net).

I have decided include the bug fixes in this pull request. Patch one
relates to a bug that was added to nf-next recently and is thus not
applicable to nf . Patch two could arguably be promoted to a fix for v4.3
and stable though it does not appear to be severe enough to warrant that
course of action; let me know if you would like me to reconsider.
====================

Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2015-09-25 01:38:58 +02:00
Matt Bennett
17a10c9215 ip6_tunnel: Reduce log level in ip6_tnl_err() to debug
Currently error log messages in ip6_tnl_err are printed at 'warn'
level. This is different to other tunnel types which don't print
any messages. These log messages don't provide any information that
couldn't be deduced with networking tools. Also it can be annoying
to have one end of the tunnel go down and have the logs fill with
pointless messages such as "Path to destination invalid or inactive!".

This patch reduces the log level of these messages to 'dbg' level to
bring the visible behaviour into line with other tunnel types.

Signed-off-by: Matt Bennett <matt.bennett@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 16:08:37 -07:00
David S. Miller
deccbe80be Just two small fixes:
* VHT MCS mask array overrun, reported by Dan Carpenter
  * reset CQM history to always get a notification, from Sara Sharon
 -----BEGIN PGP SIGNATURE-----
 
 iQIcBAABCAAGBQJWAWENAAoJEDBSmw7B7bqr7agQAI/IVkuKlzu24XFg6PbaUNcL
 cD+jZplmg3OeeUf/Gtv7KVLAbRlo1X+qIiuquewzGhqAOGSm3KmypQmKQy7Y1gMO
 Nq4uJwdkJpo2aQtQH97jXwWfk40d+BqSXNbnpobvzrMWaMXHO+DT18Icc3iJqO9l
 cQOUTNplDMTpXgjHXjf6296MVP30lbUnoI6BPJGb3LS3BvfgX8KYR4XGlRCQ5R1F
 4gZE0IokY+PQmbZvBzXD3dBGuwnztezX8g25IytgFs+UDyeg8xn/OI+Q3e4HMpqD
 4SeNpD3AD1jxEedpr9trCaLCm8NQdnfU4LDE0y5qaTC2cWPTw1qZs4alNlbD+L3L
 odjiP1aU2X9rWTf78/WcJwn+aufBNGz20M9at5IWkUCi88taSGWzZAviRhhGpuzd
 4c2v6DLcaAqtTTGhhN6CUE4f9+JzeICgDjqSmdR90caMAtOPFjS2ZpbMstKVxO1V
 fjJ4Ys/VRsySrSIji/ryPNUSigeNhEJm9jImObcKhsEv6aqdIbYzdDWekbSVDtvB
 nJQJqlI1RlNn+HO+VXvPDIBuSpOXddVO6142yLQtte4ReL9NbZFU0lx7U5N6JoIk
 QfXbC9eD/LxdRONxLvakblWWbm+FflCaTiEi9dL+zQeAAY3Zgyfi2vo4bHdwi3XL
 GMNNo05q5WSFGkfktb8v
 =4Ui5
 -----END PGP SIGNATURE-----

Merge tag 'mac80211-for-davem-2015-09-22' of git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211

Johannes Berg says:

====================
Just two small fixes:
 * VHT MCS mask array overrun, reported by Dan Carpenter
 * reset CQM history to always get a notification, from Sara Sharon
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 15:36:20 -07:00
Matt Bennett
a46496ce38 ip6_gre: Reduce log level in ip6gre_err() to debug
Currently error log messages in ip6gre_err are printed at 'warn'
level. This is different to most other tunnel types which don't
print any messages. These log messages don't provide any information
that couldn't be deduced with networking tools. Also it can be annoying
to have one end of the tunnel go down and have the logs fill with
pointless messages such as "Path to destination invalid or inactive!".

This patch reduces the log level of these messages to 'dbg' level to
bring the visible behaviour into line with other tunnel types.

Signed-off-by: Matt Bennett <matt.bennett@alliedtelesis.co.nz>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 15:28:43 -07:00
Wilson Kok
41fc014332 fib_rules: fix fib rule dumps across multiple skbs
dump_rules returns skb length and not error.
But when family == AF_UNSPEC, the caller of dump_rules
assumes that it returns an error. Hence, when family == AF_UNSPEC,
we continue trying to dump on -EMSGSIZE errors resulting in
incorrect dump idx carried between skbs belonging to the same dump.
This results in fib rule dump always only dumping rules that fit
into the first skb.

This patch fixes dump_rules to return error so that we exit correctly
and idx is correctly maintained between skbs that are part of the
same dump.

Signed-off-by: Wilson Kok <wkok@cumulusnetworks.com>
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 15:21:54 -07:00
Eric Dumazet
d8ed625044 tcp: factorize sk_txhash init
Neal suggested to move sk_txhash init into tcp_create_openreq_child(),
called both from IPv4 and IPv6.

This opportunity was missed in commit 58d607d3e5 ("tcp: provide
skb->hash to synack packets")

Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 14:52:30 -07:00
WANG Cong
d8aecb1011 net: revert "net_sched: move tp->root allocation into fw_init()"
fw filter uses tp->root==NULL to check if it is the old method,
so it doesn't need allocation at all in this case. This patch
reverts the offending commit and adds some comments for old
method to make it obvious.

Fixes: 33f8b9ecdb ("net_sched: move tp->root allocation into fw_init()")
Reported-by: Akshat Kakkar <akshat.1984@gmail.com>
Cc: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com>
Acked-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 14:33:30 -07:00
Jiri Benc
b194f30c61 lwtunnel: remove source and destination UDP port config option
The UDP tunnel config is asymmetric wrt. to the ports used. The source and
destination ports from one direction of the tunnel are not related to the
ports of the other direction. We need to be able to respond to ARP requests
using the correct ports without involving routing.

As the consequence, UDP ports need to be fixed property of the tunnel
interface and cannot be set per route. Remove the ability to set ports per
route. This is still okay to do, as no kernel has been released with these
attributes yet.

Note that the ability to specify source and destination ports is preserved
for other users of the lwtunnel API which don't use routes for tunnel key
specification (like openvswitch).

If in the future we rework ARP handling to allow port specification, the
attributes can be added back.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 14:31:37 -07:00
Jiri Benc
63d008a4e9 ipv4: send arp replies to the correct tunnel
When using ip lwtunnels, the additional data for xmit (basically, the actual
tunnel to use) are carried in ip_tunnel_info either in dst->lwtstate or in
metadata dst. When replying to ARP requests, we need to send the reply to
the same tunnel the request came from. This means we need to construct
proper metadata dst for ARP replies.

We could perform another route lookup to get a dst entry with the correct
lwtstate. However, this won't always ensure that the outgoing tunnel is the
same as the incoming one, and it won't work anyway for IPv4 duplicate
address detection.

The only thing to do is to "reverse" the ip_tunnel_info.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Acked-by: Thomas Graf <tgraf@suug.ch>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 14:31:36 -07:00
Jiri Benc
38cf595b19 ipv6: remove unused neigh parameter from ndisc functions
Since commit 12fd84f438 ("ipv6: Remove unused neigh argument for
icmp6_dst_alloc() and its callers."), the neigh parameter of ndisc_send_na
and ndisc_send_ns is unused.

CC: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 12:26:08 -07:00
Jiri Benc
92c14d9b5e genetlink: simplify genl_notify
The genl_notify function has too many arguments for no real reason - all
callers use genl_info to get them anyway. Just pass the genl_info down to
genl_notify.

Signed-off-by: Jiri Benc <jbenc@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 12:25:23 -07:00
Herbert Xu
da314c9923 netlink: Replace rhash_portid with bound
On Mon, Sep 21, 2015 at 02:20:22PM -0400, Tejun Heo wrote:
>
> store_release and load_acquire are different from the usual memory
> barriers and can't be paired this way.  You have to pair store_release
> and load_acquire.  Besides, it isn't a particularly good idea to

OK I've decided to drop the acquire/release helpers as they don't
help us at all and simply pessimises the code by using full memory
barriers (on some architectures) where only a write or read barrier
is needed.

> depend on memory barriers embedded in other data structures like the
> above.  Here, especially, rhashtable_insert() would have write barrier
> *before* the entry is hashed not necessarily *after*, which means that
> in the above case, a socket which appears to have set bound to a
> reader might not visible when the reader tries to look up the socket
> on the hashtable.

But you are right we do need an explicit write barrier here to
ensure that the hashing is visible.

> There's no reason to be overly smart here.  This isn't a crazy hot
> path, write barriers tend to be very cheap, store_release more so.
> Please just do smp_store_release() and note what it's paired with.

It's not about being overly smart.  It's about actually understanding
what's going on with the code.  I've seen too many instances of
people simply sprinkling synchronisation primitives around without
any knowledge of what is happening underneath, which is just a recipe
for creating hard-to-debug races.

> > @@ -1539,7 +1546,7 @@ static int netlink_bind(struct socket *sock, struct sockaddr *addr,
> >  		}
> >  	}
> >
> > -	if (!nlk->portid) {
> > +	if (!nlk->bound) {
>
> I don't think you can skip load_acquire here just because this is the
> second deref of the variable.  That doesn't change anything.  Race
> condition could still happen between the first and second tests and
> skipping the second would lead to the same kind of bug.

The reason this one is OK is because we do not use nlk->portid or
try to get nlk from the hash table before we return to user-space.

However, there is a real bug here that none of these acquire/release
helpers discovered.  The two bound tests here used to be a single
one.  Now that they are separate it is entirely possible for another
thread to come in the middle and bind the socket.  So we need to
repeat the portid check in order to maintain consistency.

> > @@ -1587,7 +1594,7 @@ static int netlink_connect(struct socket *sock, struct sockaddr *addr,
> >  	    !netlink_allowed(sock, NL_CFG_F_NONROOT_SEND))
> >  		return -EPERM;
> >
> > -	if (!nlk->portid)
> > +	if (!nlk->bound)
>
> Don't we need load_acquire here too?  Is this path holding a lock
> which makes that unnecessary?

Ditto.

---8<---
The commit 1f770c0a09 ("netlink:
Fix autobind race condition that leads to zero port ID") created
some new races that can occur due to inconcsistencies between the
two port IDs.

Tejun is right that a barrier is unavoidable.  Therefore I am
reverting to the original patch that used a boolean to indicate
that a user netlink socket has been bound.

Barriers have been added where necessary to ensure that a valid
portid and the hashed socket is visible.

I have also changed netlink_insert to only return EBUSY if the
socket is bound to a portid different to the requested one.  This
combined with only reading nlk->bound once in netlink_bind fixes
a race where two threads that bind the socket at the same time
with different port IDs may both succeed.

Fixes: 1f770c0a09 ("netlink: Fix autobind race condition that leads to zero port ID")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Nacked-by: Tejun Heo <tj@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-24 12:07:08 -07:00
Eric W. Biederman
57781c1cee ipvs: Pass ipvs into ip_vs_gather_frags
This will be needed later when the network namespace guessing is
removed from ip_defrag.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman
9cfdd75b7c ipvs: Remove skb_sknet
This function adds no real value and it obscures what the code is doing.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman
7d1f88eca0 ipvs: Pass ipvs not net to ip_vs_protocol_net_(init|cleanup)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman
69f390934b ipvs: Remove net argument from ip_vs_tcp_conn_listen
The argument is unnecessary and in practice confusing,
and has caused the callers to do all manner of silly things.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman
a43d1a6b97 ipvs: Pass ipvs through ip_vs_route_me_harder into sysctl_snat_reroute
This removes the need to use the hack skb_net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman
7b5f689a2c ipvs: Pass ipvs into ip_vs_out_icmp and ip_vs_out_icmp_v6
This removes the need to compute ipvs with the hack "net_ipvs(skb_net(skb))"

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman
6f2bcea991 ipvs: Pass ipvs into ip_vs_in_icmp and ip_vs_in_icmp_v6
With ipvs passed into ip_vs_in_icmp and ip_vs_in_icmp_v6
they no longer need to call the hack that is skb_net.

Additionally ipvs_in_icmp no longer needs to call dev_net(skb->dev)
and can use the ipvs->net instead.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman
6e385bb3ef ipvs: Pass ipvs into ip_vs_in
Derive ipvs from state->net in the callers of ip_vs_in and pass it
into ip_vs_out.  Removing the need to use the hack skb_net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:43 +09:00
Eric W. Biederman
1b75097dd7 ipvs: Pass ipvs into ip_vs_out
Derive ipvs from state->net in the callers of ip_vs_out and pass it
into ip_vs_out.  Removing the need to use the hack skb_net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman
2300f0451e ipvs: Pass ipvs not net into sysctl_nat_icmp_send
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman
51efbcbbb2 ipvs: Simplify ipvs and net access in ip_vs_leave
Stop using the hack skb_net(skb) to compute the network namespace.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman
5703294874 ipvs: Wrap sysctl_cache_bypass and remove ifdefs in ip_vs_leave
With sysctl_cache_bypass now a compile time constant the compiler can
figue out that it can elimiate all of the code that depends on
sysctl_cache_bypass being true.

Also remove the duplicate computation of net previously necessitated
by #ifdef CONFIG_SYSCTL

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman
7c08d78e6f ipvs: Better derivation of ipvs in ip_vs_in_stats and ip_vs_out_stats
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman
20868a40d0 ipvs: Pass ipvs into ensure_mtu_is adequate
This allows two different ways for computing/guessing net to be
removed from ensure_mtu_is_adequate.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman
f5745f8ae6 ipvs: Pass ipvs into __ip_vs_get_out_rt_v6
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:42 +09:00
Eric W. Biederman
ecfe87b884 ipvs: Pass ipvs into __ip_vs_get_out_rt
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman
361c3f5293 ipvs: Better derivation of ipvs in ip_vs_tunnel_xmit
Don't use "net_ipvs(skb_net(skb))" as skb_net is a bad hack.  Instead
use cp->ipvs and ipvs->net for the net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman
d8f44c335a ipvs: Pass ipvs into .conn_schedule and ip_vs_try_to_schedule
This moves the hack "net_ipvs(skb_net(skb))" up one level where it
will be easier to remove.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman
2f3edc6a5b ipvs: Pass ipvs not net into ip_vs_conn_net_init and ip_vs_conn_net_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman
d889717aaf ipvs: Pass ipvs not net into ip_vs_conn_net_flush
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman
754b81a357 ipvs: Pass ipvs not net to ip_vs_conn_hashkey
Use the address of struct netns_ipvs in the hash not the address of
struct net.  Both addresses are equally valid candidates and by using
the address of struct netns_ipvs there becomes no need deal with
struct net in this part of the code.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman
0cf705c8c2 ipvs: Pass ipvs into conn_out_get
Move the hack of relying on "net_ipvs(skb_net(skb))" to derive the
ipvs up a layer.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman
ab16197642 ipvs: Pass ipvs into .conn_in_get and ip_vs_conn_in_get_proto
Stop relying on "net_ipvs(skb_net(skb))" to derive the ipvs as
skb_net is a hack.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:41 +09:00
Eric W. Biederman
f5099dd4d9 ipvs: Pass ipvs into ip_vs_conn_fill_param_proto
Move the ugly hack net_ipvs(skb_net(skb)) up a layer in the call stack
so it is easier to remove.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman
1281a9c2d1 ipvs: Pass ipvs not net into init_netns and exit_netns
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman
c70bd6800a ipvs: Pass ipvs not net into [un]register_ip_vs_proto_netns
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman
b5dd212cc1 ipvs: Pass ipvs not net into ip_vs_app_net_init and ip_vs_app_net_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman
09858708e6 ipvs: Pass ipvs not net into ip_vs_app_inc_release
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman
9f8128a56e ipvs: Pass ipvs not net to register_ip_vs_app and unregister_ip_vs_app
Also move the tests for net_ipvs being NULL into __ip_vs_ftp_init
and __ip_vs_ftp_exit.  The only places where they possibly make
sense.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman
3250dc9c52 ipvs: Pass ipvs not net to register_ip_vs_app_inc
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman
a080ce38a0 ipvs: Pass ipvs not net into ip_vs_app_inc_new
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:40 +09:00
Eric W. Biederman
19648918fb ipvs: Pass ipvs not net into register_app and unregister_app
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman
a4dd0360c6 ipvs: Pass ipvs not net to ip_vs_estimator_net_init and ip_vs_estimator_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman
70a131a2c8 ipvs: Pass ipvs not net to estimation_timer
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman
3d99376689 ipvs: Pass ipvs not net into ip_vs_control_net_(init|cleanup)
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman
8b8237a581 ipvs: Pass ipvs not net to ip_vs_control_net_(init|cleanup)_sysctl
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman
423b55954d ipvs: Pass ipvs not net to ip_vs_random_drop_entry
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman
0f34d54bf4 ipvs: Pass ipvs not net to ip_vs_start_estimator aned ip_vs_stop_estimator
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman
cacd1e60f1 ipvs: Pass ipvs not net to ip_vs_genl_set_config
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:39 +09:00
Eric W. Biederman
ebea1f7c0b ipvs: Pass ipvs not net to ip_vs_sync_net_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman
802cb43703 ipvs: Pass ipvs not net to ip_vs_sync_net_init
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman
1fc12004d2 ipvs: Pass ipvs not net to ip_vs_proc_sync_conn
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman
4f30665bac ipvs: Pass ipvs not net to ip_vs_proc_conn
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman
b61a8c1a40 ipvs: Pass ipvs not net to ip_vs_sync_conn
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman
72e9481e28 ipvs: Pass ipvs not net to ip_vs_sync_conn_v0
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman
7d537f3ab7 ipvs: Pass ipvs not net to ip_vs_process_message
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman
37b68e6ded ipvs: Store ipvs not net in struct ip_vs_sync_thread_data
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data.  So store a pointer to
struct netns_ipvs.

Update the accesses of tinfo->net to access tinfo->ipvs->net instead.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:38 +09:00
Eric W. Biederman
fd124e2f8b ipvs: Pass ipvs not net to make_receive_sock
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman
68c76b6aa0 ipvs: Pass ipvs not net to make_send_sock
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman
b3cf3cbfb5 ipvs: Pass ipvs not net to stop_sync_thread
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman
6ac121d710 ipvs: Pass ipvs not net to start_sync_thread
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman
df04ffb766 ipvs: Pass ipvs not net to ip_vs_genl_del_daemon
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman
d8443c5f2b ipvs: Pass ipvs not net to ip_vs_genl_new_daemon
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman
34c2f5146c ipvs: Pass ipvs not net to ip_vs_genl_find_service
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman
613fb830b7 ipvs: Pass ipvs not net to ip_vs_genl_parse_service
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:37 +09:00
Eric W. Biederman
af5403419d ipvs: Pass ipvs not net to __ip_vs_get_timeouts
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:36 +09:00
Eric W. Biederman
08fff4c357 ipvs: Pass ipvs not net to __ip_vs_get_dest_entries
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:36 +09:00
Eric W. Biederman
b2876b7773 ipvs: Pass ipvs not net to __ip_vs_get_service_entries
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:36 +09:00
Eric W. Biederman
f1faa1e749 ipvs: Pass ipvs not net to ip_vs_set_timeout
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman
18d6ade63c ipvs: Pass ipvs not net to ip_vs_proto_data_get
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman
a47b430080 ipvs: Cache ipvs in ip_vs_in_icmp and ip_vs_in_icmp_v6
Storte the value of net_ipvs in a variable named ipvs so that when
there are more users struct netns_ipvs in ip_vs_in_cmp and
ip_vs_in_icmp_v6 they won't need to compute the value again.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman
c60856c687 ipvs: Pass ipvs not net to ip_vs_zero_all
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman
56d2169b77 ipvs: Pass ipvs not net to ip_vs_service_net_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman
ef7c599d91 ipvs: Pass ipvs not net to ip_vs_flush
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman
5060bd8307 ipvs: Pass ipvs not net to ip_vs_add_service
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman
cd58278bd4 ipvs: Cache ipvs in ip_vs_genl_set_cmd
Compute ipvs early in ip_vs_genl_set_cmd and use the cached value to
access ipvs->sync_state.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:35 +09:00
Eric W. Biederman
8e743f1b45 ipvs: Pass ipvs not net to ip_vs_dest_trash_expire
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman
79ac82e0aa ipvs: Pass ipvs not net to __ip_vs_del_dest
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman
6c0e14f507 ipvs: Pass ipvs not net to ip_vs_trash_cleanup
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman
dc2add6f2e ipvs: Pass ipvs not net to ip_vs_find_dest
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman
48aed1b029 ipvs: Pass ipvs not net to ip_vs_has_real_service
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman
0a4fd6ce92 ipvs: Pass ipvs not net to ip_vs_service_find
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman
bb2e2a8c95 ipvs: Pass ipvs not net to __ip_vs_service_find
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman
ba61f39034 ipvs: Pass ipvs not net to ip_vs_svc_hashkey
Use the address of ipvs not the address of net when computing the
hash value.  This removes an unncessary dependency on struct net.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:34 +09:00
Eric W. Biederman
1ed8b94780 ipvs: Pass ipvs not net to __ip_vs_svc_fwm_find
ipvs is what the code actually wants to use.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman
f6510b245e ipvs: Pass ipvs not net to ip_vs_svc_fwm_hashkey
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman
3109d2f2d1 ipvs: Store ipvs not net in struct ip_vs_service
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data.  So store a pointer to
struct netns_ipvs.

Update the accesses of param->net to access param->ipvs->net instead.

In functions where we are searching for an svc and filtering by net
filter by ipvs instead.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman
19913dec1b ipvs: Pass ipvs not net to ip_vs_fill_conn
ipvs is what is actually desired so change the parameter and the modify
the callers to pass struct netns_ipvs.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman
e64e2b460c ipvs: Store ipvs not net in struct ip_vs_conn_param
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data.  So store a pointer to
struct netns_ipvs.

Update the accesses of param->net to access param->ipvs->net instead.

When lookup up struct ip_vs_conn in a hash table replace comparisons
of cp->net with comparisons of cp->ipvs which is possible
now that ipvs is present in ip_vs_conn_param.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman
58dbc6f260 ipvs: Store ipvs not net in struct ip_vs_conn
In practice struct netns_ipvs is as meaningful as struct net and more
useful as it holds the ipvs specific data.  So store a pointer to
struct netns_ipvs.

Update the accesses of conn->net to access conn->ipvs->net instead.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman
d484fc3812 ipvs: Use state->net in the ipvs forward functions
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman
717e917ddf ipvs: Don't use current in proc_do_defense_mode
Instead store ipvs in extra2 so that proc_do_defense_mode can easily
find the ipvs that it's value is associated with.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:33 +09:00
Eric W. Biederman
1daea8ed16 ipvs: Hoist computation of ipvs earlier in sctp_conn_schedule
The addition of sysctl_sloppy_sctp in sctp_conn_schedule resulted
in a use of ipvs before it was computed.  Hoist the computation of
ipvs earlier to avoid this problem.

Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>
Acked-by: Julian Anastasov <ja@ssi.bg>
Signed-off-by: Simon Horman <horms@verge.net.au>
2015-09-24 09:34:32 +09:00
Siva Mannem
dcd45e0649 bridge: don't age externally added FDB entries
Signed-off-by: Siva Mannem <siva.mannem.lnx@gmail.com>
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Vivien Didelot <vivien.didelot@savoirfairelinux.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Acked-by: Premkumar Jonnala <pjonnala@broadcom.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:35:58 -07:00
Scott Feldman
a79e88d9fb bridge: define some min/max/default ageing time constants
Signed-off-by: Scott Feldman <sfeldma@gmail.com>
Acked-by: Jiri Pirko <jiri@resnulli.us>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:35:58 -07:00
David Woodhouse
d3869efe7a Fix AF_PACKET ABI breakage in 4.2
Commit 7d82410950 ("virtio: add explicit big-endian support to memory
accessors") accidentally changed the virtio_net header used by
AF_PACKET with PACKET_VNET_HDR from host-endian to big-endian.

Since virtio_legacy_is_little_endian() is a very long identifier,
define a vio_le macro and use that throughout the code instead of the
hard-coded 'false' for little-endian.

This restores the ABI to match 4.1 and earlier kernels, and makes my
test program work again.

Signed-off-by: David Woodhouse <David.Woodhouse@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:33:55 -07:00
Neil Horman
2d8bff1269 netpoll: Close race condition between poll_one_napi and napi_disable
Drivers might call napi_disable while not holding the napi instance poll_lock.
In those instances, its possible for a race condition to exist between
poll_one_napi and napi_disable.  That is to say, poll_one_napi only tests the
NAPI_STATE_SCHED bit to see if there is work to do during a poll, and as such
the following may happen:

CPU0				CPU1
ndo_tx_timeout			napi_poll_dev
 napi_disable			 poll_one_napi
  test_and_set_bit (ret 0)
				  test_bit (ret 1)
   reset adapter		   napi_poll_routine

If the adapter gets a tx timeout without a napi instance scheduled, its possible
for the adapter to think it has exclusive access to the hardware  (as the napi
instance is now scheduled via the napi_disable call), while the netpoll code
thinks there is simply work to do.  The result is parallel hardware access
leading to corrupt data structures in the driver, and a crash.

Additionaly, there is another, more critical race between netpoll and
napi_disable.  The disabled napi state is actually identical to the scheduled
state for a given napi instance.  The implication being that, if a napi instance
is disabled, a netconsole instance would see the napi state of the device as
having been scheduled, and poll it, likely while the driver was dong something
requiring exclusive access.  In the case above, its fairly clear that not having
the rings in a state ready to be polled will cause any number of crashes.

The fix should be pretty easy.  netpoll uses its own bit to indicate that that
the napi instance is in a state of being serviced by netpoll (NAPI_STATE_NPSVC).
We can just gate disabling on that bit as well as the sched bit.  That should
prevent netpoll from conducting a napi poll if we convert its set bit to a
test_and_set_bit operation to provide mutual exclusion

Change notes:
V2)
	Remove a trailing whtiespace
	Resubmit with proper subject prefix

V3)
	Clean up spacing nits

Signed-off-by: Neil Horman <nhorman@tuxdriver.com>
CC: "David S. Miller" <davem@davemloft.net>
CC: jmaxwell@redhat.com
Tested-by: jmaxwell@redhat.com
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:32:50 -07:00
Daniel Borkmann
5cf8ca0e47 cls_bpf: further limit exec opcodes subset
Jamal suggested to further limit the currently allowed subset of opcodes
that may be used by a direct action return code as the intention is not
to replace the full action engine, but rather to have a minimal set that
can be used in the fast-path on things like ingress for some features
that cls_bpf supports.

Classifiers can, of course, still be chained together that have direct
action mode with those that have a full exec pass. For more complex
scenarios that go beyond this minimal set here, the full tcf_exts_exec()
path must be used.

Suggested-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:29:02 -07:00
Daniel Borkmann
ef146fa40c cls_bpf: make binding to classid optional
The binding to a particular classid was so far always mandatory for
cls_bpf, but it doesn't need to be. Therefore, lift this restriction
as similarly done in other classifiers.

Only a couple of qdiscs make use of class from the tcf_result, others
don't strictly care, so let the user choose his needs (those that read
out class can handle situations where it could be NULL).

An explicit check for tcf_unbind_filter() is also not needed here, as
the previous r->class was 0, so the xchg() will return that and
therefore a callback to the qdisc's unbind_tcf() is skipped.

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:29:02 -07:00
Daniel Borkmann
bf007d1c75 cls_bpf: also dump TCA_BPF_FLAGS
In commit 43388da42a49 ("cls_bpf: introduce integrated actions") we
have added TCA_BPF_FLAGS. We can also retrieve this information from
the prog, dump it back to user space as well. It's useful in tc when
displaying/dumping filter info.

Also, remove tp from cls_bpf_prog_from_efd(), came in as a conflict
from a rebase and it's unused here (later work may add it along with
a real user).

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:29:01 -07:00
Daniel Borkmann
927ab1d764 sched, bpf: let stack handle !IFF_UP devs on bpf_clone_redirect
Similarly as already the case in bpf_redirect()/skb_do_redirect()
pair, let the stack deal with devs that are !IFF_UP.

dev_forward_skb() as well as dev_queue_xmit() will free the skb
and increment drop counter internally in such cases, so we can
spare the condition in bpf_clone_redirect().

Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Acked-by: Alexei Starovoitov <ast@plumgrid.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:25:51 -07:00
Eric Dumazet
675ee231d9 tcp: add proper TS val into RST packets
RST packets sent on behalf of TCP connections with TS option (RFC 7323
TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.

A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100
ecr 0], length 0
B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss
1460,nop,nop,TS val 7264344 ecr 100], length 0
A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr
7264344], length 0

B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0
ecr 110], length 0

We need to call skb_mstamp_get() to get proper TS val,
derived from skb->skb_mstamp

Note that RFC 1323 was advocating to not send TS option in RST segment,
but RFC 7323 recommends the opposite :

  Once TSopt has been successfully negotiated, that is both <SYN> and
  <SYN,ACK> contain TSopt, the TSopt MUST be sent in every non-<RST>
  segment for the duration of the connection, and SHOULD be sent in an
  <RST> segment (see Section 5.2 for details)

Note this RFC recommends to send TS val = 0, but we believe it is
premature : We do not know if all TCP stacks are properly
handling the receive side :

   When an <RST> segment is
   received, it MUST NOT be subjected to the PAWS check by verifying an
   acceptable value in SEG.TSval, and information from the Timestamps
   option MUST NOT be used to update connection state information.
   SEG.TSecr MAY be used to provide stricter <RST> acceptance checks.

In 5 years, if/when all TCP stack are RFC 7323 ready, we might consider
to decide to send TS val = 0, if it buys something.

Fixes: 7faee5c0d5 ("tcp: remove TCP_SKB_CB(skb)->when")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:24:07 -07:00
Tom Herbert
644d0e6569 ipv6 Use get_hash_from_flowi6 for rt6 hash
In rt6_info_hash_nhsfn replace the custom hashing over flowi6 that is
using xor with a call to common function get_hash_from_flowi6.

Signed-off-by: Tom Herbert <tom@herbertland.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-23 14:21:09 -07:00
Neil Armstrong
fbd03513bf net: dsa: Fix Marvell Egress Trailer check
The Marvell Egress rx trailer check must be fixed to
correctly detect bad bits in the third byte of the
Eggress trailer as described in the Table 28 of the
88E6060 datasheet.
The current code incorrectly omits to check the third
byte and checks the fourth byte twice.

Signed-off-by: Neil Armstrong <narmstrong@baylibre.com>
Acked-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-22 17:37:03 -07:00
Jesse Gross
ae5f2fb1d5 openvswitch: Zero flows on allocation.
When support for megaflows was introduced, OVS needed to start
installing flows with a mask applied to them. Since masking is an
expensive operation, OVS also had an optimization that would only
take the parts of the flow keys that were covered by a non-zero
mask. The values stored in the remaining pieces should not matter
because they are masked out.

While this works fine for the purposes of matching (which must always
look at the mask), serialization to netlink can be problematic. Since
the flow and the mask are serialized separately, the uninitialized
portions of the flow can be encoded with whatever values happen to be
present.

In terms of functionality, this has little effect since these fields
will be masked out by definition. However, it leaks kernel memory to
userspace, which is a potential security vulnerability. It is also
possible that other code paths could look at the masked key and get
uninitialized data, although this does not currently appear to be an
issue in practice.

This removes the mask optimization for flows that are being installed.
This was always intended to be the case as the mask optimizations were
really targetting per-packet flow operations.

Fixes: 03f0d916 ("openvswitch: Mega flow implementation")
Signed-off-by: Jesse Gross <jesse@nicira.com>
Acked-by: Pravin B Shelar <pshelar@nicira.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-22 17:33:41 -07:00
Andrea Arcangeli
ac5be6b47e userfaultfd: revert "userfaultfd: waitqueue: add nr wake parameter to __wake_up_locked_key"
This reverts commit 51360155ec and adapts
fs/userfaultfd.c to use the old version of that function.

It didn't look robust to call __wake_up_common with "nr == 1" when we
absolutely require wakeall semantics, but we've full control of what we
insert in the two waitqueue heads of the blocked userfaults.  No
exclusive waitqueue risks to be inserted into those two waitqueue heads
so we can as well stick to "nr == 1" of the old code and we can rely
purely on the fact no waitqueue inserted in one of the two waitqueue
heads we must enforce as wakeall, has wait->flags WQ_FLAG_EXCLUSIVE set.

Signed-off-by: Andrea Arcangeli <aarcange@redhat.com>
Cc: Dr. David Alan Gilbert <dgilbert@redhat.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Shuah Khan <shuahkh@osg.samsung.com>
Cc: Thierry Reding <treding@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-09-22 15:09:53 -07:00
David S. Miller
99cb99aa05 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next
Pablo Neira Ayuso says:

====================
Netfilter/IPVS updates for net-next

The following patchset contains Netfilter/IPVS updates for your net-next tree
in this 4.4 development cycle, they are:

1) Schedule ICMP traffic to IPVS instances, this introduces a new schedule_icmp
   proc knob to enable/disable it. By default is off to retain the old
   behaviour. Patchset from Alex Gartrell.

I'm also including what Alex originally said for the record:

"The configuration of ipvs at Facebook is relatively straightforward.  All
ipvs instances bgp advertise a set of VIPs and the network prefers the
nearest one or uses ECMP in the event of a tie.  For the uninitiated, ECMP
deterministically and statelessly load balances by hashing the packet
(usually a 5-tuple of protocol, saddr, daddr, sport, and dport) and using
that number as an index (basic hash table type logic).

The problem is that ICMP packets (which contain really important
information like whether or not an MTU has been exceeded) will get a
different hash value and may end up at a different ipvs instance.  With no
information about where to route these packets, they are dropped, creating
ICMP black holes and breaking Path MTU discovery.  Suddenly, my mom's
pictures can't load and I'm fielding midday calls that I want nothing to do
with.

To address this, this patch set introduces the ability to schedule icmp
packets which is gated by a sysctl net.ipv4.vs.schedule_icmp.  If set to 0,
the old behavior is maintained -- otherwise ICMP packets are scheduled."

2) Add another proc entry to ignore tunneled packets to avoid routing loops
   from IPVS, also from Alex.

3) Fifteen patches from Eric Biederman to:

* Stop passing nf_hook_ops as parameter to the hook and use the state hook
  object instead all around the netfilter code, so only the private data
  pointer is passed to the registered hook function.

* Now that we've got state->net, propagate the netns pointer to netfilter hook
  clients to avoid its computation over and over again. A good example of how
  this has been simplified is the former TEE target (now nf_dup infrastructure)
  since it has killed the ugly pick_net() function.

There's another round of netns updates from Eric Biederman making the line. To
avoid the patchbomb again to almost all the networking mailing list (that is 84
patches) I'd suggest we send you a pull request with no patches or let me know
if you prefer a better way.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-22 13:11:43 -07:00
Sara Sharon
babc305e21 mac80211: reset CQM history upon reconfiguration
The current behavior of notifying CQM events is inconsistent:
Upon first configuration there is a cqm event with the current
status according to threshold configured, regardless of signal
stability.
When there is reconfiguration no event is sent unless there is
a significant change to the signal level according to the new
configuration.

Since the current reconfiguration behavior might cause missing
CQM events in case the current signal did not change but is on
the other side of the new threshold, fix that by resetting the
stored signal level upon reconfiguration.

Signed-off-by: Sara Sharon <sara.sharon@intel.com>
Signed-off-by: Luca Coelho <luciano.coelho@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:22:50 +02:00
Helmut Schaa
4dc792b8f0 mac80211: Split sending tx'ed frames to monitor interfaces into its own function
This allows ieee80211_tx_monitor to be used directly for sending 802.11 frames
to all monitor interfaces.

Signed-off-by: Helmut Schaa <helmut.schaa@googlemail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:28 +02:00
Rafał Miłecki
d55d0d598e nl80211: put current TX power in interface info
Many drivers implement reading current TX power (using either cfg80211
or ieee80211 op) but userspace can't get it using nl80211. Right now the
only way to access it is to call some wext ioctl.
Let's put TX power in interface info reply (callback is wdev specific)
just like we do with current channel.
To be consistent (e.g. NL80211_CMD_SET_WIPHY) let's use mBm as na unit.

Signed-off-by: Rafał Miłecki <zajec5@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:27 +02:00
Johannes Berg
338c17ae31 mac80211: use DECLARE_EWMA for ave_beacon_signal
It doesn't seem problematic to change the weight for the average
beacon signal from 3 to 4, so use DECLARE_EWMA. This also makes
the code easier to maintain since bugs like the one fixed in the
previous patch can't happen as easily.

With a fix from Avraham Stern to invert the sign since EMWA uses
unsigned values only.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:26 +02:00
Johannes Berg
8ec6d97871 mac80211: fix driver RSSI event calculations
The ifmgd->ave_beacon_signal value cannot be taken as is for
comparisons, it must be divided by since it's represented
like that for better accuracy of the EWMA calculations. This
would lead to invalid driver RSSI events. Fix the used value.

Fixes: 615f7b9bb1 ("mac80211: add driver RSSI threshold events")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:26 +02:00
Johannes Berg
8e0d7fe07c mac80211: remove last_beacon/ave_beacon debugfs files
These file aren't really useful:
 - if per beacon data is required then you need to use
   radiotap or similar anyway, debugfs won't help much
 - average beacon signal is reported in station info in
   nl80211 and can be looked up with iw

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:25 +02:00
Bob Copeland
c85fb53c4f mac80211: implement VHT support for mesh
Implement the basics required for supporting very high throughput
with mesh: include VHT information elements in beacons, probe
responses, and peering action frames, and check for compatible VHT
configurations when peering.

Signed-off-by: Bob Copeland <me@bobcopeland.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:25 +02:00
Chun-Yeow Yeoh
f020ae40b0 mac80211: zero center freq segment 2 in VHT oper IE
Clear the Channel Center Frequency Segment 2 in VHT operation
IEs to avoid sending non-zero values if the SKB wasn't zeroed
before adding the VHT operation IE.

Signed-off-by: Chun-Yeow Yeoh <yeohchunyeow@gmail.com>
[change commit message a bit - not necessarily just mesh related]
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:24 +02:00
Emmanuel Grumbach
99e7ca44bb mac80211: allow the driver to advertise A-MSDU within A-MPDU Rx support
Drivers may be interested in receiving A-MSDU within A-MDPU.
Not all the devices may be able to do so, make it configurable.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:24 +02:00
Johannes Berg
46cad4b7a1 mac80211: remove direct probe step before authentication
The direct probe step before authentication was done mostly for
two reasons:
 1) the BSS data could be stale
 2) the beacon might not have included all IEs

The concern (1) doesn't really seem to be relevant any more as
we time out BSS information after about 30 seconds, and in fact
the original patch only did the direct probe if the data was
older than the BSS timeout to begin with. This condition got
(likely inadvertedly) removed later though.

Analysing this in more detail shows that since we mostly use
data from the association response, the only real reason for
needing the probe response was that the code validates the WMM
parameters, and those are optional in beacons. As the previous
patches removed that behaviour, we can now remove the direct
probe step entirely.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:23 +02:00
Emmanuel Grumbach
e3abc8ff0f mac80211: allow to transmit A-MSDU within A-MPDU
Advertise the capability to send A-MSDU within A-MPDU
in the AddBA request sent by mac80211. Let the driver
know about the peer's capabilities.

Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:23 +02:00
Andrei Otcheretianski
1b09b5568e mac80211: introduce per vif frame registration API
Currently the cfg80211's frame registration api receives wdev, however
mac80211 assumes per device filter configuration and ignores wdev.
Per device filtering is too wasteful, especially for multi-channel
devices.
Introduce new per vif frame registration API and use it for probe
request registrations in ieee80211_mgmt_frame_register()
Also call directly to ieee80211_configure_filter instead of using a work
since it is now allowed to sleep in ieee80211_mgmt_frame_register.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:22 +02:00
Johannes Berg
7bdbe400d1 nl80211: support vendor dumpit commands
In order to transfer many items in vendor commands, support the
dumpit netlink method for them.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:22 +02:00
Arik Nemtsov
dd55ab59b6 mac80211: TDLS: check reg with IR-relax on chandef upgrade
When checking if a TDLS chandef can be upgraded, IR-relaxation can be
taken into account to allow more channels.

Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:21 +02:00
Arik Nemtsov
82c0cc90d6 mac80211: debugfs: add file to disallow TDLS wider-bw
Sometimes we are interested in testing TDLS performance in a specific
width setting. Add the ability to disable the wider-band feature, thereby
allowing the TDLS channel width to be controlled by the BSS width.

Signed-off-by: Arik Nemtsov <arikx.nemtsov@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:21 +02:00
Andrei Otcheretianski
fc58c47ef1 mac80211: process skb_queue while scanning in HW
Queued frames aren't processed during scan, which results in an inability
to complete the BA session establishment until the scan ends. Since we
can't tx frames until the BA agreement setup is complete, it might result
in a very large latency during scan.
Fix this by allowing to process queued skbs while scanning in HW. This
should be ok since the devices which support hw scan should be able
to handle tx/rx while scanning.
During SW scan, mac80211 drops any txed frames besides probes and NDPs,
so it is still needed to delay processing of the queued frames till the
SW scan is done.

Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com>
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:20 +02:00
Johannes Berg
8de1c63ba1 wireless: make __freq_reg_info static
As pointed out by sparse, this symbol should be static, make it so.

Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:21:20 +02:00
Johannes Berg
2df1b131b5 mac80211: fix VHT MCS mask array overrun
The HT MCS mask has 9 bytes, the VHT one only has 8 streams.
Split the loops to handle this correctly.

Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
2015-09-22 15:19:48 +02:00
Eric Dumazet
29c6852602 inet: fix races in reqsk_queue_hash_req()
Before allowing lockless LISTEN processing, we need to make
sure to arm the SYN_RECV timer before the req socket is visible
in hash tables.

Also, req->rsk_hash should be written before we set rsk_refcnt
to a non zero value.

Fixes: fa76ce7328 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Ying Cai <ycai@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:32:29 -07:00
Eric Dumazet
ed2e923945 tcp/dccp: fix timewait races in timer handling
When creating a timewait socket, we need to arm the timer before
allowing other cpus to find it. The signal allowing cpus to find
the socket is setting tw_refcnt to non zero value.

As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to
call inet_twsk_schedule() first.

This also means we need to remove tw_refcnt changes from
inet_twsk_schedule() and let the caller handle it.

Note that because we use mod_timer_pinned(), we have the guarantee
the timer wont expire before we set tw_refcnt as we run in BH context.

To make things more readable I introduced inet_twsk_reschedule() helper.

When rearming the timer, we can use mod_timer_pending() to make sure
we do not rearm a canceled timer.

Note: This bug can possibly trigger if packets of a flow can hit
multiple cpus. This does not normally happen, unless flow steering
is broken somehow. This explains this bug was spotted ~5 months after
its introduction.

A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(),
but will be provided in a separate patch for proper tracking.

Fixes: 789f558cfb ("tcp/dccp: get rid of central timewait timer")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Ying Cai <ycai@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:32:29 -07:00
Yuchung Cheng
f9b9958229 tcp: send loss probe after 1s if no RTT available
This patch makes TLP to use 1 sec timer by default when RTT is
not available due to SYN/ACK retransmission or SYN cookies.

Prior to this change, the lack of RTT prevents TLP so the first
data packets sent can only be recovered by fast recovery or RTO.
If the fast recovery fails to trigger the RTO is 3 second when
SYN/ACK is retransmitted. With this patch we can trigger fast
recovery in 1sec instead.

Note that we need to check Fast Open more properly. A Fast Open
connection could be (accepted then) closed before it receives
the final ACK of 3WHS so the state is FIN_WAIT_1. Without the
new check, TLP will retransmit FIN instead of SYN/ACK.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:19:01 -07:00
Yuchung Cheng
0f1c28ae74 tcp: usec resolution SYN/ACK RTT
Currently SYN/ACK RTT is measured in jiffies. For LAN the SYN/ACK
RTT is often measured as 0ms or sometimes 1ms, which would affect
RTT estimation and min RTT samping used by some congestion control.

This patch improves SYN/ACK RTT to be usec resolution if platform
supports it. While the timestamping of SYN/ACK is done in request
sock, the RTT measurement is carefully arranged to avoid storing
another u64 timestamp in tcp_sock.

For regular handshake w/o SYNACK retransmission, the RTT is sampled
right after the child socket is created and right before the request
sock is released (tcp_check_req() in tcp_minisocks.c)

For Fast Open the child socket is already created when SYN/ACK was
sent, the RTT is sampled in tcp_rcv_state_process() after processing
the final ACK an right before the request socket is released.

If the SYN/ACK was retransmistted or SYN-cookie was used, we rely
on TCP timestamps to measure the RTT. The sample is taken at the
same place in tcp_rcv_state_process() after the timestamp values
are validated in tcp_validate_incoming(). Note that we do not store
TS echo value in request_sock for SYN-cookies, because the value
is already stored in tp->rx_opt used by tcp_ack_update_rtt().

One side benefit is that the RTT measurement now happens before
initializing congestion control (of the passive side). Therefore
the congestion control can use the SYN/ACK RTT.

Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:19:01 -07:00
Ursula Braun
91e60eb60b s390/iucv: do not use arrays as argument
The iucv code uses arrays as arguments. Even though this does not
really cause a problem, it could be misleading, since the compiler
turns array arguments into just a pointer argument. To be more
precise this patch changes the array arguments into pointers.

Signed-off-by: Ursula Braun <ursula.braun@de.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:03:04 -07:00
David S. Miller
5dcd246107 Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
Johan Hedberg says:

====================
pull request: bluetooth-next 2015-09-18

Here's the first bluetooth-next pull request for the 4.4 kernel:

 - ieee802154 cleanups & fixes
 - debugfs support for the at86rf230 driver
 - Support for quirky (seemingly counterfeit) CSR Bluetooth controllers
 - Power management and device config improvements for Intel controllers
 - Fix for devices with incorrect advertising data length
 - Fix for closing HCI user channel socket

Please let me know if there are any issues pulling. Thanks.
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-21 16:00:44 -07:00
Herbert Xu
1f770c0a09 netlink: Fix autobind race condition that leads to zero port ID
The commit c0bb07df7d ("netlink:
Reset portid after netlink_insert failure") introduced a race
condition where if two threads try to autobind the same socket
one of them may end up with a zero port ID.  This led to kernel
deadlocks that were observed by multiple people.

This patch reverts that commit and instead fixes it by introducing
a separte rhash_portid variable so that the real portid is only set
after the socket has been successfully hashed.

Fixes: c0bb07df7d ("netlink: Reset portid after netlink_insert failure")
Reported-by: Tejun Heo <tj@kernel.org>
Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 22:55:31 -07:00
Nicolas Dichtel
bc22a0e2ea iptunnel: make rx/tx bytes counters consistent
This was already done a long time ago in
commit 64194c31a0 ("inet: Make tunnel RX/TX byte counters more consistent")
but tx path was broken (at least since 3.10).

Before the patch the gre header was included on tx.

After the patch:
$ ping -c1 192.168.0.121 ; ip -s l ls dev gre1
PING 192.168.0.121 (192.168.0.121) 56(84) bytes of data.
64 bytes from 192.168.0.121: icmp_req=1 ttl=64 time=2.95 ms

--- 192.168.0.121 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.955/2.955/2.955/0.000 ms
7: gre1@NONE: <POINTOPOINT,NOARP,UP,LOWER_UP> mtu 1468 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/gre 10.16.0.249 peer 10.16.0.121
    RX: bytes  packets  errors  dropped overrun mcast
    84         1        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    84         1        0       0       0       0

Reported-by: Julien Meunier <julien.meunier@6wind.com>
Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 22:36:22 -07:00
David S. Miller
ac81374493 Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf
Pablo Neira Ayuso says:

====================
Netfilter fixes for net

The following patch contains Netfilter fixes for your net tree, they are:

1) nf_log_unregister() should only set to NULL the logger that is being
   unregistered, instead of everything else. Patch from Florian Westphal.

2) Fix a crash when accessing physoutdev from PREROUTING in br_netfilter.
   This is partially reverting the patch to shrink nf_bridge_info to 32 bytes.
   Also from Florian.

3) Use existing match/target extensions in the internal nft_compat extension
   lists when the extension is family unspecific (ie. NFPROTO_UNSPEC).

4) Wait for rcu grace period before leaving nf_log_unregister().
====================

Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 22:32:20 -07:00
Erik Hugne
4e3ae00100 tipc: reinitialize pointer after skb linearize
The msg pointer into header may change after skb linearization.
We must reinitialize it after calling skb_linearize to prevent
operating on a freed or invalid pointer.

Signed-off-by: Erik Hugne <erik.hugne@ericsson.com>
Reported-by: Tamás Végh <tamas.vegh@ericsson.com>
Acked-by: Ying Xue <ying.xue@windriver.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
2015-09-20 22:31:20 -07:00